A method of searching for similar code sequences in executable binary files using a featureless approach
Yumaganov A.S., Myasnikov V.V.

 

Samara National Research University, Samara, Russia

Full text of the article: in Russian.


Abstract:
The work addresses the problem of searching for similar code sequences in executable binary files. The proposed method partitions processor instructions into functional groups and forms a primary description of each function from the positions of commands in its body. An intermediate description is then generated by comparing the function with functions from a "base" library. The dimensionality of the resulting vector is reduced, and the final description obtained in this way is used to perform the search. An experimental study demonstrates that the proposed method is workable. Its efficiency is compared with that of existing methods of searching for similar code sequences. We also provide recommendations on choosing the parameters of the developed method.
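The pipeline outlined in the abstract can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration and is not taken from the article: the mnemonic-to-group table `GROUPS`, the use of a normalized longest common subsequence length as the pairwise similarity, and the Euclidean nearest-neighbour search. The dimensionality reduction step is omitted to keep the example short.

```python
# Hypothetical mapping from x86 mnemonics to functional groups (illustrative only).
GROUPS = {
    "mov": "D", "push": "D", "pop": "D", "lea": "D",          # data transfer
    "add": "A", "sub": "A", "imul": "A", "inc": "A",          # arithmetic
    "and": "L", "or": "L", "xor": "L",                        # logic
    "jmp": "C", "je": "C", "jne": "C", "call": "C", "ret": "C",  # control flow
    "cmp": "T", "test": "T",                                  # comparison
}


def primary_description(mnemonics):
    """Replace each instruction with its group label, preserving order."""
    return [GROUPS.get(m, "O") for m in mnemonics]  # "O" = other


def lcs_length(a, b):
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Assumed similarity: LCS length normalized to the range [0, 1]."""
    if not a and not b:
        return 1.0
    return 2.0 * lcs_length(a, b) / (len(a) + len(b))


def intermediate_description(func, base_library):
    """Featureless description: similarities to every 'base' function."""
    return [similarity(func, base) for base in base_library]


def search(query_vec, indexed):
    """Return the name of the indexed function closest to the query vector."""
    def dist(v):
        return sum((p - q) ** 2 for p, q in zip(query_vec, v)) ** 0.5
    return min(indexed, key=lambda name: dist(indexed[name]))


# Toy "base" library and search index (all function bodies are made up).
base = [primary_description(f) for f in (
    ["push", "mov", "call", "ret"],
    ["mov", "add", "cmp", "jne", "ret"],
    ["xor", "test", "je", "ret"],
)]
index = {
    name: intermediate_description(primary_description(body), base)
    for name, body in {
        "copy_loop": ["mov", "mov", "add", "cmp", "jne", "ret"],
        "zero_check": ["xor", "test", "je", "mov", "ret"],
    }.items()
}
query = primary_description(["mov", "add", "sub", "cmp", "jne", "ret"])
best = search(intermediate_description(query, base), index)
print(best)  # copy_loop
```

Because each function is represented only by its similarities to the base functions, no hand-crafted features are needed; the length of the description equals the size of the base library, which is why a dimensionality reduction step is useful before the search.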

Keywords:
searching, code sequence, featureless recognition.

Citation:
Yumaganov AS, Myasnikov VV. A method of searching for similar code sequences in executable binary files using a featureless approach. Computer Optics 2017; 41(5): 756-764. DOI: 10.18287/2412-6179-2017-41-5-756-764.

