A SEMANTIC METRIC LEARNING APPROACH FOR ENHANCED MALWARE SIMILARITY SEARCH

Yuki Nakamura; Hiroshi Tanaka

doi:10.55640/ijidml-v02i01-01

Open Access

Graduate School of Information Science and Technology, University of Tokyo, Japan

Department of Computer Science, Kyoto University, Japan

Abstract

Identifying and categorizing malware variants efficiently is a critical capability for modern cybersecurity systems tasked with defending against rapidly evolving threats. Traditional similarity search techniques often rely on syntactic or signature-based comparisons, which are insufficient for capturing deeper semantic relationships among malware samples, especially in the presence of obfuscation and polymorphism. This research introduces a semantic metric learning approach for enhanced malware similarity search that leverages deep neural embeddings trained to capture high-level behavioral and structural characteristics of malicious code. By employing a supervised metric learning framework with contrastive and triplet loss functions, the model learns a discriminative embedding space in which semantically similar malware instances are mapped closer together while dissimilar samples are pushed farther apart. Experimental evaluations on benchmark malware datasets demonstrate that the proposed method significantly outperforms traditional hashing and signature-based approaches in retrieval precision, recall, and mean average precision. The results underscore the potential of semantic metric learning to advance malware analysis, facilitate threat hunting, and improve incident response workflows by enabling more accurate and scalable similarity-based retrieval.
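The abstract names contrastive and triplet losses and mean average precision but gives no formulas. The following is a minimal sketch of one common form of each, not the authors' implementation. The function names, the Euclidean distance, the unsquared triplet distance, and the margin value of 1.0 are all assumptions made for illustration. Embeddings are plain lists of floats.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss (assumed form).

    Zero once the negative is at least `margin` farther from the
    anchor than the positive is; positive otherwise.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def contrastive_loss(u, v, same_family, margin=1.0):
    """Pairwise contrastive loss (assumed form, no 1/2 scaling).

    Similar pairs are penalized by squared distance; dissimilar
    pairs are penalized only while they sit inside the margin.
    """
    d = euclidean(u, v)
    if same_family:
        return d ** 2
    return max(0.0, margin - d) ** 2


def average_precision(ranked_relevance):
    """Average precision for one query.

    `ranked_relevance` lists, in ranked order, whether each retrieved
    sample is relevant. Only retrieved relevant items are counted.
    """
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(queries):
    """Mean of per-query average precision over a list of queries."""
    if not queries:
        return 0.0
    return sum(average_precision(q) for q in queries) / len(queries)
```

In this sketch a triplet whose negative already lies far from the anchor contributes zero loss, so only triplets that violate the margin would drive training. Mean average precision is the per-query average precision averaged over all query samples.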

Keywords

Malware similarity search, Semantic metric learning, Deep embeddings, Contrastive learning

International Journal of Intelligent Data and Machine Learning
