Related papers: A Unified Evaluation of Learning-Based Similarity Techniques for Malware Detection

URL: http://arxiv.org/abs/2602.15376v1
Date: Tue, 17 Feb 2026 06:16:23 GMT
Title: A Unified Evaluation of Learning-Based Similarity Techniques for Malware Detection
Authors: Udbhav Prasad, Aniesh Chawla
Abstract summary: Similarity-based techniques enable approximate matching, allowing related byte sequences to produce measurably similar fingerprints. Security researchers have proposed a range of approaches, including similarity digests and locality-sensitive hashes. This paper presents a systematic comparison of learning-based classification and similarity methods using large, publicly available datasets.
Score: 0.0
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Cryptographic digests (e.g., MD5, SHA-256) are designed to provide exact identity. Any single-bit change in the input produces a completely different hash, which is ideal for integrity verification but limits their usefulness in many real-world tasks like threat hunting, malware analysis and digital forensics, where adversaries routinely introduce minor transformations. Similarity-based techniques address this limitation by enabling approximate matching, allowing related byte sequences to produce measurably similar fingerprints. Modern enterprises manage tens of thousands of endpoints with billions of files, making the effectiveness and scalability of the proposed techniques more important than ever in security applications. Security researchers have proposed a range of approaches, including similarity digests and locality-sensitive hashes (e.g., ssdeep, sdhash, TLSH), as well as more recent machine-learning-based methods that generate embeddings from file features. However, these techniques have largely been evaluated in isolation, using disparate datasets and evaluation criteria. This paper presents a systematic comparison of learning-based classification and similarity methods using large, publicly available datasets. We evaluate each method under a unified experimental framework with industry-accepted metrics. To our knowledge, this is the first reproducible study to benchmark these diverse learning-based similarity techniques side by side for real-world security workloads. Our results show that no single approach performs well across all dimensions; instead, each exhibits distinct trade-offs, indicating that effective malware analysis and threat-hunting platforms must combine complementary classification and similarity techniques rather than rely on a single method.
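
The contrast the abstract draws between exact-identity digests and approximate matching can be illustrated with a minimal Python sketch. It is not taken from the paper: the byte n-gram Jaccard score is a toy stand-in chosen for illustration and is not ssdeep, sdhash, TLSH, or any learned embedding, and the function names and the 1024-byte sample input are assumptions.

```python
import hashlib


def sha256_bit_difference(a: bytes, b: bytes) -> float:
    """Fraction of the 256 digest bits that differ between SHA-256(a) and SHA-256(b)."""
    da = int.from_bytes(hashlib.sha256(a).digest(), "big")
    db = int.from_bytes(hashlib.sha256(b).digest(), "big")
    return bin(da ^ db).count("1") / 256


def ngram_jaccard(a: bytes, b: bytes, n: int = 4) -> float:
    """Toy similarity score: Jaccard index over the sets of byte n-grams.

    Hypothetical helper for illustration only; not ssdeep, sdhash, or TLSH.
    """
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Assumed sample data: a 1024-byte stand-in for a file, and a copy with one bit flipped.
original = bytes(range(256)) * 4
tampered = bytearray(original)
tampered[100] ^= 0x01
modified = bytes(tampered)

# The cryptographic digests no longer match after the single-bit change.
print(hashlib.sha256(original).hexdigest() == hashlib.sha256(modified).hexdigest())
# Fraction of digest bits that changed.
print(f"digest bits changed: {sha256_bit_difference(original, modified):.2f}")
# The toy similarity score is computed on the raw bytes, so it changes only slightly.
print(f"n-gram Jaccard similarity: {ngram_jaccard(original, modified):.2f}")
```

The toy score compares raw byte content directly, so it only illustrates the idea of approximate matching; the techniques the paper evaluates produce compact fingerprints or embeddings instead.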

Related papers

Combine and Conquer: A Meta-Analysis on Data Shift and Out-of-Distribution Detection [30.377446496559635]
This paper introduces a universal approach to seamlessly combine out-of-distribution (OOD) detection scores. Our framework is easily extensible for future developments in detection scores and stands as the first to combine decision boundaries in this context.
arXiv Detail & Related papers (2024-06-23T08:16:44Z)
Deep Learning Fusion For Effective Malware Detection: Leveraging Visual Features [12.431734971186673]
We investigate the power of fusing Convolutional Neural Network models trained on different modalities of a malware executable. We are proposing a novel multimodal fusion algorithm, leveraging three different visual malware features. The proposed strategy has a detection rate of 1.00 (on a scale of 0-1) in identifying malware in the given dataset.
arXiv Detail & Related papers (2024-05-23T08:32:40Z)
Semantic-embedded Similarity Prototype for Scene Recognition [12.236534954126155]
This paper proposes a semantic knowledge-based similarity prototype. It can help the scene recognition network achieve superior accuracy without increasing the computational cost in practice. Our similarity prototype enhances the performance of existing networks, all while avoiding any additional computational burden in practical deployments.
arXiv Detail & Related papers (2023-08-11T01:11:46Z)
Better Understanding Differences in Attribution Methods via Systematic Evaluations [57.35035463793008]
Post-hoc attribution methods have been proposed to identify image regions most influential to the models' decisions. We propose three novel evaluation schemes to more reliably measure the faithfulness of those methods. We use these evaluation schemes to study strengths and shortcomings of some widely used attribution methods over a wide range of models.
arXiv Detail & Related papers (2023-03-21T14:24:58Z)
Rethinking Clustering-Based Pseudo-Labeling for Unsupervised Meta-Learning [146.11600461034746]
CACTUs, a method for unsupervised meta-learning, is a clustering-based approach with pseudo-labeling. This approach is model-agnostic and can be combined with supervised algorithms to learn from unlabeled data. We prove that the core reason for this is the lack of a clustering-friendly property in the embedding space.
arXiv Detail & Related papers (2022-09-27T19:04:36Z)
Towards Better Understanding Attribution Methods [77.1487219861185]
Post-hoc attribution methods have been proposed to identify image regions most influential to the models' decisions. We propose three novel evaluation schemes to more reliably measure the faithfulness of those methods. We also propose a post-processing smoothing step that significantly improves the performance of some attribution methods.
arXiv Detail & Related papers (2022-05-20T20:50:17Z)
Human-in-the-Loop Disinformation Detection: Stance, Sentiment, or Something Else? [93.91375268580806]
Both politics and pandemics have recently provided ample motivation for the development of machine learning-enabled disinformation (a.k.a. fake news) detection algorithms. Existing literature has focused primarily on the fully-automated case, but the resulting techniques cannot reliably detect disinformation on the varied topics, sources, and time scales required for military applications. By leveraging an already-available analyst as a human-in-the-loop, canonical machine learning techniques of sentiment analysis, aspect-based sentiment analysis, and stance detection become plausible methods to use for a partially-automated disinformation detection system.
arXiv Detail & Related papers (2021-11-09T13:30:34Z)
A Comprehensive Study on Learning-Based PE Malware Family Classification Methods [9.142578100395909]
Portable Executable (PE) malware has been consistently evolving in terms of both volume and sophistication. Three mainstream approaches that use learning-based algorithms, as categorized by the input format the methods take, are image-based, binary-based and disassembly-based approaches. In this work, we conduct a thorough empirical study on learning-based PE malware classification approaches on 4 different datasets and consistent experiment settings.
arXiv Detail & Related papers (2021-10-29T05:32:28Z)
Cross-Domain Similarity Learning for Face Recognition in Unseen Domains [90.35908506994365]
We introduce a novel cross-domain metric learning loss, which we dub Cross-Domain Triplet (CDT) loss, to improve face recognition in unseen domains. The CDT loss encourages learning semantically meaningful features by enforcing compact feature clusters of identities from one domain. Our method does not require careful hard-pair sample mining and filtering strategy during training.
arXiv Detail & Related papers (2021-03-12T19:48:01Z)
CIMON: Towards High-quality Hash Codes [63.37321228830102]
We propose a new method named Comprehensive sImilarity Mining and cOnsistency learNing (CIMON). First, we use global refinement and similarity statistical distribution to obtain reliable and smooth guidance. Second, both semantic and contrastive consistency learning are introduced to derive both disturb-invariant and discriminative hash codes.
arXiv Detail & Related papers (2020-10-15T14:47:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.