Synthetic Datasets for Program Similarity Research
- URL: http://arxiv.org/abs/2405.03478v1
- Date: Mon, 6 May 2024 13:52:02 GMT
- Title: Synthetic Datasets for Program Similarity Research
- Authors: Alexander Interrante-Grant, Michael Wang, Lisa Baer, Ryan Whelan, Tim Leek
- Abstract summary: HELIX is a framework for generating large, synthetic program similarity datasets.
Blind HELIX is a tool built on top of HELIX for extracting HELIX components from library code automatically using program slicing.
- Score: 39.91303506884272
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Program similarity has become an increasingly popular area of research with various security applications such as plagiarism detection, author identification, and malware analysis. However, program similarity research faces a few unique dataset quality problems in evaluating the effectiveness of novel approaches. First, few high-quality datasets for binary program similarity exist and are widely used in this domain. Second, there are potentially many different, disparate definitions of what makes one program similar to another and in many cases there is often a large semantic gap between the labels provided by a dataset and any useful notion of behavioral or semantic similarity. In this paper, we present HELIX - a framework for generating large, synthetic program similarity datasets. We also introduce Blind HELIX, a tool built on top of HELIX for extracting HELIX components from library code automatically using program slicing. We evaluate HELIX and Blind HELIX by comparing the performance of program similarity tools on a HELIX dataset to a hand-crafted dataset built from multiple, disparate notions of program similarity. Using Blind HELIX, we show that HELIX can generate realistic and useful datasets of virtually infinite size for program similarity research with ground truth labels that embody practical notions of program similarity. Finally, we discuss the results and reason about relative tool ranking.
Related papers
- SiReRAG: Indexing Similar and Related Information for Multihop Reasoning [96.60045548116584]
SiReRAG is a novel RAG indexing approach that explicitly considers both similar and related information.
SiReRAG consistently outperforms state-of-the-art indexing methods on three multihop datasets.
arXiv Detail & Related papers (2024-12-09T04:56:43Z)
- Outlier Detection in Large Radiological Datasets using UMAP [1.206248959194646]
In biomedical data, variations in image quality, labeling, reports, and archiving can lead to errors, inconsistencies, and repeated samples.
Here, we show that the uniform manifold approximation and projection algorithm can find these anomalies essentially by forming independent clusters.
While the results are archival and retrospective, the graph-based methods work for any data type and will prove equally beneficial for curation at the time of dataset creation.
arXiv Detail & Related papers (2024-07-31T00:56:06Z)
- Advanced Detection of Source Code Clones via an Ensemble of Unsupervised Similarity Measures [0.0]
This research introduces a novel ensemble learning approach for code similarity assessment.
The key idea is that the strengths of a diverse set of similarity measures can complement each other and mitigate individual weaknesses.
arXiv Detail & Related papers (2024-05-03T13:42:49Z)
- Relation-aware Ensemble Learning for Knowledge Graph Embedding [68.94900786314666]
We propose to learn an ensemble by leveraging existing methods in a relation-aware manner.
Exploring these semantics using a relation-aware ensemble leads to a much larger search space than general ensemble methods.
We propose a divide-search-combine algorithm RelEns-DSC that searches the relation-wise ensemble weights independently.
arXiv Detail & Related papers (2023-10-13T07:40:12Z)
- EMBERSim: A Large-Scale Databank for Boosting Similarity Search in Malware Analysis [48.5877840394508]
In recent years there has been a shift from quantifications-based malware detection towards machine learning.
We propose to address the deficiencies in the space of similarity research on binary files, starting from EMBER.
We enhance EMBER with similarity information as well as malware class tags, to enable further research in the similarity space.
arXiv Detail & Related papers (2023-10-03T06:58:45Z)
- Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models.
In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
arXiv Detail & Related papers (2023-04-04T17:54:32Z)
- Hierarchical Locality Sensitive Hashing for Structured Data: A Survey [8.045541999149002]
The Locality Sensitive Hashing (LSH) technique has been proposed to provide accurate estimators for various similarity measures between sets or vectors.
In this paper, we explore the present progress of the research into hierarchical LSH algorithms.
arXiv Detail & Related papers (2022-04-24T07:18:04Z)
- Attributable Visual Similarity Learning [90.69718495533144]
This paper proposes an attributable visual similarity learning (AVSL) framework for a more accurate and explainable similarity measure between images.
Motivated by the human semantic similarity cognition, we propose a generalized similarity learning paradigm to represent the similarity between two images with a graph.
Experiments on the CUB-200-2011, Cars196, and Stanford Online Products datasets demonstrate significant improvements over existing deep similarity learning methods.
arXiv Detail & Related papers (2022-03-28T17:35:31Z)
- User-friendly Comparison of Similarity Algorithms on Wikidata [2.8551587610394904]
We present a user-friendly interface that allows flexible computation of similarity between Qnodes in Wikidata.
At present, the similarity interface supports four algorithms, based on: graph embeddings (TransE, ComplEx), text embeddings (BERT) and class-based similarity.
We also provide a REST API that can compute most similar neighbors for any Qnode in Wikidata.
arXiv Detail & Related papers (2021-08-11T18:59:25Z)
- Hierarchical Similarity Learning for Language-based Product Image Retrieval [40.83290730640458]
This paper focuses on the cross-modal similarity measurement, and proposes a novel Hierarchical Similarity Learning network.
Experiments on a large-scale product retrieval dataset demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2021-02-18T14:23:16Z)
- Efficient Clustering from Distributions over Topics [0.0]
We present an approach that relies on the results of a topic modeling algorithm over documents in a collection as a means to identify smaller subsets of documents where the similarity function can be computed.
This approach has proved to obtain promising results when identifying similar documents in the domain of scientific publications.
arXiv Detail & Related papers (2020-12-15T10:52:19Z)
- LSF-Join: Locality Sensitive Filtering for Distributed All-Pairs Set Similarity Under Skew [58.21885402826496]
All-pairs set similarity is a widely used data mining task, even for large and high-dimensional datasets.
We present a new distributed algorithm, LSF-Join, for approximate all-pairs set similarity.
We show that LSF-Join efficiently finds most close pairs, even for small similarity thresholds and for skewed input sets.
arXiv Detail & Related papers (2020-03-06T00:06:20Z)