On Training a Neural Network to Explain Binaries
- URL: http://arxiv.org/abs/2404.19631v1
- Date: Tue, 30 Apr 2024 15:34:51 GMT
- Title: On Training a Neural Network to Explain Binaries
- Authors: Alexander Interrante-Grant, Andy Davis, Heather Preslier, Tim Leek
- Abstract summary: In this work, we investigate the possibility of training a deep neural network on the task of binary code understanding.
We build our own dataset derived from a capture of Stack Overflow containing 1.1M entries.
- Score: 43.27448128029069
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we begin to investigate the possibility of training a deep neural network on the task of binary code understanding. Specifically, the network would take, as input, features derived directly from binaries and output English descriptions of functionality to aid a reverse engineer in investigating the capabilities of a piece of closed-source software, be it malicious or benign. Given recent success in applying large language models (generative AI) to the task of source code summarization, this seems a promising direction. However, in our initial survey of the available datasets, we found nothing of sufficiently high quality and volume to train these complex models. Instead, we build our own dataset derived from a capture of Stack Overflow containing 1.1M entries. A major result of our work is a novel dataset evaluation method using the correlation between two distances on sample pairs: one distance in the embedding space of inputs and the other in the embedding space of outputs. Intuitively, if two samples have inputs close in the input embedding space, their outputs should also be close in the output embedding space. We found this Embedding Distance Correlation (EDC) test to be highly diagnostic, indicating that our collected dataset and several existing open-source datasets are of low quality as the distances are not well correlated. We proceed to explore the general applicability of EDC, applying it to a number of qualitatively known good datasets and a number of synthetically known bad ones and found it to be a reliable indicator of dataset value.
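The Embedding Distance Correlation (EDC) test described above can be sketched in a few lines: embed each sample's input and output, compute pairwise distances in both embedding spaces, and correlate the two distance lists. This is a minimal illustration, not the authors' implementation; the abstract does not specify the embedding models, distance metric, or correlation statistic, so cosine distance and Pearson correlation are assumptions here, and the embedding matrices stand in for whatever encoders a practitioner would use.

```python
# Minimal sketch of an Embedding Distance Correlation (EDC) style test.
# Assumptions: cosine distance and Pearson correlation (the abstract does
# not name the metric or statistic used in the paper).
from itertools import combinations

import numpy as np


def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def edc_score(input_embs, output_embs):
    """Correlate pairwise input distances with pairwise output distances.

    High correlation suggests samples with similar inputs also have similar
    outputs -- the property the paper uses to judge dataset quality.
    """
    d_in, d_out = [], []
    for i, j in combinations(range(len(input_embs)), 2):
        d_in.append(cosine_distance(input_embs[i], input_embs[j]))
        d_out.append(cosine_distance(output_embs[i], output_embs[j]))
    return float(np.corrcoef(d_in, d_out)[0, 1])


# Toy check: when input and output geometries are identical, the
# distances match exactly and the correlation is maximal.
rng = np.random.default_rng(0)
embs = rng.normal(size=(10, 16))
good_score = edc_score(embs, embs)

# Shuffling the outputs breaks the input-output correspondence, which
# should lower the correlation -- a synthetically "bad" dataset.
bad_score = edc_score(embs, rng.permutation(embs))
```

In this framing, a low `edc_score` on a real dataset would flag it the way the paper flags its Stack Overflow capture: inputs that look alike but map to unrelated outputs.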
Related papers
- Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance.
DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator.
Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-20T01:34:13Z) - Capture the Flag: Uncovering Data Insights with Large Language Models [90.47038584812925]
This study explores the potential of using Large Language Models (LLMs) to automate the discovery of insights in data.
We propose a new evaluation methodology based on a "capture the flag" principle, measuring the ability of such models to recognize meaningful and pertinent information (flags) in a dataset.
arXiv Detail & Related papers (2023-12-21T14:20:06Z) - Knowledge Combination to Learn Rotated Detection Without Rotated Annotation [53.439096583978504]
Rotated bounding boxes drastically reduce output ambiguity of elongated objects.
Despite their effectiveness, rotated detectors are not widely employed.
We propose a framework that allows the model to predict precise rotated boxes.
arXiv Detail & Related papers (2023-04-05T03:07:36Z) - Nearest Neighbor-Based Contrastive Learning for Hyperspectral and LiDAR Data Classification [45.026868970899514]
We propose a Nearest Neighbor-based Contrastive Learning Network (NNCNet) to learn discriminative feature representations.
Specifically, we propose a nearest neighbor-based data augmentation scheme that exploits the enhanced semantic relationships among nearby regions.
In addition, we design a bilinear attention module to exploit the second-order and even high-order feature interactions between the HSI and LiDAR data.
arXiv Detail & Related papers (2023-01-09T13:43:54Z) - Semi-Supervised Building Footprint Generation with Feature and Output Consistency Training [17.6179873429447]
State-of-the-art semi-supervised semantic segmentation networks with consistency training can help to deal with this issue.
We propose to integrate the consistency of both features and outputs in the end-to-end network training of unlabeled samples.
Experimental results show that the proposed approach extracts more complete building structures.
arXiv Detail & Related papers (2022-05-17T14:55:13Z) - Iterative Rule Extension for Logic Analysis of Data: an MILP-based heuristic to derive interpretable binary classification from large datasets [0.6526824510982799]
This work presents IRELAND, an algorithm that allows for abstracting Boolean phrases in DNF from data with up to 10,000 samples and sample characteristics.
The results show that for large datasets IRELAND outperforms the current state-of-the-art and can find solutions for datasets where current models run out of memory or need excessive runtimes.
arXiv Detail & Related papers (2021-10-25T13:31:30Z) - Embracing Structure in Data for Billion-Scale Semantic Product Search [14.962039276966319]
We present principled approaches to train and deploy dyadic neural embedding models at the billion scale.
We show that exploiting the natural structure of real-world datasets helps address both challenges efficiently.
arXiv Detail & Related papers (2021-10-12T16:14:13Z) - Anchor-free Oriented Proposal Generator for Object Detection [59.54125119453818]
Oriented object detection is a practical and challenging task in remote sensing image interpretation.
Nowadays, oriented detectors mostly use horizontal boxes as an intermediary from which to derive oriented boxes.
We propose a novel Anchor-free Oriented Proposal Generator (AOPG) that abandons the horizontal boxes-related operations from the network architecture.
arXiv Detail & Related papers (2021-10-05T10:45:51Z) - Combining Feature and Instance Attribution to Detect Artifacts [62.63504976810927]
We propose methods to facilitate identification of training data artifacts.
We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data.
We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice.
arXiv Detail & Related papers (2021-07-01T09:26:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.