Cross-Language Binary-Source Code Matching with Intermediate
Representations
- URL: http://arxiv.org/abs/2201.07420v1
- Date: Wed, 19 Jan 2022 05:17:02 GMT
- Title: Cross-Language Binary-Source Code Matching with Intermediate
Representations
- Authors: Yi Gui, Yao Wan, Hongyu Zhang, Huifang Huang, Yulei Sui, Guandong Xu,
Zhiyuan Shao, Hai Jin
- Abstract summary: This paper formulates the problem of cross-language binary-source code matching, and develops a new dataset for this new problem.
We present XLIR, a novel Transformer-based neural network that learns intermediate representations for both binary and source code.
Our proposed XLIR with intermediate representations significantly outperforms other state-of-the-art models in both tasks.
- Score: 27.843666274502198
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Binary-source code matching plays an important role in many security and
software engineering related tasks such as malware detection, reverse
engineering and vulnerability assessment. Currently, several approaches have
been proposed for binary-source code matching by jointly learning the
embeddings of binary code and source code in a common vector space. Despite
much effort, existing approaches focus on matching binary code and source
code written in a single programming language. However, in practice, software
applications are often written in different programming languages to cater for
different requirements and computing platforms. Matching binary and source code
across programming languages introduces additional challenges when maintaining
multi-language and multi-platform applications. To this end, this paper
formulates the problem of cross-language binary-source code matching, and
develops a new dataset for it. We present XLIR, a novel
Transformer-based neural network that learns intermediate
representations for both binary and source code. To validate the effectiveness
of XLIR, comprehensive experiments are conducted on two tasks of cross-language
binary-source code matching, and cross-language source-source code matching, on
top of our curated dataset. Experimental results and analysis show that our
proposed XLIR with intermediate representations significantly outperforms other
state-of-the-art models in both tasks.
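The embedding-and-match idea can be sketched as follows. This is an illustrative toy, not the authors' XLIR implementation: the hand-picked vectors stand in for the embeddings a Transformer would produce from intermediate representations (e.g., LLVM IR) of binary and source code.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_binary_to_source(binary_emb, source_embs):
    """Return (index, score) of the source embedding most similar to binary_emb."""
    scores = [cosine_similarity(binary_emb, s) for s in source_embs]
    best = int(np.argmax(scores))
    return best, scores[best]

# Toy example: a binary-code embedding that lies close to the second
# source-code candidate in the shared vector space.
binary_emb = np.array([0.1, 0.9, 0.2])
source_embs = [np.array([0.9, 0.1, 0.0]),
               np.array([0.1, 0.8, 0.3]),
               np.array([0.5, 0.5, 0.5])]
idx, score = match_binary_to_source(binary_emb, source_embs)
print(idx, round(score, 3))
```

Because both code forms are first lowered to a common intermediate representation, the same matching step applies whether the source is C, C++, or another compiled language; the language-specific surface syntax never reaches the embedding model.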
Related papers
- How Far Have We Gone in Stripped Binary Code Understanding Using Large Language Models [51.527805834378974]
We propose a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in binary code understanding.
Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2024-04-15T14:44:08Z)
- DA-Net: A Disentangled and Adaptive Network for Multi-Source Cross-Lingual Transfer Learning [11.78085199896157]
Multi-source cross-lingual transfer learning transfers task knowledge from multiple labelled source languages to an unlabelled target language under language shift.
We propose a Disentangled and Adaptive Network (DA-Net) to address these challenges.
arXiv Detail & Related papers (2024-03-07T02:30:46Z)
- IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators [49.903001442804594]
This work investigates the prospect of leveraging compiler intermediate representations (IR) to improve the multilingual capabilities of Code-LMs.
We first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files.
Next, we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to learn the IR language.
Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics.
arXiv Detail & Related papers (2024-03-06T17:52:08Z)
- AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual Adaptation for Code Clone Detection [69.79627042058048]
AdaCCD is a novel cross-lingual adaptation method that can detect cloned codes in a new language without annotations in that language.
We evaluate the cross-lingual adaptation results of AdaCCD by constructing a multilingual code clone detection benchmark consisting of 5 programming languages.
arXiv Detail & Related papers (2023-11-13T12:20:48Z)
- Pre-Training Representations of Binary Code Using Contrastive Learning [14.1548548120994]
We propose a COntrastive learning Model for Binary cOde Analysis, or COMBO, that incorporates source code and comment information into binary code during representation learning.
COMBO is the first language representation model that incorporates source code, binary code, and comments into contrastive code representation learning.
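As an illustration of contrastive representation learning in this setting, the following is a minimal InfoNCE-style sketch (COMBO's actual objective may differ): paired embeddings of the same function, one from its source code and one from its binary code, are pulled together, while mismatched pairs in the batch act as negatives.

```python
import numpy as np

def info_nce_loss(src: np.ndarray, bin_: np.ndarray, temperature: float = 0.1) -> float:
    """InfoNCE loss over a batch of paired (source, binary) embeddings.

    src, bin_: arrays of shape (batch, dim); row i of each is a positive pair.
    """
    # L2-normalise so dot products are cosine similarities.
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    bin_ = bin_ / np.linalg.norm(bin_, axis=1, keepdims=True)
    logits = src @ bin_.T / temperature          # (batch, batch) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive pairs lie on the diagonal; minimise their negative log-likelihood.
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
src = rng.normal(size=(4, 8))
perfect = info_nce_loss(src, src)          # identical pairs: loss near its minimum
shuffled = info_nce_loss(src, src[::-1])   # mismatched pairs: much larger loss
print(perfect < shuffled)
```

The temperature divisor sharpens the softmax over in-batch candidates, so the loss concentrates on pulling each positive pair above its hardest negatives.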
arXiv Detail & Related papers (2022-10-11T02:39:06Z)
- XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence [9.673614921946932]
This paper introduces XLCoST, Cross-Lingual Code SnippeT dataset, a new benchmark dataset for cross-lingual code intelligence.
Our dataset contains fine-grained parallel data from 8 languages, and supports 10 cross-lingual code tasks.
arXiv Detail & Related papers (2022-06-16T22:49:39Z)
- Using Document Similarity Methods to create Parallel Datasets for Code Translation [60.36392618065203]
Translating source code from one programming language to another is a critical, time-consuming task.
We propose to use document similarity methods to create noisy parallel datasets of code.
We show that these models perform comparably to models trained on ground truth for reasonable levels of noise.
arXiv Detail & Related papers (2021-10-11T17:07:58Z)
- Incorporating External Knowledge through Pre-training for Natural Language to Code Generation [97.97049697457425]
Open-domain code generation aims to generate code in a general-purpose programming language from natural language (NL) intents.
We explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation.
Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa.
arXiv Detail & Related papers (2020-04-20T01:45:27Z)
- Bi-Decoder Augmented Network for Neural Machine Translation [108.3931242633331]
We propose a novel Bi-Decoder Augmented Network (BiDAN) for the neural machine translation task.
Since each decoder transforms the representations of the input text into its corresponding language, jointly training with two target ends gives the shared encoder the potential to produce a language-independent semantic space.
arXiv Detail & Related papers (2020-01-14T02:05:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.