CLAP: Learning Transferable Binary Code Representations with Natural
Language Supervision
- URL: http://arxiv.org/abs/2402.16928v1
- Date: Mon, 26 Feb 2024 13:49:52 GMT
- Title: CLAP: Learning Transferable Binary Code Representations with Natural
Language Supervision
- Authors: Hao Wang, Zeyu Gao, Chao Zhang, Zihan Sha, Mingyang Sun, Yuchen Zhou,
Wenyu Zhu, Wenju Sun, Han Qiu, Xi Xiao
- Abstract summary: We present CLAP (Contrastive Language-Assembly Pre-training), which employs natural language supervision to learn better representations of binary code.
At its core, our approach achieves superior transfer learning capability by effectively aligning binary code with its semantic explanations.
We have generated 195 million pairs of binary code and explanations and trained a prototype of CLAP.
- Score: 22.42846252594693
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Binary code representation learning has shown significant performance in
binary analysis tasks. However, existing solutions often have poor transferability,
particularly in few-shot and zero-shot scenarios where few or no training
samples are available for the tasks. To address this problem, we present CLAP
(Contrastive Language-Assembly Pre-training), which employs natural language
supervision to learn better representations of binary code (i.e., assembly
code) and achieve better transferability. At its core, our approach attains superior
transfer learning capability by effectively aligning binary code with its
semantic explanations (in natural language), resulting in a model able to
generate better embeddings for binary code. To enable this alignment training,
we propose an efficient dataset engine that automatically generates a
large and diverse dataset comprising binary code and corresponding natural
language explanations. We have generated 195 million pairs of binary code and
explanations and trained a prototype of CLAP. Evaluations of CLAP across
various downstream binary analysis tasks all demonstrate exceptional
performance. Notably, without any task-specific training, CLAP is often
competitive with a fully supervised baseline, showing excellent
transferability. We release our pre-trained model and code at
https://github.com/Hustcw/CLAP.
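As a rough illustration of the alignment objective described in the abstract, the sketch below pairs an assembly encoder with a text encoder and trains them with a symmetric, CLIP-style contrastive (InfoNCE) loss over a batch of (binary code, explanation) pairs. The encoder modules, embedding sizes, and temperature initialization are placeholders, not values taken from the paper.

```python
# Minimal sketch of CLIP-style contrastive alignment between assembly code and
# natural-language explanations. The two encoders are assumed stand-ins; the
# actual CLAP architecture and hyperparameters may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualEncoder(nn.Module):
    def __init__(self, asm_encoder: nn.Module, text_encoder: nn.Module):
        super().__init__()
        self.asm_encoder = asm_encoder    # embeds (tokenized) assembly code
        self.text_encoder = text_encoder  # embeds natural-language explanations
        # learnable temperature, CLIP-style initialization (log(1/0.07))
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, asm_inputs, text_inputs):
        # L2-normalize so the dot product is a cosine similarity
        asm_emb = F.normalize(self.asm_encoder(asm_inputs), dim=-1)
        txt_emb = F.normalize(self.text_encoder(text_inputs), dim=-1)
        # (batch, batch) similarity matrix: code i vs. every explanation j
        logits = self.logit_scale.exp() * asm_emb @ txt_emb.t()
        # matched (code, explanation) pairs lie on the diagonal
        targets = torch.arange(logits.size(0), device=logits.device)
        # symmetric InfoNCE: code-to-text and text-to-code
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2


if __name__ == "__main__":
    # toy encoders over fixed-size feature vectors, purely for illustration
    asm_enc = nn.Sequential(nn.Linear(128, 768), nn.ReLU(), nn.Linear(768, 768))
    txt_enc = nn.Sequential(nn.Linear(128, 768), nn.ReLU(), nn.Linear(768, 768))
    model = DualEncoder(asm_enc, txt_enc)
    loss = model(torch.randn(8, 128), torch.randn(8, 128))
    loss.backward()
```

Under this kind of setup, the zero-shot transfer claimed in the abstract amounts to embedding a query function and a set of candidate natural-language label descriptions with the two encoders and picking the label with the highest cosine similarity.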
Related papers
- Crystal: Illuminating LLM Abilities on Language and Code [58.5467653736537]
We propose a pretraining strategy to enhance the integration of natural language and coding capabilities.
The resulting model, Crystal, demonstrates remarkable capabilities in both domains.
arXiv Detail & Related papers (2024-11-06T10:28:46Z)
- How Far Have We Gone in Binary Code Understanding Using Large Language Models [51.527805834378974]
We propose a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in binary code understanding.
Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2024-04-15T14:44:08Z)
- IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators [49.903001442804594]
This work investigates the prospect of leveraging compiler intermediate representations (IR) to improve the multilingual capabilities of Code-LMs.
We first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files.
Next, we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to learn the IR language.
Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics.
arXiv Detail & Related papers (2024-03-06T17:52:08Z)
- Code Representation Learning At Scale [75.04686476303436]
We fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme.
We first train the encoders via a mix that leverages both the randomness of masked language modeling and the structural aspects of programming languages.
We then enhance the representations via contrastive learning with hard negatives and hard positives constructed in an unsupervised manner (a generic sketch of this contrastive stage follows the list below).
arXiv Detail & Related papers (2024-02-02T22:19:15Z)
- TransformCode: A Contrastive Learning Framework for Code Embedding via Subtree Transformation [9.477734501499274]
We present TransformCode, a novel framework that learns code embeddings in a contrastive learning manner.
Our framework is encoder-agnostic and language-agnostic, which means that it can leverage any encoder model and handle any programming language.
arXiv Detail & Related papers (2023-11-10T09:05:23Z)
- CP-BCS: Binary Code Summarization Guided by Control Flow Graph and Pseudo Code [79.87518649544405]
We present a control flow graph and pseudo code guided binary code summarization framework called CP-BCS.
CP-BCS utilizes a bidirectional instruction-level control flow graph and pseudo code that incorporates expert knowledge to learn the comprehensive binary function execution behavior and logic semantics.
arXiv Detail & Related papers (2023-10-24T14:20:39Z)
- TransCoder: Towards Unified Transferable Code Representation Learning Inspired by Human Skills [31.75121546422898]
We present TransCoder, a unified Transferable fine-tuning strategy for Code representation learning.
We employ a tunable prefix encoder as the meta-learner to capture cross-task and cross-language transferable knowledge.
Our method can lead to superior performance on various code-related tasks and encourage mutual reinforcement.
arXiv Detail & Related papers (2023-05-23T06:59:22Z)
- LeTI: Learning to Generate from Textual Interactions [60.425769582343506]
We explore LMs' potential to learn from textual interactions (LETI) that not only check their correctness with binary labels but also pinpoint and explain errors in their outputs through textual feedback.
Our focus is the code generation task, where the model produces code based on natural language instructions.
LETI iteratively fine-tunes the model, using the LM objective, on a concatenation of natural language instructions, LM-generated programs, and textual feedback.
arXiv Detail & Related papers (2023-05-17T15:53:31Z)
- Pre-Training Representations of Binary Code Using Contrastive Learning [13.570375923483452]
We propose a COntrastive learning Model for Binary cOde Analysis, or COMBO, that incorporates source code and comment information into binary code during representation learning.
COMBO is the first language representation model that incorporates source code, binary code, and comments into contrastive code representation learning.
arXiv Detail & Related papers (2022-10-11T02:39:06Z)
- CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation [36.47905744758698]
We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed from the developer-assigned identifiers.
Our model employs a unified framework to seamlessly support both code understanding and generation tasks and allows for multi-task learning.
arXiv Detail & Related papers (2021-09-02T12:21:06Z)
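Several of the contrastive approaches listed above (e.g., Code Representation Learning At Scale, TransformCode, COMBO) share a common second-stage recipe: pull a code embedding toward an augmented "positive" view of the same snippet while pushing it away from in-batch negatives, the most similar of which act as hard negatives. The function below is a generic, minimal sketch of that stage, not a reproduction of any specific paper's loss; the augmentation that produces the positive view and the temperature value are assumptions.

```python
# Generic in-batch contrastive loss for code embeddings; the "positive" view is
# assumed to come from some augmentation (e.g., a code transformation), and the
# temperature is a placeholder, not any paper's reported value.
import torch
import torch.nn.functional as F


def in_batch_contrastive_loss(anchor: torch.Tensor,
                              positive: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """anchor, positive: (batch, dim) embeddings of two views of the same code.

    Diagonal entries of the similarity matrix are the true pairs; every other
    entry in a row serves as a negative, and the highest-scoring ones behave
    as hard negatives during training.
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    sim = anchor @ positive.t() / temperature            # (batch, batch)
    targets = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, targets)
```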
This list is automatically generated from the titles and abstracts of the papers on this site.