Exploring Representation-Level Augmentation for Code Search
- URL: http://arxiv.org/abs/2210.12285v1
- Date: Fri, 21 Oct 2022 22:47:37 GMT
- Title: Exploring Representation-Level Augmentation for Code Search
- Authors: Haochen Li, Chunyan Miao, Cyril Leung, Yanxian Huang, Yuan Huang,
Hongyu Zhang, Yanlin Wang
- Abstract summary: We explore augmentation methods that augment data (both code and query) at the representation level, which requires no additional data processing or training.
We experimentally evaluate the proposed representation-level augmentation methods with state-of-the-art code search models on a large-scale public dataset.
- Score: 50.94201167562845
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code search, which aims at retrieving the most relevant code fragment for a
given natural language query, is a common activity in software development
practice. Recently, contrastive learning has been widely used in code search
research, where many data augmentation approaches for source code (e.g.,
semantic-preserving program transformations) have been proposed to learn better
representations. However, these augmentations operate at the raw-data level,
which requires additional code analysis in the preprocessing stage and incurs
additional costs in the training stage. In this paper, we explore augmentation
methods that augment data (both code and query) at the representation level,
which requires no additional data processing or training. Based on this, we
propose a general format of representation-level augmentation that unifies
existing methods. Then, we propose three new augmentation methods (linear
extrapolation, binary interpolation, and Gaussian scaling) based on the general
format. Furthermore, we theoretically analyze the advantages of the proposed
augmentation methods over traditional contrastive learning methods on code
search. We experimentally evaluate the proposed representation-level
augmentation methods with state-of-the-art code search models on a large-scale
public dataset covering six programming languages. The experimental
results show that our approach can consistently boost the performance of the
studied code search models. Our source code is available at
https://github.com/Alex-HaochenLi/RACS.
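To make the abstract concrete, here is a minimal sketch of how the three proposed operations (linear extrapolation, binary interpolation, and Gaussian scaling) could be applied to already-computed code or query representations in a PyTorch-style setup. The mixing formulas, the hyperparameter names (alpha, p, sigma), and the toy usage at the end are illustrative assumptions inferred from the abstract, not the authors' reference implementation (see the linked repository for that).

```python
import torch

def linear_extrapolation(z: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    # Push each representation slightly away from another representation
    # randomly drawn from the same batch (illustrative formula).
    perm = torch.randperm(z.size(0), device=z.device)
    return z + alpha * (z - z[perm])

def binary_interpolation(z: torch.Tensor, p: float = 0.1) -> torch.Tensor:
    # Swap a random subset of dimensions with those of another in-batch
    # representation, chosen by a Bernoulli mask (illustrative formula).
    perm = torch.randperm(z.size(0), device=z.device)
    mask = (torch.rand_like(z) < p).float()
    return (1.0 - mask) * z + mask * z[perm]

def gaussian_scaling(z: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    # Rescale each dimension by multiplicative Gaussian noise centred at 1.
    return z * (1.0 + sigma * torch.randn_like(z))

# Because these operations act directly on encoder outputs, they require no
# code analysis during preprocessing and no extra encoder forward passes.
code_emb = torch.randn(32, 768)            # stand-in for an encoder's output
aug_code_emb = gaussian_scaling(code_emb)  # augmented view at the representation level
```

In a contrastive training loop, the augmented views would then be contrasted against query representations with the usual in-batch-negative objective (a sketch of that objective appears after the related-papers list below).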
Related papers
- From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models [63.188607839223046]
This survey focuses on the benefits of scaling compute during inference.
We explore three areas under a unified mathematical formalism: token-level generation algorithms, meta-generation algorithms, and efficient generation.
arXiv Detail & Related papers (2024-06-24T17:45:59Z)
- Enhancing Source Code Representations for Deep Learning with Static Analysis [10.222207222039048]
This paper explores the integration of static analysis and additional context such as bug reports and design patterns into source code representations for deep learning models.
We use the Abstract Syntax Tree-based Neural Network (ASTNN) method and augment it with additional context information obtained from bug reports and design patterns.
Our approach improves the representation and processing of source code, thereby improving task performance.
arXiv Detail & Related papers (2024-02-14T20:17:04Z)
- Back to Basics: A Simple Recipe for Improving Out-of-Domain Retrieval in Dense Encoders [63.28408887247742]
We study whether training procedures can be improved to yield better generalization capabilities in the resulting models.
We recommend a simple recipe for training dense encoders: train on MSMARCO with parameter-efficient methods such as LoRA, and use in-batch negatives unless well-constructed hard negatives are available (a sketch of the in-batch-negative objective appears after this list).
arXiv Detail & Related papers (2023-11-16T10:42:58Z)
- REINFOREST: Reinforcing Semantic Code Similarity for Cross-Lingual Code Search Models [11.78036105494679]
This paper introduces a novel code-to-code search technique that enhances the performance of Large Language Models (LLMs).
We present the first-ever code search method that encodes dynamic information during training without the need to execute either the corpus under search or the search query at inference time.
arXiv Detail & Related papers (2023-05-05T20:46:56Z)
- Boosting Source Code Learning with Data Augmentation: An Empirical Study [16.49710700412084]
We study whether data augmentation methods originally used for text and graphs are effective in improving the training quality of source code learning.
Our results identify the data augmentation methods that can produce more accurate and robust models for source code learning.
arXiv Detail & Related papers (2023-03-13T01:47:05Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- Data Augmentation for Opcode Sequence Based Malware Detection [2.335152769484957]
We study different methods of data augmentation, starting with basic methods that use fixed transformations and moving to methods that adapt to the data.
We propose a novel data augmentation method based on using an opcode embedding layer within the network and its corresponding opcode embedding matrix.
To the best of our knowledge, this is the first paper to carry out a systematic study of different augmentation methods applied to opcode-sequence-based malware classification.
arXiv Detail & Related papers (2021-06-22T14:36:35Z)
- A Transformer-based Approach for Source Code Summarization [86.08359401867577]
We learn code representations for summarization by modeling the pairwise relationships between code tokens.
We show that, despite its simplicity, the approach outperforms state-of-the-art techniques by a significant margin.
arXiv Detail & Related papers (2020-05-01T23:29:36Z)
- Reinforcement Learning with Augmented Data [97.42819506719191]
We present Reinforcement Learning with Augmented Data (RAD), a simple plug-and-play module that can enhance most RL algorithms.
We show that augmentations such as random translate, crop, color jitter, patch cutout, random convolutions, and amplitude scale can enable simple RL algorithms to outperform complex state-of-the-art methods.
arXiv Detail & Related papers (2020-04-30T17:35:32Z)
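As a companion to the dense-encoder recipe recommended in "Back to Basics" above, below is a minimal sketch of the in-batch-negative (InfoNCE-style) objective it refers to; the same objective is the standard contrastive setup for the code search models discussed in this paper. The temperature value and tensor shapes are illustrative assumptions, not details taken from any of the listed papers.

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(query_emb: torch.Tensor,
                           doc_emb: torch.Tensor,
                           temperature: float = 0.05) -> torch.Tensor:
    """The i-th query's positive is the i-th document; every other document
    in the batch acts as a negative. Embeddings are L2-normalised first."""
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    logits = query_emb @ doc_emb.t() / temperature  # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# Example: a batch of 16 query/document embedding pairs.
loss = in_batch_negative_loss(torch.randn(16, 768), torch.randn(16, 768))
```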