You Augment Me: Exploring ChatGPT-based Data Augmentation for Semantic Code Search
- URL: http://arxiv.org/abs/2408.05542v2
- Date: Sat, 17 Aug 2024 11:18:16 GMT
- Title: You Augment Me: Exploring ChatGPT-based Data Augmentation for Semantic Code Search
- Authors: Yanlin Wang, Lianghong Guo, Ensheng Shi, Wenqing Chen, Jiachi Chen, Wanjun Zhong, Menghan Wang, Hui Li, Hongyu Zhang, Ziyu Lyu, Zibin Zheng,
- Abstract summary: Code search plays a crucial role in software development, enabling developers to retrieve and reuse code using natural language queries.
Recently, large language models (LLMs) have made remarkable progress in both natural and programming language understanding and generation.
We propose a novel approach ChatDANCE, which utilizes high-quality and diverse augmented data generated by a large language model.
- Score: 47.54163552754051
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code search plays a crucial role in software development, enabling developers to retrieve and reuse code using natural language queries. While the performance of code search models improves with an increase in high-quality data, obtaining such data can be challenging and expensive. Recently, large language models (LLMs) such as ChatGPT have made remarkable progress in both natural and programming language understanding and generation, offering user-friendly interaction via simple prompts. Inspired by these advancements, we propose a novel approach ChatDANCE, which utilizes high-quality and diverse augmented data generated by a large language model and leverages a filtering mechanism to eliminate low-quality augmentations. Specifically, we first propose a set of ChatGPT prompting rules that are specifically designed for source code and queries. Then, we leverage ChatGPT to rewrite code and queries based on the according prompts and then propose a filtering mechanism which trains a cross-encoder from the backbone model UniXcoder to filter out code and query pairs with low matching scores. Finally, we re-train the backbone model using the obtained high-quality augmented data. Experimental results show that ChatDANCE achieves state-of-the-art performance, improving the best baseline by 13.2% (R@1) and 7% (MRR). Surprisingly, we find that this augment-filter-retrain strategy enables the backbone model (UniXcoder) to self-grow. Moreover, extensive experiments show the effectiveness of each component and ChatDANCE has stable performance under different hyperparameter settings. In addition, we conduct qualitative and quantitative analyses to investigate why ChatDANCE works well and find that it learns a more uniform distribution of representations and effectively aligns the code and query spaces.
Related papers
- CodeXEmbed: A Generalist Embedding Model Family for Multiligual and Multi-task Code Retrieval [103.116634967815]
We introduce CodeXEmbed, a family of large-scale code embedding models ranging from 400M to 7B parameters.
Our novel training pipeline unifies multiple programming languages and transforms various code-related tasks into a common retrieval framework.
Our 7B model sets a new state-of-the-art (SOTA) in code retrieval, outperforming the previous leading model, Voyage-Code, by over 20% on CoIR benchmark.
arXiv Detail & Related papers (2024-11-19T16:54:45Z) - LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z) - Contrastive Transformer Learning with Proximity Data Generation for
Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z) - Discriminating Human-authored from ChatGPT-Generated Code Via
Discernable Feature Analysis [2.9398911304923447]
This paper specifically aims to distinguish code generated by ChatGPT from that authored by humans.
We devise a dataset cleansing technique, which employs temporal and spatial segmentation, to mitigate the dearth of datasets.
To further enrich data resources, we employ "code transformation," "feature transformation," and "feature customization" techniques, generating an extensive dataset comprising 10,000 lines of ChatGPT-generated code.
arXiv Detail & Related papers (2023-06-26T03:15:06Z) - Enhancing Chat Language Models by Scaling High-quality Instructional
Conversations [91.98516412612739]
We first provide a systematically designed, diverse, informative, large-scale dataset of instructional conversations, UltraChat.
Our objective is to capture the breadth of interactions that a human might have with an AI assistant.
We fine-tune a LLaMA model to create a powerful conversational model, UltraLLaMA.
arXiv Detail & Related papers (2023-05-23T16:49:14Z) - Improving ChatGPT Prompt for Code Generation [13.303599826870705]
OpenAI's language model ChatGPT has emerged as a powerful tool for generating human-like responses to a wide range of textual inputs.
We evaluate ChatGPT's capabilities for two code generation tasks, including text-to-code and code-to-code generation.
Our results showed that by carefully designing prompts to guide ChatGPT, the generation performance can be improved substantially.
arXiv Detail & Related papers (2023-05-15T05:37:33Z) - Enhancing Semantic Code Search with Multimodal Contrastive Learning and
Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z) - GQE-PRF: Generative Query Expansion with Pseudo-Relevance Feedback [8.142861977776256]
We propose a novel approach which effectively integrates text generation models into PRF-based query expansion.
Our approach generates augmented query terms via neural text generation models conditioned on both the initial query and pseudo-relevance feedback.
We evaluate the performance of our approach on information retrieval tasks using two benchmark datasets.
arXiv Detail & Related papers (2021-08-13T01:09:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.