Related papers: CoRNStack: High-Quality Contrastive Data for Better Code Ranking

CoRNStack: High-Quality Contrastive Data for Better Code Ranking

URL: http://arxiv.org/abs/2412.01007v2
Date: Wed, 04 Dec 2024 20:01:42 GMT
Title: CoRNStack: High-Quality Contrastive Data for Better Code Ranking
Authors: Tarun Suresh, Revanth Gangi Reddy, Yifei Xu, Zach Nussbaum, Andriy Mulyar, Brandon Duderstadt, Heng Ji,
Abstract summary: We introduce CoRNStack, a large-scale, high-quality contrastive training dataset for code that spans multiple programming languages.<n>This dataset is curated using consistency filtering to eliminate noisy positives and is further enriched with mined hard negatives.<n>We demonstrate that contrastive training of embedding models using CoRNStack leads to state-of-the-art performance across a variety of code retrieval tasks.
Score: 45.18877655831977
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Effective code retrieval plays a crucial role in advancing code generation, bug fixing, and software maintenance, particularly as software systems increase in complexity. While current code embedding models have demonstrated promise in retrieving code snippets for small-scale, well-defined tasks, they often underperform in more demanding real-world applications such as bug localization within GitHub repositories. We hypothesize that a key issue is their reliance on noisy and inconsistent datasets for training, which impedes their ability to generalize to more complex retrieval scenarios. To address these limitations, we introduce CoRNStack, a large-scale, high-quality contrastive training dataset for code that spans multiple programming languages. This dataset is curated using consistency filtering to eliminate noisy positives and is further enriched with mined hard negatives, thereby facilitating more effective learning. We demonstrate that contrastive training of embedding models using CoRNStack leads to state-of-the-art performance across a variety of code retrieval tasks. Furthermore, the dataset can be leveraged for training code reranking models, a largely underexplored area compared to text reranking. Our finetuned code reranking model significantly improves the ranking quality over the retrieved results. Finally, by employing our code retriever and reranker together, we demonstrate significant improvements in function localization for GitHub issues, an important component of real-world software development.

Related papers

Knowledge Graph Based Repository-Level Code Generation [0.0]
This paper introduces a novel knowledge graph-based approach to improve code search and retrieval.<n>The proposed approach represents code repositories as graphs, capturing structural and relational information for enhanced context-aware code generation.<n>We benchmark the proposed approach on the Evolutionary Code Benchmark dataset, a repository-level code generation benchmark, and demonstrate that our method significantly outperforms the baseline approach.
arXiv Detail & Related papers (2025-05-20T14:13:59Z)
Towards A Generalist Code Embedding Model Based On Massive Data Synthesis [35.04242699869519]
We introduce textbfCodeR (underlineCode underlineRetrieval), a state-of-the-art embedding model for general-purpose code retrieval.<n>The superior performance of CodeR is built upon CodeR-Pile, a large-scale synthetic dataset constructed under the DRU principle.
arXiv Detail & Related papers (2025-05-19T04:37:53Z)
SweRank: Software Issue Localization with Code Ranking [109.3289316191729]
SweRank is an efficient retrieve-and-rerank framework for software issue localization.<n>We construct SweLoc, a large-scale dataset curated from public GitHub repositories.<n>We show that SweRank achieves state-of-the-art performance, outperforming both prior ranking models and costly agent-based systems.
arXiv Detail & Related papers (2025-05-07T19:44:09Z)
UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. We introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to guide and validate the code generation process. Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora.
arXiv Detail & Related papers (2025-02-17T05:37:02Z)
SnipGen: A Mining Repository Framework for Evaluating LLMs for Code [51.07471575337676]
Language Models (LLMs) are trained on extensive datasets that include code repositories. evaluating their effectiveness poses significant challenges due to the potential overlap between the datasets used for training and those employed for evaluation. We introduce SnipGen, a comprehensive repository mining framework designed to leverage prompt engineering across various downstream tasks for code generation.
arXiv Detail & Related papers (2025-02-10T21:28:15Z)
Contextualized Data-Wrangling Code Generation in Computational Notebooks [131.26365849822932]
We propose an automated approach, CoCoMine, to mine data-wrangling code generation examples with clear multi-modal contextual dependency. We construct CoCoNote, a dataset containing 58,221 examples for Contextualized Data-wrangling Code generation in Notebooks. Experiment results demonstrate the significance of incorporating data context in data-wrangling code generation.
arXiv Detail & Related papers (2024-09-20T14:49:51Z)
An Empirical Study on Self-correcting Large Language Models for Data Science Code Generation [1.335664823620186]
Large Language Models (LLMs) have recently advanced many applications on software engineering tasks. CoT-SelfEvolve iteratively and automatically refines code through a self-correcting process.
arXiv Detail & Related papers (2024-08-28T09:19:09Z)
Code Less, Align More: Efficient LLM Fine-tuning for Code Generation with Data Pruning [4.975728472540823]
We present techniques that integrate various clustering and pruning metrics to selectively reduce training data without compromising the accuracy and functionality of the generated code. Our experiments show that these pruning strategies not only reduce the computational resources needed but also enhance the overall quality code generation.
arXiv Detail & Related papers (2024-07-06T10:30:43Z)
LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system. We build a novel data-cleaning pipeline that uses these principles to transform existing programs. We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z)
Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search. We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph. It is updated by decoding in the context of an auto-encoder. Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
arXiv Detail & Related papers (2020-02-27T05:58:12Z)
Leveraging Code Generation to Improve Code Retrieval and Summarization via Dual Learning [18.354352985591305]
Code summarization generates brief natural language description given a source code snippet, while code retrieval fetches relevant source code given a natural language query. Recent studies have combined these two tasks to improve their performance. We propose a novel end-to-end model for the two tasks by introducing an additional code generation task.
arXiv Detail & Related papers (2020-02-24T12:26:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.