SACL: Understanding and Combating Textual Bias in Code Retrieval with Semantic-Augmented Reranking and Localization
- URL: http://arxiv.org/abs/2506.20081v2
- Date: Thu, 26 Jun 2025 04:06:50 GMT
- Title: SACL: Understanding and Combating Textual Bias in Code Retrieval with Semantic-Augmented Reranking and Localization
- Authors: Dhruv Gupta, Gayathri Ganesh Lakshmy, Yiqing Xie
- Abstract summary: Retrieval-Augmented Code Generation (RACG) is a critical technique for enhancing code generation by retrieving relevant information. In this work, we conduct an in-depth analysis of code retrieval by systematically masking specific features while preserving code functionality. We propose SACL, a framework that enriches textual information and reduces bias by augmenting code or structural knowledge with semantic information.
- Score: 7.4705664339075115
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Retrieval-Augmented Code Generation (RACG) is a critical technique for enhancing code generation by retrieving relevant information. In this work, we conduct an in-depth analysis of code retrieval by systematically masking specific features while preserving code functionality. Our discoveries include: (1) although trained on code, current retrievers heavily rely on surface-level textual features (e.g., docstrings, identifier names), and (2) they exhibit a strong bias towards well-documented code, even if the documentation is irrelevant. Based on our discoveries, we propose SACL, a framework that enriches textual information and reduces bias by augmenting code or structural knowledge with semantic information. Extensive experiments show that SACL substantially improves code retrieval (e.g., by 12.8% / 9.4% / 7.0% Recall@1 on HumanEval / MBPP / SWE-Bench-Lite), which also leads to better code generation performance (e.g., by 4.88% Pass@1 on HumanEval).
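To make the masking analysis concrete, below is a minimal sketch of how surface-level textual features (docstrings, identifier names) can be removed while preserving functionality, using Python's `ast` module. The masking rules, class name, and example are illustrative assumptions, not the authors' exact procedure.

```python
import ast

class SurfaceMasker(ast.NodeTransformer):
    """Strips docstrings and renames identifiers to opaque tokens (sketch)."""

    def __init__(self):
        self.mapping = {}

    def _mask(self, name):
        return self.mapping.setdefault(name, f"v{len(self.mapping)}")

    def visit_FunctionDef(self, node):
        node.name = self._mask(node.name)
        # Drop a leading docstring if present.
        body = node.body
        if (body and isinstance(body[0], ast.Expr)
                and isinstance(body[0].value, ast.Constant)
                and isinstance(body[0].value.value, str)):
            node.body = body[1:] or [ast.Pass()]
        for arg in node.args.args:
            arg.arg = self._mask(arg.arg)
        self.generic_visit(node)
        return node

    def visit_Name(self, node):
        if isinstance(node.ctx, ast.Store):
            node.id = self._mask(node.id)
        elif node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node

def mask_surface_features(source: str) -> str:
    """Return functionally equivalent code with textual cues removed."""
    tree = SurfaceMasker().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+

original = '''
def binary_search(items, target):
    """Return the index of target in sorted items, or -1."""
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
'''
# Comparing retriever scores on `original` vs. the masked variant measures
# how much ranking depends on surface text rather than functionality.
print(mask_surface_features(original))
```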
Related papers
- Retrieval-Augmented Code Review Comment Generation [0.0]
Automated code review comment generation (RCG) aims to assist developers by automatically producing natural language feedback for code changes. Existing approaches are primarily either generation-based, using pretrained language models, or information retrieval-based (IR), reusing comments from similar past examples. This work proposes to leverage retrieval-augmented generation (RAG) for RCG by conditioning pretrained language models on retrieved code-review exemplars.
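As a rough illustration of that retrieval-augmented setup, the sketch below retrieves similar past (diff, comment) pairs with a toy lexical scorer and prepends them to a generation prompt. The corpus, scorer, and prompt format are assumptions, not the paper's implementation.

```python
from collections import Counter
from math import sqrt

def tokenize(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Tiny corpus of past review exemplars: (diff, comment) pairs.
corpus = [
    ("+ for i in range(len(xs)): total += xs[i]", "Prefer sum(xs) over a manual loop."),
    ("+ except Exception: pass", "Avoid swallowing all exceptions silently."),
    ("+ password = 'hunter2'", "Do not hard-code credentials."),
]

def build_prompt(new_diff: str, k: int = 2) -> str:
    # Rank exemplars by similarity to the new diff and use the top k as shots.
    ranked = sorted(corpus, reverse=True,
                    key=lambda ex: cosine(tokenize(ex[0]), tokenize(new_diff)))
    shots = "\n".join(f"Diff: {d}\nReview: {c}" for d, c in ranked[:k])
    return f"{shots}\nDiff: {new_diff}\nReview:"  # fed to a pretrained LM

print(build_prompt("+ result = 0\n+ for i in range(len(nums)): result += nums[i]"))
```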
arXiv Detail & Related papers (2025-06-13T08:58:20Z)
- Code-Craft: Hierarchical Graph-Based Code Summarization for Enhanced Context Retrieval [0.0]
We present Hierarchical Code Graph Summarization (HCGS), a novel approach that constructs a multi-layered representation of a codebase by generating structured summaries in a bottom-up fashion from a code graph. HCGS consistently outperforms traditional code-only retrieval across all metrics.
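The sketch below shows the bottom-up traversal this kind of approach implies: leaves of the call graph are summarized first, and each caller's summary is conditioned on its callees' summaries. The `summarize()` stub stands in for an LLM call; the graph and naming are illustrative, not HCGS itself.

```python
from graphlib import TopologicalSorter

# caller -> callees; callees are dependencies, so they are summarized first.
call_graph = {
    "parse_config": [],
    "load_data": ["parse_config"],
    "train": ["load_data"],
}

def summarize(fn: str, child_summaries: list[str]) -> str:
    # Placeholder for a real LLM summary conditioned on child summaries.
    context = "; ".join(child_summaries) or "no callees"
    return f"{fn} (uses: {context})"

summaries: dict[str, str] = {}
for fn in TopologicalSorter(call_graph).static_order():
    summaries[fn] = summarize(fn, [summaries[c] for c in call_graph[fn]])

print(summaries["train"])
```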
arXiv Detail & Related papers (2025-04-11T20:57:27Z)
- What to Retrieve for Effective Retrieval-Augmented Code Generation? An Empirical Study and Beyond [32.467437657603604]
Repository-level code generation remains challenging due to complex code dependencies and the limitations of large language models (LLMs) in processing long contexts. We propose AllianceCoder, a novel context-integrated method that employs chain-of-thought prompting to decompose user queries into implementation steps and retrieves APIs via semantic description matching. Through extensive experiments on CoderEval and RepoExec, AllianceCoder achieves state-of-the-art performance, improving Pass@1 by up to 20% over existing approaches.
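A loose sketch of the retrieval half of that pipeline: a query is decomposed into implementation steps (hard-coded here, standing in for chain-of-thought output), and each step is matched against natural-language API descriptions. The Jaccard-overlap scoring is a toy stand-in for the paper's semantic matching; all names are hypothetical.

```python
api_descriptions = {
    "http_client.get": "send an HTTP GET request and return the response",
    "json.loads": "parse a JSON string into Python objects",
    "cache.set": "store a value in the cache under a key",
}

# Pretend chain-of-thought output: implementation steps for a user query.
steps = [
    "send a GET request to the endpoint",
    "parse the JSON body of the response",
]

def overlap(a: str, b: str) -> float:
    # Jaccard word overlap as a cheap proxy for semantic similarity.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

for step in steps:
    api = max(api_descriptions, key=lambda k: overlap(step, api_descriptions[k]))
    print(f"{step!r} -> {api}")
```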
arXiv Detail & Related papers (2025-03-26T14:41:38Z)
- CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation. We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks. We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z)
- FoC: Figure out the Cryptographic Functions in Stripped Binaries with LLMs [51.898805184427545]
We propose a novel framework called FoC to Figure out the Cryptographic functions in stripped binaries. We first build a binary large language model (FoC-BinLLM) to summarize the semantics of cryptographic functions in natural language. We then build a binary code similarity model (FoC-Sim) upon the FoC-BinLLM to create change-sensitive representations and use it to retrieve similar implementations of unknown cryptographic functions in a database.
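Schematically, the retrieval step described above is nearest-neighbor search over function embeddings. In this sketch, the embedding function is a deterministic placeholder, not FoC-Sim; the database entries and query bytes are made up.

```python
import hashlib
import numpy as np

def embed(function_bytes: bytes) -> np.ndarray:
    # Placeholder: a deterministic pseudo-random unit vector keyed on the
    # bytes; a real system would run the trained similarity model here.
    seed = int.from_bytes(hashlib.sha256(function_bytes).digest()[:4], "big")
    v = np.random.default_rng(seed).standard_normal(64)
    return v / np.linalg.norm(v)

# Known implementations indexed by their (placeholder) embeddings.
database = {name: embed(name.encode())
            for name in ["aes_encrypt", "sha256_update", "rc4_stream"]}

def retrieve(query_bytes: bytes, k: int = 2):
    q = embed(query_bytes)
    scores = {name: float(q @ v) for name, v in database.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

print(retrieve(b"\x55\x48\x89\xe5"))  # bytes of an unknown stripped function
```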
arXiv Detail & Related papers (2024-03-27T09:45:33Z)
- EVOR: Evolving Retrieval for Code Generation [17.46870626157077]
Existing pipelines for retrieval-augmented code generation employ static knowledge bases with a single source. We develop a novel pipeline, EVOR, that employs the synchronous evolution of both queries and diverse knowledge bases.
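A high-level sketch of such an evolving-retrieval loop: each round, the query is rewritten using what was retrieved so far, and the knowledge base grows with new evidence. Every component here is a stub (the control flow is the point); none of it is EVOR's actual implementation.

```python
def retrieve(query: str, kb: list[str]) -> list[str]:
    # Toy retriever: any document sharing a query word counts as a hit.
    return [doc for doc in kb if any(w in doc for w in query.split())]

def rewrite_query(query: str, evidence: list[str]) -> str:
    return query + " " + " ".join(evidence)  # stand-in for an LLM rewrite

def gather_new_knowledge(evidence: list[str]) -> list[str]:
    return [f"note: saw {len(evidence)} hits"]  # e.g., execution feedback

kb = ["pandas merge joins dataframes", "numpy argsort returns indices"]
query = "merge two dataframes"
for _ in range(3):  # a few co-evolution rounds
    evidence = retrieve(query, kb)
    kb.extend(gather_new_knowledge(evidence))  # knowledge base evolves
    query = rewrite_query(query, evidence)     # query evolves
print(query)
```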
arXiv Detail & Related papers (2024-02-19T17:37:28Z)
- Rewriting the Code: A Simple Method for Large Language Model Augmented Code Search [7.822427053078387]
The Generation-Augmented Retrieval (GAR) framework generates exemplar code snippets to augment queries.
We propose a simple yet effective method that additionally Rewrites the Code (ReCo) within the GAR framework for style normalization.
Code Style Similarity is the first metric tailored to quantify stylistic similarities in code.
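The sketch below illustrates the idea: rewrite retrieved code into a normalized style before matching it against the query side. The rewriter stub and the use of difflib as a stand-in similarity are assumptions; the paper defines its own Code Style Similarity metric.

```python
import difflib

def rewrite_style(code: str) -> str:
    # Stand-in for an LLM rewrite into a canonical style.
    return code.replace("i = i + 1", "i += 1")

def style_similarity(a: str, b: str) -> float:
    # Toy proxy metric; not the paper's Code Style Similarity.
    return difflib.SequenceMatcher(None, a, b).ratio()

query_code = "while i < n:\n    total += xs[i]\n    i += 1"
candidate = "while i < n:\n    total += xs[i]\n    i = i + 1"
print(style_similarity(query_code, candidate))                 # before rewrite
print(style_similarity(query_code, rewrite_style(candidate)))  # after rewrite
```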
arXiv Detail & Related papers (2024-01-09T12:12:50Z)
- Soft-Labeled Contrastive Pre-training for Function-level Code Representation [127.71430696347174]
We present SCodeR, a Soft-labeled contrastive pre-training framework with two positive sample construction methods.
By considering the relevance between code snippets in a large-scale code corpus, soft-labeled contrastive pre-training can obtain fine-grained soft labels.
SCodeR achieves new state-of-the-art performance on four code-related tasks over seven datasets.
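As a toy illustration of soft-labeled contrastive training: instead of a one-hot positive, the target over candidates is a soft distribution. The label values here are hand-set (the paper derives them from corpus-level relevance), and the shapes are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

query = np.array([0.2, 0.9, 0.1])          # query code embedding
candidates = np.array([
    [0.25, 0.85, 0.1],    # near-duplicate of the positive
    [0.2, 0.7, 0.3],      # related code
    [-0.9, 0.1, 0.5],     # unrelated code
])

sims = candidates @ query                   # similarity logits
soft_labels = np.array([0.7, 0.25, 0.05])   # fine-grained relevance, not one-hot
loss = -(soft_labels * np.log(softmax(sims))).sum()  # soft cross-entropy
print(round(float(loss), 4))
```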
arXiv Detail & Related papers (2022-10-18T05:17:37Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework that leverages both lexical copying and retrieval of code with similar semantics.
We evaluate our approach on the code completion task in the Python and Java programming languages, achieving state-of-the-art performance on the CodeXGLUE benchmark.
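A simplified sketch of hybrid retrieval in that spirit: a lexical score (supporting verbatim copying) combined with a semantic score (embedding similarity). Both scorers are toy stand-ins for trained components, and the mixing weight is an assumption.

```python
def lexical_score(query: str, doc: str) -> float:
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d) if q | d else 0.0

VOCAB = ["sort", "order", "reverse", "list", "items", "file", "read"]

def embed(text: str) -> list[float]:
    # Placeholder for a trained code encoder: a bag-of-words vector over a
    # tiny fixed vocabulary, used only to keep the example self-contained.
    words = text.split()
    return [float(words.count(w)) for w in VOCAB]

def semantic_score(query: str, doc: str) -> float:
    qv, dv = embed(query), embed(doc)
    dot = sum(a * b for a, b in zip(qv, dv))
    nq = sum(a * a for a in qv) ** 0.5
    nd = sum(b * b for b in dv) ** 0.5
    return dot / (nq * nd) if nq and nd else 0.0

def hybrid_score(query: str, doc: str, alpha: float = 0.5) -> float:
    # alpha balances verbatim copying (lexical) against semantic match.
    return alpha * lexical_score(query, doc) + (1 - alpha) * semantic_score(query, doc)

snippets = ["sort items in reverse order", "read lines from a file"]
query = "reverse sort a list of items"
print(max(snippets, key=lambda s: hybrid_score(query, s)))
```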
arXiv Detail & Related papers (2022-03-15T08:25:08Z)
- COSEA: Convolutional Code Search with Layer-wise Attention [90.35777733464354]
We propose a new deep learning architecture, COSEA, which leverages convolutional neural networks with layer-wise attention to capture the code's intrinsic structural logic.
COSEA can achieve significant improvements over state-of-the-art methods on code search tasks.
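A rough PyTorch sketch of the architectural idea named above: stacked convolutions over token embeddings, with attention weights computed per layer and used to pool a code representation. The hyperparameters and exact attention form are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ConvLayerwiseAttention(nn.Module):
    def __init__(self, vocab=1000, dim=64, layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1) for _ in range(layers)
        )
        self.attn = nn.ModuleList(nn.Linear(dim, 1) for _ in range(layers))

    def forward(self, tokens):                   # tokens: (batch, seq)
        x = self.embed(tokens).transpose(1, 2)   # (batch, dim, seq)
        pooled = []
        for conv, attn in zip(self.convs, self.attn):
            x = torch.relu(conv(x))
            h = x.transpose(1, 2)                # (batch, seq, dim)
            w = torch.softmax(attn(h), dim=1)    # per-layer attention over tokens
            pooled.append((w * h).sum(dim=1))    # (batch, dim)
        return torch.stack(pooled).mean(dim=0)   # combine layer representations

code_vec = ConvLayerwiseAttention()(torch.randint(0, 1000, (2, 16)))
print(code_vec.shape)  # torch.Size([2, 64])
```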
arXiv Detail & Related papers (2020-10-19T13:53:38Z)