Which Programming Language and Model Work Best With LLM-as-a-Judge For Code Retrieval?
- URL: http://arxiv.org/abs/2510.00324v1
- Date: Tue, 30 Sep 2025 22:31:36 GMT
- Title: Which Programming Language and Model Work Best With LLM-as-a-Judge For Code Retrieval?
- Authors: Lucas Roberts, Denisa Roberts,
- Abstract summary: Benefits of better code search include faster new developer on-boarding, reduced software maintenance, and ease of understanding for large repositories.<n>Despite improvements in search algorithms and search benchmarks, the domain of code search has lagged behind.<n>We study the use of Large Language Models (LLMs) to retrieve code at the level of functions and to generate annotations for code search results.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code search is an important information retrieval application. Benefits of better code search include faster new developer on-boarding, reduced software maintenance, and ease of understanding for large repositories. Despite improvements in search algorithms and search benchmarks, the domain of code search has lagged behind. One reason is the high cost of human annotation for code queries and answers. While humans may annotate search results in general text QA systems, code annotations require specialized knowledge of a programming language (PL), as well as domain specific software engineering knowledge. In this work we study the use of Large Language Models (LLMs) to retrieve code at the level of functions and to generate annotations for code search results. We compare the impact of the retriever representation (sparse vs. semantic), programming language, and LLM by comparing human annotations across several popular languages (C, Java, Javascript, Go, and Python). We focus on repositories that implement common data structures likely to be implemented in any PLs. For the same human annotations, we compare several LLM-as-a-Judge models to evaluate programming language and other affinities between LLMs. We find that the chosen retriever and PL exhibit affinities that can be leveraged to improve alignment of human and AI relevance determinations, with significant performance implications. We also find differences in representation (sparse vs. semantic) across PLs that impact alignment of human and AI relevance determinations. We propose using transpilers to bootstrap scalable code search benchmark datasets in other PLs and in a case study demonstrate that human-AI relevance agreement rates largely match the (worst case) human-human agreement under study. The application code used in this work is available at \href{https://github.com/rlucas7/code-searcher/}{this github repo}.
Related papers
- AlignCoder: Aligning Retrieval with Target Intent for Repository-Level Code Completion [55.21541958868449]
We propose AlignCoder, a repository-level code completion framework.<n>Our framework generates an enhanced query that bridges the semantic gap between the initial query and the target code.<n>We employ reinforcement learning to train an AlignRetriever that learns to leverage inference information in the enhanced query for more accurate retrieval.
arXiv Detail & Related papers (2026-01-27T15:23:14Z) - Semantic Source Code Segmentation using Small and Large Language Models [2.5748316361772963]
This paper introduces an automated, domain-specific approach for research R code segmentation using Large and Small Language Models (LLMs/SLMs)<n>We explore two distinct approaches: line-by-line analysis with context and range-based segment determination.<n>Our results show that context-based line-by-line analysis is superior over range-based segmentation.
arXiv Detail & Related papers (2025-07-11T19:49:59Z) - SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors [7.210032327838313]
We investigate whether large language models (LLMs) can serve as surrogate models for code execution prediction.<n>Through extensive analysis of $21$ open-source and proprietary LLMs, we examine scaling laws, data efficiency, and predictive accuracy.<n>Our findings reveal important insights about the feasibility of LLMs as efficient surrogates for computational processes.
arXiv Detail & Related papers (2025-02-16T15:38:19Z) - Evaluating and Aligning CodeLLMs on Human Preference [42.26173776584043]
We present a rigorous human-curated benchmark CodeArena to emulate the complexity and diversity of real-world coding tasks.<n>We also propose a diverse synthetic instruction corpus SynCode-Instruct to verify the effectiveness of the large-scale synthetic instruction fine-tuning.<n>The results find performance differences between execution-based benchmarks and CodeArena.
arXiv Detail & Related papers (2024-12-06T17:40:38Z) - CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution [50.1875460416205]
The CRUXEVAL-X code reasoning benchmark contains 19 programming languages.<n>It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total.<n>Even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages.
arXiv Detail & Related papers (2024-08-23T11:43:00Z) - Bug In the Code Stack: Can LLMs Find Bugs in Large Python Code Stacks [1.3586572110652484]
This study explores the capabilities of Large Language Models (LLMs) in retrieving contextual information from large text documents.
Our benchmark, Bug In The Code Stack (BICS), is designed to assess the ability of LLMs to identify simple syntax bugs within large source code.
Our findings reveal three key insights: (1) code-based environments pose significantly more challenge compared to text-based environments for retrieval tasks, (2) there is a substantial performance disparity among different models, and (3) there is a notable correlation between longer context lengths and performance degradation.
arXiv Detail & Related papers (2024-06-21T17:37:10Z) - Enhancing Semantic Code Search with Multimodal Contrastive Learning and
Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z) - ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval.
We evaluate our approach in the code completion task in Python and Java programming languages, achieving a state-of-the-art performance on CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z) - Learning Deep Semantic Model for Code Search using CodeSearchNet Corpus [17.6095840480926]
We propose a novel deep semantic model which makes use of the utilities of multi-modal sources.
We apply the proposed model to tackle the CodeSearchNet challenge about semantic code search.
Our model is trained on CodeSearchNet corpus and evaluated on the held-out data, the final model achieves 0.384 NDCG and won the first place in this benchmark.
arXiv Detail & Related papers (2022-01-27T04:15:59Z) - BERT2Code: Can Pretrained Language Models be Leveraged for Code Search? [0.7953229555481884]
We show that our model learns the inherent relationship between the embedding spaces and further probes into the scope of improvement.
In this analysis, we show that the quality of the code embedding model is the bottleneck for our model's performance.
arXiv Detail & Related papers (2021-04-16T10:28:27Z) - Deep Graph Matching and Searching for Semantic Code Retrieval [76.51445515611469]
We propose an end-to-end deep graph matching and searching model based on graph neural networks.
We first represent both natural language query texts and programming language code snippets with the unified graph-structured data.
In particular, DGMS not only captures more structural information for individual query texts or code snippets but also learns the fine-grained similarity between them.
arXiv Detail & Related papers (2020-10-24T14:16:50Z) - COSEA: Convolutional Code Search with Layer-wise Attention [90.35777733464354]
We propose a new deep learning architecture, COSEA, which leverages convolutional neural networks with layer-wise attention to capture the code's intrinsic structural logic.
COSEA can achieve significant improvements over state-of-the-art methods on code search tasks.
arXiv Detail & Related papers (2020-10-19T13:53:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.