Retrieval-Augmented Code Generation: A Survey with Focus on Repository-Level Approaches
- URL: http://arxiv.org/abs/2510.04905v1
- Date: Mon, 06 Oct 2025 15:20:03 GMT
- Title: Retrieval-Augmented Code Generation: A Survey with Focus on Repository-Level Approaches
- Authors: Yicheng Tao, Yao Qin, Yepang Liu,
- Abstract summary: Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm that integrates external retrieval mechanisms with LLMs.<n>We provide a comprehensive review of research on Retrieval-Augmented Code Generation (RACG), with an emphasis on repository-level approaches.<n>Our goal is to establish a unified analytical framework for understanding this rapidly evolving field and to inspire continued progress in AI-powered software engineering.
- Score: 6.740646039135986
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in large language models (LLMs) have substantially improved automated code generation. While function-level and file-level generation have achieved promising results, real-world software development typically requires reasoning across entire repositories. This gives rise to the challenging task of Repository-Level Code Generation (RLCG), where models must capture long-range dependencies, ensure global semantic consistency, and generate coherent code spanning multiple files or modules. To address these challenges, Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm that integrates external retrieval mechanisms with LLMs, enhancing context-awareness and scalability. In this survey, we provide a comprehensive review of research on Retrieval-Augmented Code Generation (RACG), with an emphasis on repository-level approaches. We categorize existing work along several dimensions, including generation strategies, retrieval modalities, model architectures, training paradigms, and evaluation protocols. Furthermore, we summarize widely used datasets and benchmarks, analyze current limitations, and outline key challenges and opportunities for future research. Our goal is to establish a unified analytical framework for understanding this rapidly evolving field and to inspire continued progress in AI-powered software engineering.
Related papers
- Advances and Frontiers of LLM-based Issue Resolution in Software Engineering: A Comprehensive Survey [59.3507264893654]
Issue resolution is a complex Software Engineering task integral to real-world development.<n> benchmarks like SWE-bench revealed this task as profoundly difficult for large language models.<n>This paper presents a systematic survey of this emerging domain.
arXiv Detail & Related papers (2026-01-15T18:55:03Z) - Deep Research: A Systematic Survey [118.82795024422722]
Deep Research (DR) aims to combine the reasoning capabilities of large language models with external tools, such as search engines.<n>This survey presents a comprehensive and systematic overview of deep research systems.
arXiv Detail & Related papers (2025-11-24T15:28:28Z) - Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding [61.36285696607487]
Document understanding is critical for applications from financial analysis to scientific discovery.<n>Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs) face key limitations.<n>Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG.
arXiv Detail & Related papers (2025-10-17T02:33:16Z) - Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers [103.4410890572479]
We introduce the Loong Project: an open-source framework for scalable synthetic data generation and verification.<n>LoongBench is a curated seed dataset containing 8,729 human-vetted examples across 12 domains.<n>LoongEnv is a modular synthetic data generation environment that supports multiple prompting strategies to produce new question-answer-code triples.
arXiv Detail & Related papers (2025-09-03T06:42:40Z) - Towards Repository-Level Program Verification with Large Language Models [8.05666536952624]
scaling automated formal verification to real-world projects requires resolving cross-module dependencies and global contexts.<n>We introduce RVBench, the first verification benchmark explicitly designed for repository-level evaluation, constructed from four diverse and complex open-source Verus projects.<n>RagedVerus is a framework that synergizes retrieval-augmented generation with context-aware prompting to automate proof for multi-module repositories.
arXiv Detail & Related papers (2025-08-31T02:44:04Z) - A Survey on Code Generation with LLM-based Agents [61.474191493322415]
Code generation agents powered by large language models (LLMs) are revolutionizing the software development paradigm.<n>LLMs are characterized by three core features.<n>This paper presents a systematic survey of the field of LLM-based code generation agents.
arXiv Detail & Related papers (2025-07-31T18:17:36Z) - Deep Research Agents: A Systematic Examination And Roadmap [109.53237992384872]
Deep Research (DR) agents are designed to tackle complex, multi-turn informational research tasks.<n>In this paper, we conduct a detailed analysis of the foundational technologies and architectural components that constitute DR agents.
arXiv Detail & Related papers (2025-06-22T16:52:48Z) - Retrieval-Augmented Generation: A Comprehensive Survey of Architectures, Enhancements, and Robustness Frontiers [0.0]
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm to enhance large language models.<n>RAG introduces new challenges in retrieval quality, grounding fidelity, pipeline efficiency, and robustness against noisy or adversarial inputs.<n>This survey aims to consolidate current knowledge in RAG research and serve as a foundation for the next generation of retrieval-augmented language modeling systems.
arXiv Detail & Related papers (2025-05-28T22:57:04Z) - A Survey on Knowledge-Oriented Retrieval-Augmented Generation [45.65542434522205]
Retrieval-Augmented Generation (RAG) has gained significant attention in recent years.<n>RAG combines large-scale retrieval systems with generative models.<n>We discuss the key characteristics of RAG, such as its ability to augment generative models with dynamic external knowledge.
arXiv Detail & Related papers (2025-03-11T01:59:35Z) - Retrieval-Enhanced Machine Learning: Synthesis and Opportunities [60.34182805429511]
Retrieval-enhancement can be extended to a broader spectrum of machine learning (ML)
This work introduces a formal framework of this paradigm, Retrieval-Enhanced Machine Learning (REML), by synthesizing the literature in various domains in ML with consistent notations which is missing from the current literature.
The goal of this work is to equip researchers across various disciplines with a comprehensive, formally structured framework of retrieval-enhanced models, thereby fostering interdisciplinary future research.
arXiv Detail & Related papers (2024-07-17T20:01:21Z) - A Survey on Retrieval-Augmented Text Generation for Large Language Models [1.4579344926652844]
Retrieval-Augmented Generation (RAG) merges retrieval methods with deep learning advancements.
This paper organizes the RAG paradigm into four categories: pre-retrieval, retrieval, post-retrieval, and generation.
It outlines RAG's evolution and discusses the field's progression through the analysis of significant studies.
arXiv Detail & Related papers (2024-04-17T01:27:42Z) - Retrieval-Augmented Generation for Large Language Models: A Survey [17.82361213043507]
Large Language Models (LLMs) showcase impressive capabilities but encounter challenges like hallucination.
Retrieval-Augmented Generation (RAG) has emerged as a promising solution by incorporating knowledge from external databases.
arXiv Detail & Related papers (2023-12-18T07:47:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.