CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases
- URL: http://arxiv.org/abs/2510.24428v2
- Date: Thu, 30 Oct 2025 01:38:03 GMT
- Title: CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases
- Authors: Anh Nguyen Hoang, Minh Le-Anh, Bach Le, Nghi D. Q. Bui
- Abstract summary: We present CodeWiki, a unified framework for automated repository-level documentation across seven programming languages. CodeWiki introduces three key innovations: (i) hierarchical decomposition that preserves architectural context across multiple levels of granularity, (ii) recursive multi-agent processing with dynamic task delegation for scalable generation, and (iii) multi-modal synthesis that integrates textual descriptions with visual artifacts such as architecture diagrams and data-flow representations. CodeWiki achieves a 68.79% quality score with proprietary models, outperforming the closed-source DeepWiki baseline (64.06%) by 4.73 percentage points.
- Score: 7.75137961900221
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given a large and evolving codebase, the ability to automatically generate holistic, architecture-aware documentation that captures not only individual functions but also cross-file, cross-module, and system-level interactions remains an open challenge. Comprehensive documentation is essential for long-term software maintenance and collaboration, yet current automated approaches still fail to model the rich semantic dependencies and architectural structures that define real-world software systems. We present CodeWiki, a unified framework for automated repository-level documentation across seven programming languages. CodeWiki introduces three key innovations: (i) hierarchical decomposition that preserves architectural context across multiple levels of granularity, (ii) recursive multi-agent processing with dynamic task delegation for scalable generation, and (iii) multi-modal synthesis that integrates textual descriptions with visual artifacts such as architecture diagrams and data-flow representations. To enable rigorous evaluation, we introduce CodeWikiBench, a comprehensive benchmark featuring multi-dimensional rubrics and LLM-based assessment protocols. Experimental results show that CodeWiki achieves a 68.79% quality score with proprietary models, outperforming the closed-source DeepWiki baseline (64.06%) by 4.73 percentage points, with particularly strong improvements on high-level scripting languages (+10.47%). We open-source CodeWiki to foster future research and community adoption.
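The hierarchical decomposition and recursive delegation described in the abstract can be sketched as follows. This is a minimal illustration, not CodeWiki's actual implementation: the `Module` structure, the size threshold, and the `summarize` callback (a stand-in for an LLM call) are all illustrative assumptions.

```python
# Hypothetical sketch of recursive, hierarchical documentation generation:
# split a repository into a module tree, document leaves directly, and
# synthesize parent documentation from child summaries so architectural
# context is preserved at every level of granularity.
from dataclasses import dataclass, field

@dataclass
class Module:
    name: str
    loc: int                      # lines of code owned by this module
    children: list = field(default_factory=list)

MAX_LOC = 500  # illustrative threshold for documenting a unit directly

def document(module: Module, summarize) -> str:
    """Recursively document a module tree.

    `summarize` stands in for an LLM call; in a multi-agent setting the
    child calls below could be delegated to worker agents, while here
    they simply recurse sequentially.
    """
    if module.loc <= MAX_LOC and not module.children:
        return summarize(f"Document leaf module {module.name}")
    child_docs = [document(c, summarize) for c in module.children]
    # Parent docs are built from child summaries, not raw source,
    # which keeps the prompt size bounded at each level.
    return summarize(
        f"Synthesize architecture doc for {module.name} from: "
        + " | ".join(child_docs)
    )
```

With a real model behind `summarize`, the top-level call would yield a repository overview grounded in the per-module summaries beneath it.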
Related papers
- Architecture-Aware Multi-Design Generation for Repository-Level Feature Addition [53.50448142467294]
RAIM is a multi-design and architecture-aware framework for repository-level feature addition. It shifts away from linear patching by generating multiple diverse implementation designs. Experiments on the NoCode-bench Verified dataset demonstrate that RAIM establishes a new state-of-the-art performance.
arXiv Detail & Related papers (2026-03-02T12:50:40Z) - Integrating Code Metrics into Automated Documentation Generation for Computational Notebooks [0.18665975431697424]
This paper investigates the role of source code metrics as auxiliary signals for automated documentation generation. It focuses on computational notebooks, a popular medium among data scientists that integrates code, narrative, and results but suffers from inconsistent documentation. Results show that incorporating code metrics improves the accuracy and contextual relevance of generated documentation.
arXiv Detail & Related papers (2026-02-08T21:40:57Z) - SpecMap: Hierarchical LLM Agent for Datasheet-to-Code Traceability Link Recovery in Systems Engineering [8.235446273226277]
Traceability between embedded systems' datasheets and their corresponding code implementations is a fundamental challenge in systems engineering. Existing Traceability Link Recovery approaches rely on lexical similarity and information retrieval techniques. We present a hierarchical datasheet-to-code mapping methodology that employs large language models for semantic analysis.
arXiv Detail & Related papers (2026-01-16T11:50:18Z) - Completion by Comprehension: Guiding Code Generation with Multi-Granularity Understanding [37.78627994991325]
CoCo is a novel framework that enables code Completion by Comprehension of multi-granularity context from large-scale code repositories. Experiments on CrossCodeEval and RepoEval benchmarks demonstrate that CoCo consistently surpasses state-of-the-art baselines.
arXiv Detail & Related papers (2025-12-04T07:37:59Z) - Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding [61.36285696607487]
Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations. Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG.
arXiv Detail & Related papers (2025-10-17T02:33:16Z) - RepoSummary: Feature-Oriented Summarization and Documentation Generation for Code Repositories [7.744086870383438]
RepoSummary is a feature-oriented code repository summarization approach that simultaneously generates repository documentation automatically. It establishes more accurate traceability links from functional features to the corresponding code elements.
arXiv Detail & Related papers (2025-10-13T06:16:44Z) - Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers [103.4410890572479]
We introduce the Loong Project: an open-source framework for scalable synthetic data generation and verification. LoongBench is a curated seed dataset containing 8,729 human-vetted examples across 12 domains. LoongEnv is a modular synthetic data generation environment that supports multiple prompting strategies to produce new question-answer-code triples.
arXiv Detail & Related papers (2025-09-03T06:42:40Z) - DocRefine: An Intelligent Framework for Scientific Document Understanding and Content Optimization based on Multimodal Large Model Agents [25.190790899297788]
DocRefine is an innovative framework designed for intelligent understanding, content refinement, and automated summarization of scientific PDF documents. It orchestrates a sophisticated multi-agent system comprising six specialized and collaborative agents, and it consistently outperforms state-of-the-art baselines across various tasks.
arXiv Detail & Related papers (2025-08-09T15:32:52Z) - Zero-Shot Document Understanding using Pseudo Table of Contents-Guided Retrieval-Augmented Generation [4.875345207589195]
DocsRay is a training-free document understanding system. It integrates pseudo Table of Contents (TOC) generation with hierarchical Retrieval-Augmented Generation (RAG).
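The two-stage retrieval idea behind a pseudo-TOC guided RAG system can be sketched as below. This is a hedged illustration, not DocsRay's implementation: the word-overlap scoring stands in for whatever embedding similarity the real system uses, and the `toc` structure is an assumed simplification.

```python
# Sketch of pseudo-TOC guided hierarchical retrieval: first select the
# most relevant section from a (generated) table of contents, then rank
# chunks only within that section. Word overlap stands in for embeddings.
def overlap(a: str, b: str) -> int:
    """Count shared lowercase words between two strings."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def toc_guided_retrieve(query: str, toc: dict, top_k: int = 2) -> list:
    """`toc` maps section titles to lists of text chunks."""
    # Stage 1: pick the section whose title best matches the query.
    section = max(toc, key=lambda title: overlap(query, title))
    # Stage 2: rank chunks inside the chosen section only, which keeps
    # retrieval cost proportional to one section, not the whole document.
    chunks = sorted(toc[section], key=lambda c: overlap(query, c), reverse=True)
    return chunks[:top_k]
```

The hierarchical narrowing is what makes the approach scale to long documents: the second stage never scores chunks outside the selected section.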
arXiv Detail & Related papers (2025-07-31T03:14:45Z) - Docopilot: Improving Multimodal Models for Document-Level Understanding [87.60020625241178]
We present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents. This dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents. Building on the dataset, we develop a native multimodal model, Docopilot, which can accurately handle document-level dependencies without relying on RAG.
arXiv Detail & Related papers (2025-07-19T16:03:34Z) - OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models [58.45517851437422]
Visually-situated text parsing (VsTP) has recently seen notable advancements, driven by the growing demand for automated document understanding. Existing solutions often rely on task-specific architectures and objectives for individual tasks. In this paper, we introduce OmniParser V2, a universal model that unifies typical VsTP tasks, including text spotting, key information extraction, table recognition, and layout analysis.
arXiv Detail & Related papers (2025-02-22T09:32:01Z) - CoIR: A Comprehensive Benchmark for Code Information Retrieval Models [52.61625841028781]
COIR (Code Information Retrieval Benchmark) is a robust and comprehensive benchmark designed to assess code retrieval capabilities. COIR comprises ten meticulously curated code datasets, spanning eight distinctive retrieval tasks across seven diverse domains. We evaluate nine widely used retrieval models using COIR, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems.
arXiv Detail & Related papers (2024-07-03T07:58:20Z) - Alibaba LingmaAgent: Improving Automated Issue Resolution via Comprehensive Repository Exploration [64.19431011897515]
This paper presents Alibaba LingmaAgent, a novel Automated Software Engineering method designed to comprehensively understand and utilize whole software repositories for issue resolution. Our approach introduces a top-down method to condense critical repository information into a knowledge graph, reducing complexity, and employs a Monte Carlo tree search based strategy. In production deployment and evaluation at Alibaba Cloud, LingmaAgent automatically resolved 16.9% of in-house issues faced by development engineers, and solved 43.3% of problems after manual intervention.
arXiv Detail & Related papers (2024-06-03T15:20:06Z) - RepoAgent: An LLM-Powered Open-Source Framework for Repository-level Code Documentation Generation [79.83270415843857]
We introduce RepoAgent, a large language model powered open-source framework aimed at proactively generating, maintaining, and updating code documentation.
We have validated the effectiveness of our approach, showing that RepoAgent excels in generating high-quality repository-level documentation.
arXiv Detail & Related papers (2024-02-26T15:39:52Z) - Generate rather than Retrieve: Large Language Models are Strong Context Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer.
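The generate-then-read pattern described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `llm` is a stand-in for any text-completion function, and the prompt wording is invented for the example, not taken from the paper.

```python
# Sketch of the generate-then-read (GenRead) pattern: instead of
# retrieving documents, ask the model to *generate* contextual documents
# first, then answer the question by reading that generated context.
def genread(question: str, llm, n_docs: int = 3) -> str:
    # Step 1: generate several contextual documents for the question.
    docs = [
        llm(f"Generate a background document that helps answer: {question}")
        for _ in range(n_docs)
    ]
    # Step 2: read the generated documents to produce the final answer.
    context = "\n\n".join(docs)
    return llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```

Sampling several generated documents (`n_docs`) mirrors the paper's observation that diverse generated contexts improve answer coverage compared to a single one.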
arXiv Detail & Related papers (2022-09-21T01:30:59Z) - Topical: Learning Repository Embeddings from Source Code using Attention [3.110769442802435]
This paper presents Topical, a novel deep neural network for repository level embeddings.
The attention mechanism generates repository-level representations from source code, full dependency graphs, and script level textual data.
arXiv Detail & Related papers (2022-08-19T18:13:27Z) - Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de-facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that does not force any structure onto the search space: using all n-grams in a passage as its possible identifiers.
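The unstructured-identifier idea above amounts to treating every n-gram of a passage as a valid key. A hedged sketch of that enumeration, with word-level n-grams and an illustrative `max_n` cap standing in for whatever granularity the real system uses:

```python
# Sketch of "all n-grams as identifiers": enumerate every word n-gram of
# a passage so any of them can serve as a generated document identifier,
# with no imposed hierarchy over the search space.
def ngram_identifiers(passage: str, max_n: int = 3) -> set:
    """Return all word n-grams of the passage, up to length max_n."""
    words = passage.split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    }
```

An autoregressive model constrained to emit only strings from this set can then "retrieve" a passage by generating any substring that identifies it.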
arXiv Detail & Related papers (2022-04-22T10:45:01Z) - Assessing the quality of sources in Wikidata across languages: a hybrid approach [64.05097584373979]
We run a series of microtasks experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
arXiv Detail & Related papers (2021-09-20T10:06:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.