ArchAgent: Scalable Legacy Software Architecture Recovery with LLMs
- URL: http://arxiv.org/abs/2601.13007v1
- Date: Mon, 19 Jan 2026 12:39:05 GMT
- Title: ArchAgent: Scalable Legacy Software Architecture Recovery with LLMs
- Authors: Rusheng Pan, Bingcheng Mao, Tianyi Ma, Zhenhua Ling,
- Abstract summary: ArchAgent is a scalable agent-based framework that combines static analysis, adaptive code segmentation, and LLM-powered synthesis.<n>It reconstructs multiview, business-aligned architectures from cross-repositorys.<n>ArchAgent introduces scalable diagram generation with contextual pruning and integrates cross-repository data to identify business-critical modules.
- Score: 44.137226823695066
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recovering accurate architecture from large-scale legacy software is hindered by architectural drift, missing relations, and the limited context of Large Language Models (LLMs). We present ArchAgent, a scalable agent-based framework that combines static analysis, adaptive code segmentation, and LLM-powered synthesis to reconstruct multiview, business-aligned architectures from cross-repository codebases. ArchAgent introduces scalable diagram generation with contextual pruning and integrates cross-repository data to identify business-critical modules. Evaluations of typical large-scale GitHub projects show significant improvements over existing benchmarks. An ablation study confirms that dependency context improves the accuracy of generated architectures of production-level repositories, and a real-world case study demonstrates effective recovery of critical business logics from legacy projects. The dataset is available at https://github.com/panrusheng/arch-eval-benchmark.
Related papers
- Architecture-Aware Multi-Design Generation for Repository-Level Feature Addition [53.50448142467294]
RAIM is a multi-design and architecture-aware framework for repository-level feature addition.<n>It shifts away from linear patching by generating multiple diverse implementation designs.<n>Experiments on the NoCode-bench Verified dataset demonstrate that RAIM establishes a new state-of-the-art performance.
arXiv Detail & Related papers (2026-03-02T12:50:40Z) - ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development [72.4729759618632]
We introduce ABC-Bench, a benchmark to evaluate agentic backend coding within a realistic, executable workflow.<n>We curated 224 practical tasks spanning 8 languages and 19 frameworks from open-source repositories.<n>Our evaluation reveals that even state-of-the-art models struggle to deliver reliable performance on these holistic tasks.
arXiv Detail & Related papers (2026-01-16T08:23:52Z) - Towards Realistic Project-Level Code Generation via Multi-Agent Collaboration and Semantic Architecture Modeling [7.753074942497876]
We introduce CodeProjectEval, a project-level code generation dataset built from 18 real-world repositories with 12.7 files and 2,388.6 lines of code per task.<n>We propose ProjectGen, a multi-agent framework that decomposes projects into architecture design, skeleton generation, and code filling stages.<n>Experiments show that ProjectGen achieves state-of-the-art performance, passing 52/124 test cases on the small-scale project-level code generation dataset DevBench.
arXiv Detail & Related papers (2025-11-05T12:12:35Z) - LM-Searcher: Cross-domain Neural Architecture Search with LLMs via Unified Numerical Encoding [55.5535016040221]
LM-Searcher is a novel framework for cross-domain neural architecture optimization.<n>Central to our approach is NCode, a universal numerical string representation for neural architectures.<n>Our dataset, encompassing a wide range of architecture-performance pairs, encourages robust and transferable learning.
arXiv Detail & Related papers (2025-09-06T09:26:39Z) - SaraCoder: Orchestrating Semantic and Structural Cues for Resource-Optimized Repository-Level Code Completion [34.41683042851225]
We propose a resource-optimized retrieval augmentation method, SaraCoder.<n>It maximizes information diversity and representativeness in a limited context window.<n>Our work proves that systematically refining retrieval results across multiple dimensions provides a new paradigm for building more accurate and resource-optimized repository-level code completion systems.
arXiv Detail & Related papers (2025-08-13T11:56:05Z) - Large Language Models are Good Relational Learners [55.40941576497973]
We introduce Rel-LLM, a novel architecture that utilizes a graph neural network (GNN)- based encoder to generate structured relational prompts for large language models (LLMs)<n>Unlike traditional text-based serialization approaches, our method preserves the inherent relational structure of databases while enabling LLMs to process and reason over complex entity relationships.
arXiv Detail & Related papers (2025-06-06T04:07:55Z) - Class-Level Code Generation from Natural Language Using Iterative, Tool-Enhanced Reasoning over Repository [4.767858874370881]
We introduce RepoClassBench, a benchmark designed to rigorously evaluate LLMs in generating class-level code within real-world repositories.
RepoClassBench includes "Natural Language to Class generation" tasks across Java, Python & C# from a selection of repositories.
We introduce Retrieve-Repotools-Reflect (RRR), a novel approach that equips LLMs with static analysis tools to iteratively navigate & reason about repository-level context.
arXiv Detail & Related papers (2024-04-22T03:52:54Z) - Serving Deep Learning Model in Relational Databases [70.53282490832189]
Serving deep learning (DL) models on relational data has become a critical requirement across diverse commercial and scientific domains.
We highlight three pivotal paradigms: The state-of-the-art DL-centric architecture offloads DL computations to dedicated DL frameworks.
The potential UDF-centric architecture encapsulates one or more tensor computations into User Defined Functions (UDFs) within the relational database management system (RDBMS)
arXiv Detail & Related papers (2023-10-07T06:01:35Z) - Operation Embeddings for Neural Architecture Search [15.033712726016255]
We propose the replacement of fixed operator encoding with learnable representations in the optimization process.
Our method produces top-performing architectures that share similar operation and graph patterns.
arXiv Detail & Related papers (2021-05-11T09:17:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.