What Papers Don't Tell You: Recovering Tacit Knowledge for Automated Paper Reproduction
- URL: http://arxiv.org/abs/2603.01801v1
- Date: Mon, 02 Mar 2026 12:33:31 GMT
- Title: What Papers Don't Tell You: Recovering Tacit Knowledge for Automated Paper Reproduction
- Authors: Lehui Li, Ruining Wang, Haochen Song, Yaoxin Mao, Tong Zhang, Yuyao Wang, Jiayi Fan, Yitong Zhang, Jieping Ye, Chengqi Zhang, Yongshun Gong
- Abstract summary: The proposed method is a graph-based agent framework for generating executable code from academic papers. On an extended ReproduceBench spanning 3 domains, 10 tasks, and 40 recent papers, the method achieves an average performance gap of 10.04% against official implementations.
- Score: 57.86097956633207
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated paper reproduction -- generating executable code from academic papers -- is bottlenecked not by information retrieval but by the tacit knowledge that papers inevitably leave implicit. We formalize this challenge as the progressive recovery of three types of tacit knowledge -- relational, somatic, and collective -- and propose a graph-based agent framework with a dedicated mechanism for each: node-level relation-aware aggregation recovers relational knowledge by analyzing implementation-unit-level reuse and adaptation relationships between the target paper and its citation neighbors; execution-feedback refinement recovers somatic knowledge through iterative debugging driven by runtime signals; and graph-level knowledge induction distills collective knowledge from clusters of papers sharing similar implementations. On an extended ReproduceBench spanning 3 domains, 10 tasks, and 40 recent papers, the method achieves an average performance gap of 10.04% against official implementations, improving over the strongest baseline by 24.68%. The code will be publicly released upon acceptance; the repository link will be provided in the final version.
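The execution-feedback refinement mechanism described in the abstract (iterative debugging driven by runtime signals) can be sketched as a simple run-and-revise loop. This is a minimal illustration, not the paper's implementation; the `revise` callable standing in for an LLM-backed debugging agent is a hypothetical placeholder.

```python
import os
import subprocess
import sys
import tempfile

def refine_with_execution_feedback(code, revise, max_rounds=3):
    """Run candidate code; on failure, feed the runtime error back to a
    reviser and retry -- a minimal sketch of execution-feedback refinement.

    `revise` is a hypothetical callable (code, stderr) -> revised code,
    e.g. an LLM-backed debugging agent.
    """
    for _ in range(max_rounds):
        # Write the candidate to a temp file and execute it in a subprocess.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        try:
            result = subprocess.run([sys.executable, path],
                                    capture_output=True, text=True, timeout=30)
        finally:
            os.unlink(path)
        if result.returncode == 0:
            return code  # executes cleanly: stop refining
        code = revise(code, result.stderr)  # debug using the runtime signal
    return code
```

The runtime signal (here, the raw stderr text) is what carries the "somatic" knowledge: it surfaces failures that the paper's prose never states explicitly.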
Related papers
- Enhancing Automated Paper Reproduction via Prompt-Free Collaborative Agents [8.185402940269794]
We propose a prompt-free collaborative agent framework that automatically enhances the quality of paper-to-code generation. Our approach employs two collaborative agents: a verification agent that examines whether the outputs at each step satisfy the requirements specified in the corresponding system prompt, and a refinement agent that revises the outputs based on the identified issues.
arXiv Detail & Related papers (2025-12-02T14:24:23Z) - Executable Knowledge Graphs for Replicating AI Research [65.41207324831583]
Executable Knowledge Graphs (xKG) is a modular and pluggable knowledge base that automatically integrates technical insights, code snippets, and domain-specific knowledge extracted from scientific literature. Code will be released at https://github.com/zjunlp/xKG.
arXiv Detail & Related papers (2025-10-20T17:53:23Z) - RepoSummary: Feature-Oriented Summarization and Documentation Generation for Code Repositories [7.744086870383438]
RepoSummary is a feature-oriented code repository summarization approach. It simultaneously generates repository documentation automatically and establishes more accurate traceability links from functional features to the corresponding code elements.
arXiv Detail & Related papers (2025-10-13T06:16:44Z) - Reflective Paper-to-Code Reproduction Enabled by Fine-Grained Verification [46.845133190560375]
Motivated by how humans use systematic checklists to efficiently debug complex code, we propose RePro, a Reflective Paper-to-Code Reproduction framework. It automatically extracts a paper's fingerprint: a comprehensive set of accurate and atomic criteria serving as high-quality supervisory signals. It achieves a 13.0% performance gap over baselines, and it correctly revises complex logical and mathematical criteria during reflection.
arXiv Detail & Related papers (2025-08-21T06:57:44Z) - Cite Pretrain: Retrieval-Free Knowledge Attribution for Large Language Models [44.31597857713689]
We introduce Active Indexing for the first stage, which creates generalizable, source-anchored bindings. Experiments with Qwen-2.5-7B and 3B show that Active Indexing consistently outperforms a Passive Indexing baseline. Internal citations complement external ones by making the model more robust to retrieval noise.
arXiv Detail & Related papers (2025-06-21T04:48:05Z) - Learning Refined Document Representations for Dense Retrieval via Deliberate Thinking [58.69615583599489]
Deliberate Thinking based Retriever (Debater) is a novel approach that enhances document representations by incorporating a step-by-step thinking process. Debater significantly outperforms existing methods across several retrieval benchmarks.
arXiv Detail & Related papers (2025-02-18T15:56:34Z) - Multi-Facet Blending for Faceted Query-by-Example Retrieval [5.156059061769101]
We propose a multi-facet blending (FaBle) augmentation method, which exploits modularity by decomposing and recomposing to explicitly synthesize facet-specific training sets. Our modularization eliminates the need for pre-defined facet knowledge or labels. FaBle augmentation on 1K documents remarkably assists training in obtaining facet conditional embeddings.
arXiv Detail & Related papers (2024-12-02T12:32:19Z) - Consistency Guided Knowledge Retrieval and Denoising in LLMs for Zero-shot Document-level Relation Triplet Extraction [43.50683283748675]
Document-level Relation Triplet Extraction (DocRTE) is a fundamental task in information systems that aims to simultaneously extract entities with semantic relations from a document.
Existing methods heavily rely on a substantial amount of fully labeled data.
Recent advanced Large Language Models (LLMs), such as ChatGPT and LLaMA, exhibit impressive long-text generation capabilities.
arXiv Detail & Related papers (2024-01-24T17:04:28Z) - Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de-facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that doesn't force any structure in the search space: using all ngrams in a passage as its possible identifiers.
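The idea of treating all n-grams in a passage as its possible identifiers can be illustrated with a small enumeration. This is a simplified sketch, not the paper's system; the whitespace tokenizer and the `max_n` cutoff are assumptions made for illustration.

```python
def passage_ngrams(passage, max_n=3):
    """Enumerate all word n-grams (n = 1..max_n) of a passage.

    Any of these n-grams can act as an identifier that points back to the
    passage, instead of forcing a single hierarchical ID onto the search space.
    """
    tokens = passage.split()
    return {
        tuple(tokens[i:i + n])          # one candidate identifier
        for n in range(1, max_n + 1)    # all lengths up to max_n
        for i in range(len(tokens) - n + 1)
    }
```

For a 3-word passage with `max_n=3` this yields six identifiers: three unigrams, two bigrams, and one trigram, any of which an autoregressive decoder could generate to retrieve the passage.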
arXiv Detail & Related papers (2022-04-22T10:45:01Z) - Integrating Semantics and Neighborhood Information with Graph-Driven Generative Models for Document Retrieval [51.823187647843945]
In this paper, we encode the neighborhood information with a graph-induced Gaussian distribution, and propose to integrate the two types of information with a graph-driven generative model.
Under the approximation, we prove that the training objective can be decomposed into terms involving only singleton or pairwise documents, enabling the model to be trained as efficiently as uncorrelated ones.
arXiv Detail & Related papers (2021-05-27T11:29:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.