PaperGuide: Making Small Language-Model Paper-Reading Agents More Efficient
- URL: http://arxiv.org/abs/2601.12988v1
- Date: Mon, 19 Jan 2026 12:07:51 GMT
- Title: PaperGuide: Making Small Language-Model Paper-Reading Agents More Efficient
- Authors: Zijian Wang, Tiancheng Huang, Hanqi Li, Da Ma, Lu Chen, Kai Yu
- Abstract summary: Recent progress in large language models (LLMs) has spurred interest in autonomous agents that can read scientific papers and extract task-relevant information. Most existing approaches rely either on heavily engineered prompting or on a conventional SFT-RL training pipeline. We propose PaperCompass, a framework that mitigates these issues by separating high-level planning from fine-grained execution.
- Score: 20.72001543887772
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The accelerating growth of the scientific literature makes it increasingly difficult for researchers to track new advances through manual reading alone. Recent progress in large language models (LLMs) has therefore spurred interest in autonomous agents that can read scientific papers and extract task-relevant information. However, most existing approaches rely either on heavily engineered prompting or on a conventional SFT-RL training pipeline, both of which often lead to excessive and low-yield exploration. Drawing inspiration from cognitive science, we propose PaperCompass, a framework that mitigates these issues by separating high-level planning from fine-grained execution. PaperCompass first drafts an explicit plan that outlines the intended sequence of actions, and then performs detailed reasoning to instantiate each step by selecting the parameters for the corresponding function calls. To train such behavior, we introduce Draft-and-Follow Policy Optimization (DFPO), a tailored RL method that jointly optimizes both the draft plan and the final solution. DFPO can be viewed as a lightweight form of hierarchical reinforcement learning, aimed at narrowing the 'knowing-doing' gap in LLMs. We provide a theoretical analysis that establishes DFPO's favorable optimization properties, supporting a stable and reliable training process. Experiments on paper-based question answering (Paper-QA) benchmarks show that PaperCompass improves efficiency over strong baselines without sacrificing performance, achieving results comparable to much larger models.
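The draft-then-follow behavior described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation and does not cover DFPO training; it only shows the separation between drafting a sequence of action names and then choosing the arguments for each call. The tool names (`search_sections`, `read_section`), the helper `pick_keyword`, and the dictionary-based paper are all hypothetical stand-ins, and the rule-based choices replace what would be model reasoning in a real agent.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    args: dict


# Hypothetical tools operating on a paper stored as {section title: text}.
def search_sections(paper, query):
    return [title for title, text in paper.items() if query.lower() in text.lower()]


def read_section(paper, title):
    return paper[title]


TOOLS = {"search_sections": search_sections, "read_section": read_section}


def draft_plan(question):
    # High-level planning: only the ordered action names, no arguments yet.
    # A real agent would generate this draft with the language model.
    return ["search_sections", "read_section"]


def pick_keyword(question):
    # Stand-in for detailed reasoning: prefer an acronym, else the longest word.
    words = [w.strip("?.,!") for w in question.split()]
    acronyms = [w for w in words if w.isupper()]
    return acronyms[0] if acronyms else max(words, key=len)


def follow_plan(plan, question, paper):
    # Fine-grained execution: instantiate each drafted step with arguments.
    trace, last = [], None
    for step in plan:
        if step == "search_sections":
            args = {"query": pick_keyword(question)}
        elif step == "read_section":
            if not last:  # nothing found by the previous step
                return None, trace
            args = {"title": last[0]}
        else:
            raise ValueError(f"unknown step: {step}")
        last = TOOLS[step](paper, **args)
        trace.append(ToolCall(step, args))
    return last, trace


paper = {
    "Introduction": "Reading agents explore papers inefficiently.",
    "Method": "DFPO jointly optimizes the draft plan and the final solution.",
}
question = "What does DFPO optimize?"
answer, trace = follow_plan(draft_plan(question), question, paper)
```

Keeping the draft as a plain list of action names means the plan can be inspected or scored separately from the argument choices, which is the separation the abstract emphasizes.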
Related papers
- PaperScout: An Autonomous Agent for Academic Paper Search with Process-Aware Sequence-Level Policy Optimization [11.080060663295072]
PaperScout is an autonomous agent that reformulates paper search as a sequential decision-making process.
We introduce Proximal Sequence Policy Optimization (PSPO), a process-aware, sequence-level policy optimization method.
Experiments on both synthetic and real-world benchmarks demonstrate that PaperScout significantly outperforms strong workflow-driven and RL baselines in both recall and relevance.
arXiv Detail & Related papers (2026-01-15T03:21:21Z)
- CogDoc: Towards Unified thinking in Documents [53.41571589733423]
We propose a unified coarse-to-fine thinking framework that mimics human cognitive processes: a low-resolution "Fast Reading" phase for scalable information localization, followed by a high-resolution "Focused Thinking" phase for deep reasoning.
We conduct a rigorous investigation into post-training strategies for the unified thinking framework, demonstrating that a Direct Reinforcement Learning approach outperforms RL with Supervised Fine-Tuning (SFT).
Specifically, we find that direct RL avoids the "policy conflict" observed in SFT.
arXiv Detail & Related papers (2025-12-14T12:14:17Z)
- Rethinking On-policy Optimization for Query Augmentation [49.87723664806526]
We present the first systematic comparison of prompting-based and RL-based query augmentation across diverse benchmarks.
We introduce a novel hybrid method, On-policy Pseudo-document Query Expansion (OPQE), which learns to generate a pseudo-document that maximizes retrieval performance.
arXiv Detail & Related papers (2025-10-20T04:16:28Z)
- Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections [65.36449542323277]
We present a unified theoretical framework bridging Supervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training.
We propose a simple yet effective learning rate reduction approach that yields significant performance improvements.
arXiv Detail & Related papers (2025-06-15T05:42:29Z)
- Direct Retrieval-augmented Optimization: Synergizing Knowledge Selection and Language Models [83.8639566087953]
We propose a direct retrieval-augmented optimization framework, named DRO, that enables end-to-end training of two key components.
DRO alternates between two phases: (i) document permutation estimation and (ii) re-weighted, progressively improving RAG components.
Our theoretical analysis reveals that DRO is analogous to policy-gradient methods in reinforcement learning.
arXiv Detail & Related papers (2025-05-05T23:54:53Z)
- A Survey of Direct Preference Optimization [103.59317151002693]
Large Language Models (LLMs) have demonstrated unprecedented generative capabilities.
Their alignment with human values remains critical for ensuring helpful and harmless deployments.
Direct Preference Optimization (DPO) has recently gained prominence as a streamlined alternative.
arXiv Detail & Related papers (2025-03-12T08:45:15Z)
- RaCT: Ranking-aware Chain-of-Thought Optimization for LLMs [30.216174551427443]
Large language models (LLMs) have demonstrated remarkable potential in text reranking tasks.
Conventional supervised fine-tuning approaches for specializing LLMs in ranking tasks often lead to significant degradation of the models' general-purpose abilities.
This paper presents a novel methodology that strategically combines Chain-of-Thought (CoT) prompting techniques with an innovative two-stage training pipeline.
arXiv Detail & Related papers (2024-12-18T23:24:15Z)
- In-context Demonstration Matters: On Prompt Optimization for Pseudo-Supervision Refinement [71.60563181678323]
Large language models (LLMs) have achieved great success across diverse tasks, and fine-tuning is sometimes needed to further enhance generation quality.
To handle these challenges, a direct solution is to generate "high-confidence" data from unsupervised downstream tasks.
We propose a novel approach, pseudo-supervised demonstrations aligned prompt optimization (PAPO) algorithm, which jointly refines both the prompt and the overall pseudo-supervision.
arXiv Detail & Related papers (2024-10-04T03:39:28Z)
- The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities [0.35998666903987897]
This report examines the fine-tuning of Large Language Models (LLMs).
It outlines the historical evolution of LLMs from traditional Natural Language Processing (NLP) models to their pivotal role in AI.
The report introduces a structured seven-stage pipeline for fine-tuning LLMs.
arXiv Detail & Related papers (2024-08-23T14:48:02Z)
- A Survey on Efficient Inference for Large Language Models [25.572035747669275]
Large Language Models (LLMs) have attracted extensive attention due to their remarkable performance across various tasks.
The substantial computational and memory requirements of LLM inference pose challenges for deployment in resource-constrained scenarios.
This paper presents a comprehensive survey of the existing literature on efficient LLM inference.
arXiv Detail & Related papers (2024-04-22T15:53:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.