Related papers: CogDoc: Towards Unified thinking in Documents

URL: http://arxiv.org/abs/2512.12658v1
Date: Sun, 14 Dec 2025 12:14:17 GMT
Title: CogDoc: Towards Unified thinking in Documents
Authors: Qixin Xu, Haozhe Wang, Che Liu, Fangzhen Lin, Wenhu Chen
Abstract summary: We propose a unified coarse-to-fine thinking framework that mimics human cognitive processes: a low-resolution "Fast Reading" phase for scalable information localization, followed by a high-resolution "Focused Thinking" phase for deep reasoning. We conduct a rigorous investigation into post-training strategies for the unified thinking framework, demonstrating that a Direct Reinforcement Learning approach outperforms RL with Supervised Fine-Tuning (SFT) initialization. Specifically, we find that direct RL avoids the "policy conflict" observed in SFT.
Score: 53.41571589733423
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current document reasoning paradigms are constrained by a fundamental trade-off between scalability (processing long-context documents) and fidelity (capturing fine-grained, multimodal details). To bridge this gap, we propose CogDoc, a unified coarse-to-fine thinking framework that mimics human cognitive processes: a low-resolution "Fast Reading" phase for scalable information localization, followed by a high-resolution "Focused Thinking" phase for deep reasoning. We conduct a rigorous investigation into post-training strategies for the unified thinking framework, demonstrating that a Direct Reinforcement Learning (RL) approach outperforms RL with Supervised Fine-Tuning (SFT) initialization. Specifically, we find that direct RL avoids the "policy conflict" observed in SFT. Empirically, our 7B model achieves state-of-the-art performance within its parameter class, notably surpassing significantly larger proprietary models (e.g., GPT-4o) on challenging, visually rich document benchmarks.

Related papers

LaSER: Internalizing Explicit Reasoning into Latent Space for Dense Retrieval [74.72139580745511]
LaSER is a novel self-distillation framework that internalizes explicit reasoning into the latent space of retrievers. Our method successfully combines the reasoning depth of explicit CoT pipelines with the inference efficiency of standard dense retrievers.
arXiv Detail & Related papers (2026-03-02T04:11:18Z)
Answer First, Reason Later: Aligning Search Relevance via Mode-Balanced Reinforcement Learning [7.006180736433431]
Building a search relevance model that achieves both low latency and high performance is a long-standing challenge in the search industry. We propose a novel Answer-First, Reason Later (AFRL) paradigm. This paradigm requires the model to output the definitive relevance score in the very first token, followed by a structured logical explanation.
arXiv Detail & Related papers (2026-02-10T17:28:12Z)
PaperGuide: Making Small Language-Model Paper-Reading Agents More Efficient [20.72001543887772]
Recent progress in large language models (LLMs) has spurred interest in autonomous agents that can read scientific papers and extract task-relevant information. Most existing approaches rely either on heavily engineered prompting or on a conventional SFT-RL training pipeline. We propose Paper RL, a framework that mitigates these issues by separating high-level planning from fine-grained execution.
arXiv Detail & Related papers (2026-01-19T12:07:51Z)
VAR: Visual Attention Reasoning via Structured Search and Backtracking [49.427842994857635]
We introduce Visual Attention Reasoning, a framework that recasts grounded reasoning as a structured search. VAR decomposes the reasoning process into two key stages: traceable evidence grounding and search-based chain-of-thought. We show that our 7B model, VAR-7B, sets a new state-of-the-art on a comprehensive suite of hallucination and safety benchmarks.
arXiv Detail & Related papers (2025-10-21T13:18:44Z)
Step-Aware Policy Optimization for Reasoning in Diffusion Large Language Models [57.42778606399764]
Diffusion language models (dLLMs) offer a promising, non-autoregressive paradigm for text generation. Current reinforcement learning approaches often rely on sparse, outcome-based rewards. We argue that this stems from a fundamental mismatch with the natural structure of reasoning.
arXiv Detail & Related papers (2025-10-02T00:34:15Z)
DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding [66.07724324530844]
We propose DocThinker, a rule-based Reinforcement Learning framework for dynamic inference-time reasoning. Our method mitigates catastrophic forgetting and enhances both adaptability and transparency. Our findings highlight RL as a powerful alternative for enhancing explainability and adaptability in MLLM-based document understanding.
arXiv Detail & Related papers (2025-08-12T03:06:55Z)
Metis-RISE: RL Incentivizes and SFT Enhances Multimodal Reasoning Model Learning [20.515599491717442]
We introduce Metis-RISE (RL Incentivizes and SFT Enhances) for multimodal reasoning model learning.
arXiv Detail & Related papers (2025-06-16T02:56:13Z)
Vision-EKIPL: External Knowledge-Infused Policy Learning for Visual Reasoning [23.00801828244201]
This paper proposes a novel RL framework called Vision-EKIPL. It introduces high-quality actions generated by external auxiliary models during the RL training process to guide the optimization of the policy model. It achieves up to a 5% performance improvement on the Reason-RFT-CoT Benchmark compared to the state-of-the-art (SOTA).
arXiv Detail & Related papers (2025-06-07T16:37:46Z)
Model Steering: Learning with a Reference Model Improves Generalization Bounds and Scaling Laws [52.10468229008941]
This paper formalizes an emerging learning paradigm that uses a trained model as a reference to guide and enhance the training of a target model through strategic data selection or weighting. We provide theoretical insights into why this approach improves generalization and data efficiency compared to training without a reference model. Building on these insights, we introduce a novel method for Contrastive Language-Image Pretraining with a reference model, termed DRRho-CLIP.
arXiv Detail & Related papers (2025-05-10T16:55:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.