Related papers: Reading or Reasoning? Format Decoupled Reinforcement Learning for Document OCR

Reading or Reasoning? Format Decoupled Reinforcement Learning for Document OCR

URL: http://arxiv.org/abs/2601.08834v1
Date: Thu, 11 Dec 2025 12:24:45 GMT
Title: Reading or Reasoning? Format Decoupled Reinforcement Learning for Document OCR
Authors: Yufeng Zhong, Lei Chen, Zhixiong Zeng, Xuanle Zhao, Deyang Jiang, Liming Zheng, Jing Huang, Haibo Qiu, Peng Shi, Siqi Yang, Lin Ma,
Abstract summary: In this work, we observe that even advanced OCR models exhibit significantly higher entropy in formatted text.<n>We propose format decoupled reinforcement learning (FD-RL) which leverages high-entropy patterns for targeted optimization.<n>FD-RL achieves an average score of 90.41 on OmniDocBench, setting a new record for end-to-end models on this highly popular benchmark.
Score: 19.56202762840295
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reading text from images or scanned documents via OCR models has been a longstanding focus of researchers. Intuitively, text reading is perceived as a straightforward perceptual task, and existing work primarily focuses on constructing enriched data engineering to enhance SFT capabilities. In this work, we observe that even advanced OCR models exhibit significantly higher entropy in formatted text (\emph{e.g.}, formula, table, etc.) compared to plain text, often by an order of magnitude. These statistical patterns reveal that advanced OCR models struggle with high output uncertainty when dealing with format sensitive document, suggesting that reasoning over diverse reading pathways may improve OCR performance. To address this, we propose format decoupled reinforcement learning (FD-RL), which leverages high-entropy patterns for targeted optimization. Our approach employs entropy-based data filtration strategy to identify format-intensive instances, and adopt format decoupled rewards tailored to different format types, enabling format-level validation rather than token-level memorization. FD-RL achieves an average score of 90.41 on OmniDocBench, setting a new record for end-to-end models on this highly popular benchmark. More importantly, we conduct comprehensive ablation studies over data, training, filtering, and rewarding strategies, thoroughly validating their effectiveness.

Related papers

FireRed-OCR Technical Report [30.019999826760003]
We introduce FireRed-OCR, a framework to transform general-purpose VLMs into pixel-precise structural document parsing experts.<n>To address the scarcity of high-quality structured data, we construct a Geometry + Semantics'' Data Factory.<n>We propose a Three-Stage Progressive Training strategy that guides the model from pixel-level perception to logical structure generation.
arXiv Detail & Related papers (2026-03-02T13:19:23Z)
AgenticOCR: Parsing Only What You Need for Efficient Retrieval-Augmented Generation [35.07704681580893]
We introduce AgenticOCR, a dynamic parsing paradigm that transforms optical character recognition (OCR) into a query-driven, on-demand extraction system.<n>By autonomously analyzing document layout in a "thinking with images" manner, AgenticOCR identifies and selectively recognizes regions of interest.<n>AgenticOCR has the potential to serve as the "third building block" of the visual document RAG stack.
arXiv Detail & Related papers (2026-02-27T16:09:38Z)
Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding [102.88996030431662]
We propose a training-free and highly efficient acceleration method for document parsing tasks.<n>Inspired by speculative decoding, we employ a lightweight document parsing pipeline as a draft model to predict batches of future tokens.<n>We demonstrate the effectiveness of our approach on the general-purpose OmniDocBench.
arXiv Detail & Related papers (2026-02-13T14:22:10Z)
Infusing fine-grained visual knowledge to Vision-Language Models [5.487134463783365]
Large-scale contrastive pre-training produces powerful Vision-and-Language Models (VLMs)<n>We propose a fine-tuning method explicitly designed to achieve optimal balance between fine-grained domain adaptation and retention of the pretrained VLM's broad multimodal knowledge.<n>Our approach consistently achieves strong results, notably retaining the visual-text alignment without utilizing any text data or the original text encoder during fine-tuning.
arXiv Detail & Related papers (2025-08-16T19:12:09Z)
Writing-RL: Advancing Long-form Writing via Adaptive Curriculum Reinforcement Learning [55.41828729623907]
We present Writing-RL: an Adaptive Curriculum Reinforcement Learning framework to advance long-form writing capabilities beyond supervised fine-tuning.<n>The framework consists of three key components: Margin-aware Data Selection strategy that prioritizes samples with high learning potential, Pairwise Comparison Reward mechanism that provides discriminative learning signals, and Dynamic Reference Scheduling approach.
arXiv Detail & Related papers (2025-06-06T05:40:39Z)
Lightweight and Direct Document Relevance Optimization for Generative Information Retrieval [49.669503570350166]
Generative information retrieval (GenIR) is a promising neural retrieval paradigm that formulates document retrieval as a document identifier (docid) generation task.<n>Existing GenIR models suffer from token-level misalignment, where models trained to predict the next token often fail to capture document-level relevance effectively.<n>We propose direct document relevance optimization (DDRO), which aligns token-level docid generation with document-level relevance estimation through direct optimization via pairwise ranking.
arXiv Detail & Related papers (2025-04-07T15:27:37Z)
Discourse Features Enhance Detection of Document-Level Machine-Generated Content [53.41994768824785]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation.<n>Existing MGC detectors often focus solely on surface-level information, overlooking implicit and structural features.<n>We introduce novel methodologies and datasets to overcome these challenges.
arXiv Detail & Related papers (2024-12-17T08:47:41Z)
Large Language Models for Page Stream Segmentation [0.03495246564946555]
Page Stream (PSS) is an essential prerequisite for automated document processing at scale. This paper introduces TABME++, an enhanced benchmark featuring commercial Optical Character Recognition (OCR) annotations. We evaluate the performance of large language models (LLMs) on PSS, focusing on decoder-based models fine-tuned with parameter-efficient methods.
arXiv Detail & Related papers (2024-08-21T20:28:42Z)
Mixed Text Recognition with Efficient Parameter Fine-Tuning and Transformer [12.966765239586994]
This paper proposes DLoRA-TrOCR, a parameter-efficient hybrid text spotting method based on a pre-trained OCR Transformer.<n>By embedding a weight-decomposed DoRA module in the image encoder and a LoRA module in the text decoder, this method can be efficiently fine-tuned on various downstream tasks.<n> Experiments show that our proposed DLoRA-TrOCR outperforms other parameter-efficient fine-tuning methods in recognizing complex scenes with mixed handwritten, printed, and street text.
arXiv Detail & Related papers (2024-04-19T09:28:16Z)
Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery. Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data. In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
Noise-Robust Dense Retrieval via Contrastive Alignment Post Training [89.29256833403167]
Contrastive Alignment POst Training (CAPOT) is a highly efficient finetuning method that improves model robustness without requiring index regeneration. CAPOT enables robust retrieval by freezing the document encoder while the query encoder learns to align noisy queries with their unaltered root. We evaluate CAPOT noisy variants of MSMARCO, Natural Questions, and Trivia QA passage retrieval, finding CAPOT has a similar impact as data augmentation with none of its overhead.
arXiv Detail & Related papers (2023-04-06T22:16:53Z)
Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects. Our framework assumes a hierarchical latent structure of a document where the top-level captures the long range dependency. We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.