FocalOrder: Focal Preference Optimization for Reading Order Detection
- URL: http://arxiv.org/abs/2601.07483v1
- Date: Mon, 12 Jan 2026 12:37:04 GMT
- Title: FocalOrder: Focal Preference Optimization for Reading Order Detection
- Authors: Fuyuan Liu, Dianyu Yu, He Ren, Nayu Liu, Xiaomian Kang, Delai Qiu, Fa Zhang, Genpeng Zhen, Shengping Liu, Jiaen Liang, Wei Huang, Yining Wang, Junnan Zhu
- Abstract summary: We propose FocalOrder, a framework driven by Focal Preference Optimization (FPO). FocalOrder employs adaptive difficulty discovery with an exponential moving average mechanism to dynamically pinpoint hard-to-learn transitions. Experiments demonstrate that FocalOrder establishes new state-of-the-art results on OmniDocBench v1.0 and Comp-HRDoc.
- Score: 23.497081928689525
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reading order detection is the foundation of document understanding. Most existing methods rely on uniform supervision, implicitly assuming a constant difficulty distribution across layout regions. In this work, we challenge this assumption by revealing a critical flaw: \textbf{Positional Disparity}, a phenomenon where models demonstrate mastery over the deterministic start and end regions but suffer a performance collapse in the complex intermediate sections. This degradation arises because standard training allows the massive volume of easy patterns to drown out the learning signals from difficult layouts. To address this, we propose \textbf{FocalOrder}, a framework driven by \textbf{Focal Preference Optimization (FPO)}. Specifically, FocalOrder employs adaptive difficulty discovery with an exponential moving average mechanism to dynamically pinpoint hard-to-learn transitions, while introducing a difficulty-calibrated pairwise ranking objective to enforce global logical consistency. Extensive experiments demonstrate that FocalOrder establishes new state-of-the-art results on OmniDocBench v1.0 and Comp-HRDoc. Our compact model not only outperforms competitive specialized baselines but also significantly surpasses large-scale general VLMs. These results demonstrate that aligning the optimization with the intrinsic structural ambiguity of documents is critical for mastering complex document structures.
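The abstract describes two concrete mechanisms: an exponential-moving-average (EMA) tracker that scores how hard each reading-order transition is to learn, and a pairwise ranking objective whose per-pair weight is calibrated by that difficulty. Below is a minimal sketch of how such an objective could be wired together, based only on the abstract: the `EMADifficulty` class, the focal exponent `gamma`, the margin form, and all names are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F


class EMADifficulty:
    """Tracks a per-transition difficulty score as an EMA of its recent loss."""

    def __init__(self, num_transitions: int, momentum: float = 0.9):
        self.momentum = momentum
        self.scores = torch.zeros(num_transitions)  # kept on CPU

    def update(self, idx: torch.Tensor, loss: torch.Tensor) -> None:
        # d_t <- m * d_{t-1} + (1 - m) * current per-pair loss (detached)
        idx = idx.cpu()
        self.scores[idx] = (self.momentum * self.scores[idx]
                            + (1.0 - self.momentum) * loss.detach().cpu())

    def weight(self, idx: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
        # Focal-style weight: pairs with above-average EMA loss are amplified,
        # easy pairs are damped (floored at 0.1 so they still contribute).
        d = self.scores[idx.cpu()]
        return (d / (d.mean() + 1e-8)).clamp(min=0.1) ** gamma


def fpo_pairwise_loss(order_scores: torch.Tensor,
                      pairs: torch.Tensor,
                      pair_ids: torch.Tensor,
                      tracker: EMADifficulty,
                      margin: float = 1.0) -> torch.Tensor:
    """Difficulty-calibrated pairwise ranking loss (hypothetical form).

    order_scores: (N,) predicted ordering score for each layout region
    pairs:        (P, 2) index pairs (i, j) where region i should precede j
    pair_ids:     (P,) stable ids used to look up each pair's EMA difficulty
    """
    s_i = order_scores[pairs[:, 0]]
    s_j = order_scores[pairs[:, 1]]
    # Margin ranking term per pair: we want s_i > s_j + margin.
    per_pair = F.relu(margin - (s_i - s_j))
    tracker.update(pair_ids, per_pair)
    w = tracker.weight(pair_ids).to(per_pair.device)
    return (w * per_pair).mean()


# Usage with hypothetical shapes:
#   tracker = EMADifficulty(num_transitions=num_pairs)
#   loss = fpo_pairwise_loss(model_scores, pairs, pair_ids, tracker)
```

In this sketch, easy pairs (whose EMA loss stays low) are down-weighted rather than discarded, so the massive volume of easy patterns the abstract describes can no longer drown out the gradient signal from the hard intermediate transitions.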
Related papers
- OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL [63.388513841293616]
Existing forgery detection methods fail to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper aims to develop a unified framework for omnibus vision-language forgery detection and grounding. We propose OmniVL-Guard, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding.
arXiv Detail & Related papers (2026-02-11T09:41:36Z) - FMBench: Adaptive Large Language Model Output Formatting [49.52930069696333]
We present FMBench, a benchmark for adaptive Markdown output formatting. Experiments on two model families show that SFT consistently improves semantic alignment. Results also reveal an inherent trade-off between semantic and structural objectives.
arXiv Detail & Related papers (2026-02-06T04:42:06Z) - Efficient Causal Structure Learning via Modular Subgraph Integration [4.803851977437455]
We introduce VISTA, a modular framework that decomposes the global causal structure learning problem into local subgraphs based on Markov blankets. The framework is model-agnostic, imposing no assumptions on the inductive biases of base learners; it is compatible with arbitrary data settings and fully supports parallelization. Extensive experiments on both synthetic and real datasets consistently demonstrate the effectiveness of VISTA.
arXiv Detail & Related papers (2026-01-28T20:13:20Z) - Hi-ZFO: Hierarchical Zeroth- and First-Order LLM Fine-Tuning via Importance-Guided Tensor Selection [4.808936079900314]
We propose Hi-ZFO (Hierarchical Zeroth- and First-Order optimization) to synergize first-order (FO) gradients with zeroth-order (ZO) estimation. We show that Hi-ZFO consistently achieves superior performance while significantly reducing training time.
arXiv Detail & Related papers (2026-01-09T03:20:54Z) - ROAP: A Reading-Order and Attention-Prior Pipeline for Optimizing Layout Transformers in Key Information Extraction [5.594845708011402]
This paper presents ROAP, a lightweight and architecture-agnostic pipeline designed to optimize attention distributions in layout Transformers. Experiments on the FUNSD and CORD benchmarks demonstrate that ROAP consistently improves the performance of backbones.
arXiv Detail & Related papers (2026-01-09T02:02:37Z) - CogDoc: Towards Unified thinking in Documents [53.41571589733423]
We propose a unified coarse-to-fine thinking framework that mimics human cognitive processes: a low-resolution "Fast Reading" phase for scalable information localization, followed by a high-resolution "Focused Thinking" phase for deep reasoning. We conduct a rigorous investigation into post-training strategies for the unified thinking framework, demonstrating that a Direct Reinforcement Learning approach outperforms RL with Supervised Fine-Tuning (SFT). Specifically, we find that direct RL avoids the "policy conflict" observed in SFT.
arXiv Detail & Related papers (2025-12-14T12:14:17Z) - Hierarchical Evaluation of Software Design Capabilities of Large Language Models of Code [7.897548449569687]
Large language models (LLMs) are increasingly adopted in the software engineering domain, yet the robustness of their grasp of core design concepts remains unclear. We generate poorly designed software fragments under various levels of guidance. Reasoning about coupling proves brittle; performance collapses in noisy, open-ended scenarios. Reasoning-trace analysis confirms these failure modes, revealing "cognitive shortcutting" for coupling versus a more exhaustive (yet still failing) analysis for cohesion.
arXiv Detail & Related papers (2025-11-25T23:50:00Z) - Adapformer: Adaptive Channel Management for Multivariate Time Series Forecasting [49.40321003932633]
Adapformer is an advanced Transformer-based framework that merges the benefits of channel-independent (CI) and channel-dependent (CD) methodologies through effective channel management. Adapformer achieves superior performance over existing models, enhancing both predictive accuracy and computational efficiency.
arXiv Detail & Related papers (2025-11-18T16:24:05Z) - Unifying Tree Search Algorithm and Reward Design for LLM Reasoning: A Survey [92.71325249013535]
Deliberative tree search is a cornerstone of Large Language Model (LLM) research. This paper introduces a unified framework that deconstructs search algorithms into three core components.
arXiv Detail & Related papers (2025-10-11T03:29:18Z) - Beyond Fully Supervised Pixel Annotations: Scribble-Driven Weakly-Supervised Framework for Image Manipulation Localization [11.10178274806454]
We propose a form of weak supervision that improves annotation efficiency and detection performance. We re-annotate mainstream IML datasets with scribble labels and propose the first scribble-based IML dataset. We employ self-supervised training with a structural consistency loss to encourage the model to produce consistent predictions.
arXiv Detail & Related papers (2025-07-17T11:45:27Z) - Understanding Generalization of Federated Learning: the Trade-off between Model Stability and Optimization [34.520966684699665]
Federated Learning (FL) is a distributed learning approach that trains machine learning models across multiple devices. This paper introduces an innovative dynamics analysis framework, namely Libra, for analyzing algorithm generalization performance. We show that larger local steps or momentum accelerate the convergence of gradient norms while worsening model stability.
arXiv Detail & Related papers (2024-11-25T11:43:22Z) - Disperse-Then-Merge: Pushing the Limits of Instruction Tuning via Alignment Tax Reduction [75.25114727856861]
Large language models (LLMs) tend to suffer from deterioration at the latter stage of the supervised fine-tuning (SFT) process.
We introduce a simple disperse-then-merge framework to address the issue.
Our framework outperforms various sophisticated methods such as data curation and training regularization on a series of standard knowledge and reasoning benchmarks.
arXiv Detail & Related papers (2024-05-22T08:18:19Z) - Are Layout-Infused Language Models Robust to Layout Distribution Shifts? A Case Study with Scientific Documents [54.744701806413204]
Recent work has shown that infusing layout features into language models (LMs) improves processing of visually-rich documents such as scientific papers.
We test whether layout-infused LMs are robust to layout distribution shifts.
arXiv Detail & Related papers (2023-06-01T18:01:33Z) - Federated Conformal Predictors for Distributed Uncertainty Quantification [83.50609351513886]
Conformal prediction is emerging as a popular paradigm for providing rigorous uncertainty quantification in machine learning.
In this paper, we extend conformal prediction to the federated learning setting.
We propose a weaker notion of partial exchangeability, better suited to the FL setting, and use it to develop the Federated Conformal Prediction framework.
arXiv Detail & Related papers (2023-05-27T19:57:27Z) - Hard-normal Example-aware Template Mutual Matching for Industrial Anomaly Detection [78.734927709231]
Anomaly detectors are widely used in industrial manufacturing to detect and localize unknown defects in query images. These detectors are trained on anomaly-free samples and have successfully distinguished anomalies from most normal samples. However, hard-normal examples are scattered and far apart from most normal samples, and thus they are often mistaken for anomalies by existing methods.
arXiv Detail & Related papers (2023-03-28T17:54:56Z) - Fine-grained Retrieval Prompt Tuning [149.9071858259279]
Fine-grained Retrieval Prompt Tuning steers a frozen pre-trained model to perform the fine-grained retrieval task from the perspectives of sample prompt and feature adaptation.
Our FRPT with fewer learnable parameters achieves the state-of-the-art performance on three widely-used fine-grained datasets.
arXiv Detail & Related papers (2022-07-29T04:10:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.