DocVAL: Validated Chain-of-Thought Distillation for Grounded Document VQA
- URL: http://arxiv.org/abs/2511.22521v1
- Date: Thu, 27 Nov 2025 15:00:58 GMT
- Title: DocVAL: Validated Chain-of-Thought Distillation for Grounded Document VQA
- Authors: Ahmad Mohammadshirazi, Pinaki Prasad Guha Neogi, Dheeraj Kulshrestha, Rajiv Ramnath
- Abstract summary: Document visual question answering (DocVQA) requires models to jointly reason over textual content and spatial layout. Current systems exhibit a sharp accuracy-efficiency trade-off: large teacher models achieve strong grounding but are too expensive for deployment. We propose DocVAL, a validated chain-of-thought distillation framework that transfers the spatial reasoning ability of a large teacher into a deployable student VLM.
- Score: 1.580774794371876
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Document visual question answering (DocVQA) requires models to jointly reason over textual content and spatial layout, yet current systems exhibit a sharp accuracy-efficiency trade-off: large teacher models achieve strong grounding but are too expensive for deployment, while compact students suffer substantial drops in localization performance. We propose DocVAL, a validated chain-of-thought distillation framework that transfers the spatial reasoning ability of a large teacher into a deployable student VLM through three key components: (1) teacher supervision with validation-time text detection to filter and denoise training signals, (2) a multi-module validator (VAL) that enforces answer correctness and geometric consistency while producing fine-grained, pixel-level error feedback, and (3) a two-stage student training scheme that first learns from validated CoT traces and then undergoes iterative refinement driven by VAL feedback. Our student (Gemma-3 12B) achieves 91.4% ANLS and 82.4% mAP on DocVQA as a pure VLM requiring no text detection or OCR at inference. Extensive ablations demonstrate that validated feedback contributes a 6.3 mAP gain and iterative refinement accounts for a 9.7 mAP improvement. We release 95k high-quality, validator-verified CoT traces to advance spatial reasoning research in document understanding.
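The abstract reports answer quality as ANLS and localization quality as mAP, which is built on box overlap (IoU). As a rough illustration of what those numbers measure, the sketch below implements the commonly used per-question ANLS definition (threshold 0.5, best match over reference answers) and a plain IoU between two axis-aligned boxes. The abstract does not specify the paper's exact evaluation protocol or the internals of VAL, so the function names, the normalization (strip and lowercase), and the box format `(x_min, y_min, x_max, y_max)` are assumptions made for this example only.

```python
from typing import Sequence, Tuple

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls_score(prediction: str, references: Sequence[str],
               threshold: float = 0.5) -> float:
    """Per-question ANLS: best (1 - normalized edit distance) over the
    references, set to 0 when the normalized distance reaches the threshold."""
    pred = prediction.strip().lower()
    best = 0.0
    for ref in references:
        gold = ref.strip().lower()
        longest = max(len(pred), len(gold))
        nl = levenshtein(pred, gold) / longest if longest else 0.0
        score = 1.0 - nl if nl < threshold else 0.0
        best = max(best, score)
    return best


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0
```

A dataset-level ANLS is the mean of `anls_score` over all questions. mAP additionally requires matching predicted boxes to ground-truth boxes at one or more IoU thresholds and averaging precision, which is omitted here.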
Related papers
- GATES: Self-Distillation under Privileged Context with Consensus Gating [89.62339954332248]
We study self-distillation in settings where supervision is unreliable. We focus on document-grounded question answering with asymmetric context. We derive supervision online from tutor consensus by sampling multiple document-grounded reasoning traces.
arXiv Detail & Related papers (2026-02-24T05:56:20Z)
- PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering [71.15346406323827]
We introduce PRIME, a benchmark for evaluating verifiers on Process-Outcome Alignment verification. We find that current verifiers frequently fail to detect derivation flaws. We propose a process-aware RLVR training paradigm utilizing verifiers selected via PRIME.
arXiv Detail & Related papers (2026-02-12T04:45:01Z)
- Curriculum Learning for Efficient Chain-of-Thought Distillation via Structure-Aware Masking and GRPO [24.91321958525287]
Distilling Chain-of-Thought (CoT) reasoning from large language models into compact student models presents a fundamental challenge. Some existing approaches compress reasoning into a single step, losing the interpretability that makes CoT valuable. We present a three-stage curriculum learning framework that addresses this capacity mismatch through progressive skill acquisition.
arXiv Detail & Related papers (2026-02-05T05:27:11Z)
- Fair Context Learning for Evidence-Balanced Test-Time Adaptation in Vision-Language Models [10.45965859391796]
Test-Time Adaptation (TTA) aims to improve robustness using only unlabeled test samples. Most prompt-based TTA methods rely on entropy minimization. We propose Fair Context Learning (FCL) that avoids entropy minimization by explicitly addressing shared-evidence bias.
arXiv Detail & Related papers (2026-02-02T16:02:50Z)
- From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for Reinforcement Learning of Open-ended Generation [52.62655622099456]
We propose reinforcement learning with verifiable reference-based rewards (RLVRR). Instead of checking the final answer, RLVRR extracts an ordered linguistic signal from high-quality references (i.e., reward chain). In this way, RLVRR decomposes rewards into two dimensions: content, which preserves deterministic core concepts, and style, which evaluates adherence to stylistic properties.
arXiv Detail & Related papers (2026-01-26T14:39:58Z)
- GRACE: Reinforcement Learning for Grounded Response and Abstention under Contextual Evidence [9.80421132842862]
Retrieval-Augmented Generation (RAG) integrates external knowledge to enhance Large Language Models (LLMs). RAG is susceptible to two critical flaws: providing correct answers without explicit grounded evidence and producing fabricated responses when the retrieved context is insufficient. We propose GRACE, a reinforcement-learning framework that simultaneously mitigates both types of flaws.
arXiv Detail & Related papers (2026-01-08T02:47:33Z)
- Look As You Think: Unifying Reasoning and Visual Evidence Attribution for Verifiable Document RAG via Reinforcement Learning [55.232400251303794]
Look As You Think (LAT) is a reinforcement learning framework that trains models to produce verifiable reasoning paths with consistent attribution. LAT consistently improves the vanilla model in both single- and multi-image settings, yielding average gains of 8.23% in soft exact match (EM) and 47.0% in IoU@0.5.
arXiv Detail & Related papers (2025-11-15T02:50:23Z)
- Detect Anything via Next Point Prediction [51.55967987350882]
Rex-Omni is a 3B-scale MLLM that achieves state-of-the-art object perception performance. On benchmarks like COCO and LVIS, Rex-Omni attains performance comparable to or exceeding regression-based models.
arXiv Detail & Related papers (2025-10-14T17:59:54Z)
- Continual Action Quality Assessment via Adaptive Manifold-Aligned Graph Regularization [53.82400605816587]
Action Quality Assessment (AQA) quantifies human actions in videos, supporting applications in sports scoring, rehabilitation, and skill evaluation. A major challenge lies in the non-stationary nature of quality distributions in real-world scenarios. We introduce Continual AQA (CAQA), which equips AQA with continual learning capabilities to handle evolving distributions.
arXiv Detail & Related papers (2025-10-08T10:09:47Z)
- SPELL: Self-Play Reinforcement Learning for evolving Long-Context Language Models [79.01078135582127]
SPELL enables scalable, label-free optimization for long-context reasoning. We introduce an automated curriculum that gradually increases document length and a reward function that adapts question difficulty to the model's evolving capabilities.
arXiv Detail & Related papers (2025-09-28T13:08:10Z)
- Toward Reproducible Cross-Backend Compatibility for Deep Learning: A Configuration-First Framework with Three-Tier Verification [1.5269986601063288]
This paper presents a configuration-first framework for evaluating cross-backend compatibility in deep learning systems. The framework decouples experiments from code using YAML, supports both library and repository models, and employs a three-tier verification protocol. We observe that 72.0% of runs pass, with most discrepancies occurring under stricter thresholds.
arXiv Detail & Related papers (2025-08-29T16:28:28Z)
- AssertCoder: LLM-Based Assertion Generation via Multimodal Specification Extraction [32.14733357890831]
We propose AssertCoder, a novel unified framework that automatically generates high-quality SVAs. AssertCoder employs a modality-sensitive preprocessing to parse heterogeneous specification formats. The framework incorporates a mutation-based evaluation approach to assess assertion quality.
arXiv Detail & Related papers (2025-07-14T14:43:14Z)
- The Surprising Effectiveness of Test-Time Training for Few-Shot Learning [59.309477460893916]
Language models (LMs) have shown impressive performance on tasks within their training distribution, but often struggle with structurally novel tasks. We investigate the effectiveness of test-time training (TTT) as a mechanism for improving LMs' reasoning and few-shot learning capabilities. Our findings highlight the limitations of in-context learning for novel tasks and demonstrate the potential of test-time training to enhance language model adaptability.
arXiv Detail & Related papers (2024-11-11T18:59:45Z)
- Improving Embedding Accuracy for Document Retrieval Using Entity Relationship Maps and Model-Aware Contrastive Sampling [0.0]
APEX-Embedding-7B is a 7-billion parameter decoder-only text Feature Extraction Model.
Our approach employs two training techniques that yield an emergent improvement in factual focus.
Based on our evaluations, our model establishes a new state-of-the-art standard in text feature extraction for longer context document retrieval tasks.
arXiv Detail & Related papers (2024-10-08T17:36:48Z)
- Weighted KL-Divergence for Document Ranking Model Refinement [11.29398362479766]
This paper contrastively reweights KL divergence terms to prioritize the alignment between a student and a teacher model for proper separation of positive and negative documents.
This paper analyzes and evaluates the proposed loss function on the MS MARCO and BEIR datasets to demonstrate its effectiveness in improving the relevance of tested student models.
arXiv Detail & Related papers (2024-06-10T02:29:35Z)
- Do Compressed LLMs Forget Knowledge? An Experimental Study with Practical Implications [63.29358103217275]
Compressing Large Language Models (LLMs) often leads to reduced performance, especially for knowledge-intensive tasks.
We propose two conjectures on the nature of the damage: one is that certain knowledge is forgotten (or erased) after compression.
We introduce a variant called Inference-time Dynamic Prompting (IDP) that can effectively increase prompt diversity without incurring any inference overhead.
arXiv Detail & Related papers (2023-10-02T03:12:06Z)
- Compressing Visual-linguistic Model via Knowledge Distillation [43.73998154661652]
We study knowledge distillation to compress a transformer-based large visual-linguistic model into a small model.
We show that our proposed distillation significantly improves the performance of small VL models on image captioning and visual question answering tasks.
arXiv Detail & Related papers (2021-04-05T18:02:17Z)