Interpret, prune and distill Donut : towards lightweight VLMs for VQA on document
- URL: http://arxiv.org/abs/2509.26235v1
- Date: Tue, 30 Sep 2025 13:31:03 GMT
- Title: Interpret, prune and distill Donut : towards lightweight VLMs for VQA on document
- Authors: Adnan Ben Mansour, Ayoub Karine, David Naccache
- Abstract summary: We investigate model compression through knowledge distillation, training compact student models from a larger teacher. We leverage mechanistic interpretability to drive student architecture design within this framework. This approach yields Donut-MINT, a pruned Donut variant that reduces inference time and memory usage while maintaining strong performance on DocVQA.
- Score: 1.733255162390776
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent advances in Visually-rich Document Understanding rely on large Vision-Language Models like Donut, which perform document-level Visual Question Answering without Optical Character Recognition. Despite their effectiveness, these models are too costly for real-time or resource-constrained applications. We investigate model compression through knowledge distillation, training compact student models from a larger teacher. We leverage mechanistic interpretability to drive student architecture design within this framework. By analyzing internal computations, we identify essential subcomponents to retain, while having a clear view of which subcomponents should be approximated, skipped, or reparametrized based on their function. This approach yields Donut-MINT (Mechanistic Interpretability-based Network Trimming), a pruned Donut variant that reduces inference time and memory usage while maintaining strong performance on DocVQA, a standard benchmark for document Visual Question Answering. Our method reframes compression as circuit discovery, bridging interpretability research and practical Vision-Language Model deployment.
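As an illustration of the knowledge-distillation setup the abstract mentions, the following is a minimal sketch of a standard temperature-scaled distillation loss in plain Python. It is not the paper's training objective, which this listing does not specify; the function names `softmax` and `distillation_loss` and the default temperature are assumptions made for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is the usual rescaling that keeps gradient magnitudes
    comparable across temperatures. Hypothetical helper, not the paper's loss.
    """
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


if __name__ == "__main__":
    teacher = [4.0, 1.0, 0.5]
    print(distillation_loss([4.0, 1.0, 0.5], teacher))  # student matches teacher
    print(distillation_loss([0.5, 1.0, 4.0], teacher))  # student disagrees
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge; in practice it is typically combined with a supervised loss on the ground-truth answers.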
Related papers
- PARL: Position-Aware Relation Learning Network for Document Layout Analysis [23.497081928689525]
We argue that effective layout analysis depends not on text-visual fusion, but on a deep understanding of documents' intrinsic visual structure. We propose a novel OCR-free, vision-only framework that models layout through positional sensitivity and relational structure. Experiments show that PARL (65M) is highly efficient, using roughly four times fewer parameters than large multimodal models.
arXiv Detail & Related papers (2026-01-12T15:05:35Z)
- Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding [49.26132236798123]
Vision Language Models (VLMs) have gradually become a primary approach in document understanding. We propose SLEUTH, a multi-agent framework that orchestrates a retriever and four collaborative agents in a coarse-to-fine process. The framework identifies key textual and visual clues within the retrieved pages, filters for salient visual evidence such as tables and charts, and analyzes the query to devise a reasoning strategy.
arXiv Detail & Related papers (2025-11-28T03:09:40Z)
- EdiVal-Agent: An Object-Centric Framework for Automated, Scalable, Fine-Grained Evaluation of Multi-Turn Editing [170.71134330650796]
EdiVal-Agent is an evaluation framework for multi-turn instruction-based editing. It synthesizes semantically meaningful objects, then synthesizes diverse, context-aware editing instructions. It integrates vision-language models with object detectors to assess instruction following, uses semantic-level feature extractors to evaluate content consistency, and leverages human preference models to judge visual quality.
arXiv Detail & Related papers (2025-09-16T17:45:39Z)
- DocVXQA: Context-Aware Visual Explanations for Document Question Answering [12.416787701296236]
We propose DocVXQA, a novel framework for visually self-explainable document question answering. The framework is designed not only to produce accurate answers to questions but also to learn visual heatmaps that highlight contextually critical regions.
arXiv Detail & Related papers (2025-05-12T12:30:16Z)
- If Concept Bottlenecks are the Question, are Foundation Models the Answer? [20.91927788087174]
Concept Bottleneck Models (CBMs) are neural networks designed to conjoin high performance with ante-hoc interpretability. "VLM-CBM" architectures replace manual annotations with weak supervision from foundation models. We put state-of-the-art VLM-CBMs to the test, analyzing their learned concepts empirically using a selection of significant metrics.
arXiv Detail & Related papers (2025-04-28T13:18:48Z)
- DocMamba: Efficient Document Pre-training with State Space Model [56.84200017560988]
We present DocMamba, a novel framework based on the state space model. It is designed to reduce computational complexity to linear while preserving global modeling capabilities. Experiments on the HRDoc confirm DocMamba's potential for length extrapolation.
arXiv Detail & Related papers (2024-09-18T11:34:28Z)
- Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach [87.8330887605381]
We show how to adapt a pre-trained Vision Transformer to downstream recognition tasks with only a few learnable parameters.
We synthesize a task-specific query with a learnable and lightweight module, which is independent of the pre-trained model.
Our method achieves state-of-the-art performance under memory constraints, showcasing its applicability in real-world situations.
arXiv Detail & Related papers (2024-07-09T15:45:04Z)
- DistilDoc: Knowledge Distillation for Visually-Rich Document Applications [22.847266820057985]
This work explores knowledge distillation for visually-rich document applications such as document layout analysis (DLA) and document image classification (DIC). We design a KD experimentation methodology for more lean, performant models on document understanding tasks that are integral within larger task pipelines. We study what affects the teacher-student knowledge gap and find that some methods (tuned vanilla KD, MSE, SimKD with an apt projector) can consistently outperform supervised student training.
arXiv Detail & Related papers (2024-06-12T13:55:12Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation tasks on NYU depth V2 and KITTI, and in semantic segmentation task on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
- Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of ego-centric representations on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-awareness video model to improve performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z)
- Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering [13.942561172695815]
We find that instruction-tuning language models like Claude and ChatGPT can understand layout by spaces and line breaks.
We propose the LAyout and Task aware Instruction Prompt (LATIN-Prompt) to improve the performance of small instruction-tuning models like Alpaca.
arXiv Detail & Related papers (2023-06-01T10:28:12Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
- Augmenting Pre-trained Language Models with QA-Memory for Open-Domain Question Answering [38.071375112873675]
We propose a question-answer augmented encoder-decoder model and accompanying pretraining strategy.
This yields an end-to-end system that outperforms prior QA retrieval methods on single-hop QA tasks.
arXiv Detail & Related papers (2022-04-10T02:33:00Z)