DONUT-hole: DONUT Sparsification by Harnessing Knowledge and Optimizing
Learning Efficiency
- URL: http://arxiv.org/abs/2311.05778v1
- Date: Thu, 9 Nov 2023 22:49:05 GMT
- Title: DONUT-hole: DONUT Sparsification by Harnessing Knowledge and Optimizing
Learning Efficiency
- Authors: Azhar Shaikh and Michael Cochez and Denis Diachkov and Michiel de
Rijcke and Sahar Yousefi
- Abstract summary: This paper introduces DONUT-hole, a sparse OCR-free visual document understanding (VDU) model that addresses the limitations of its predecessor model, dubbed DONUT.
Our paradigm to produce DONUT-hole, reduces the model denisty by 54% while preserving performance.
- Score: 5.006064616335817
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: This paper introduces DONUT-hole, a sparse OCR-free visual document
understanding (VDU) model that addresses the limitations of its predecessor
model, dubbed DONUT. The DONUT model, leveraging a transformer architecture,
overcoming the challenges of separate optical character recognition (OCR) and
visual semantic understanding (VSU) components. However, its deployment in
production environments and edge devices is hindered by high memory and
computational demands, particularly in large-scale request services. To
overcome these challenges, we propose an optimization strategy based on
knowledge distillation and model pruning. Our paradigm to produce DONUT-hole,
reduces the model denisty by 54\% while preserving performance. We also achieve
a global representational similarity index between DONUT and DONUT-hole based
on centered kernel alignment (CKA) metric of 0.79. Moreover, we evaluate the
effectiveness of DONUT-hole in the document image key information extraction
(KIE) task, highlighting its potential for developing more efficient VDU
systems for logistic companies.
Related papers
- Ultra-Resolution Adaptation with Ease [62.56434979517156]
We propose a set of key guidelines for ultra-resolution adaptation termed emphURAE.
We show that tuning minor components of the weight matrices outperforms widely-used low-rank adapters when synthetic data are unavailable.
Experiments validate that URAE achieves comparable 2K-generation performance to state-of-the-art closed-source models like FLUX1.1 [Pro] Ultra with only 3K samples and 2K iterations.
arXiv Detail & Related papers (2025-03-20T16:44:43Z) - Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation [11.217033010884006]
We show that scaling the decoder yields significant performance gains, always exceeding those achieved by encoder scaling alone.
We also identify label noise as a key challenge in Scene Text Recognition, particularly in real-world data.
Our methodology achieves state-of-the-art performance on 10 out of 11 benchmarks using only real data, while significantly reducing the parameter size and computational costs.
arXiv Detail & Related papers (2025-03-20T14:35:46Z) - AI-in-the-Loop Sensing and Communication Joint Design for Edge Intelligence [65.29835430845893]
We propose a framework that enhances edge intelligence through AI-in-the-loop joint sensing and communication.
A key contribution of our work is establishing an explicit relationship between validation loss and the system's tunable parameters.
We show that our framework reduces communication energy consumption by up to 77 percent and sensing costs measured by the number of samples by up to 52 percent.
arXiv Detail & Related papers (2025-02-14T14:56:58Z) - Causal Context Adjustment Loss for Learned Image Compression [72.7300229848778]
In recent years, learned image compression (LIC) technologies have surpassed conventional methods notably in terms of rate-distortion (RD) performance.
Most present techniques are VAE-based with an autoregressive entropy model, which obviously promotes the RD performance by utilizing the decoded causal context.
In this paper, we make the first attempt in investigating the way to explicitly adjust the causal context with our proposed Causal Context Adjustment loss.
arXiv Detail & Related papers (2024-10-07T09:08:32Z) - Embedding-Free Transformer with Inference Spatial Reduction for Efficient Semantic Segmentation [15.377463849213033]
EFA is a novel global context modeling mechanism that focuses on functioning the global non-linearity.
Our ISR method reduces the key-value resolution at the inference phase, which can mitigate the computation-performance trade-off gap.
EDAFormer shows the state-of-the-art performance with the efficient computation compared to the existing transformer-based semantic segmentation models.
arXiv Detail & Related papers (2024-07-24T13:24:25Z) - Any Image Restoration with Efficient Automatic Degradation Adaptation [132.81912195537433]
We propose a unified manner to achieve joint embedding by leveraging the inherent similarities across various degradations for efficient and comprehensive restoration.
Our network sets new SOTA records while reducing model complexity by approximately -82% in trainable parameters and -85% in FLOPs.
arXiv Detail & Related papers (2024-07-18T10:26:53Z) - HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models [96.76995840807615]
HiRes-LLaVA is a novel framework designed to process any size of high-resolution input without altering the original contextual and geometric information.
HiRes-LLaVA comprises two innovative components: (i) a SliceRestore adapter that reconstructs sliced patches into their original form, efficiently extracting both global and local features via down-up-sampling and convolution layers, and (ii) a Self-Mining Sampler to compress the vision tokens based on themselves.
arXiv Detail & Related papers (2024-07-11T17:42:17Z) - PREM: A Simple Yet Effective Approach for Node-Level Graph Anomaly
Detection [65.24854366973794]
Node-level graph anomaly detection (GAD) plays a critical role in identifying anomalous nodes from graph-structured data in domains such as medicine, social networks, and e-commerce.
We introduce a simple method termed PREprocessing and Matching (PREM for short) to improve the efficiency of GAD.
Our approach streamlines GAD, reducing time and memory consumption while maintaining powerful anomaly detection capabilities.
arXiv Detail & Related papers (2023-10-18T02:59:57Z) - Accurate and Structured Pruning for Efficient Automatic Speech
Recognition [23.897482741744117]
We propose a novel compression strategy to reduce the model size and inference cost of the Conformer model.
Our method achieves a 50% reduction in model size and a 28% reduction in inference cost with minimal performance loss.
arXiv Detail & Related papers (2023-05-31T04:31:16Z) - Hierarchical State Abstraction Based on Structural Information
Principles [70.24495170921075]
We propose a novel mathematical Structural Information principles-based State Abstraction framework, namely SISA, from the information-theoretic perspective.
SISA is a general framework that can be flexibly integrated with different representation-learning objectives to improve their performances further.
arXiv Detail & Related papers (2023-04-24T11:06:52Z) - Controllable Textual Inversion for Personalized Text-to-Image Generation [24.18758951295929]
Text inversion (TI) is proposed as an effective technique in personalizing the generation when the prompts contain user-defined, unseen or long-tail concept tokens.
In this work, we propose a much-enhanced version of TI, dubbed Controllable Textual Inversion (COTI) in resolving all the aforementioned problems and in turn delivering a robust, data-efficient and easy-to-use framework.
arXiv Detail & Related papers (2023-04-11T14:56:44Z) - Studying How to Efficiently and Effectively Guide Models with Explanations [52.498055901649025]
'Model guidance' is the idea of regularizing the models' explanations to ensure that they are "right for the right reasons"
We conduct an in-depth evaluation across various loss functions, attribution methods, models, and 'guidance depths' on the PASCAL VOC 2007 and MS COCO 2014 datasets.
Specifically, we guide the models via bounding box annotations, which are much cheaper to obtain than the commonly used segmentation masks.
arXiv Detail & Related papers (2023-03-21T15:34:50Z) - Donut: Document Understanding Transformer without OCR [17.397447819420695]
We propose a novel VDU model that is end-to-end trainable without underpinning OCR framework.
Our approach achieves state-of-the-art performance on various document understanding tasks in public benchmark datasets and private industrial service datasets.
arXiv Detail & Related papers (2021-11-30T18:55:19Z) - Hierarchical and Efficient Learning for Person Re-Identification [19.172946887940874]
We propose a novel Hierarchical and Efficient Network (HENet) that learns hierarchical global, partial, and recovery features ensemble under the supervision of multiple loss combinations.
We also propose a new dataset augmentation approach, dubbed Random Polygon Erasing (RPE), to random erase irregular area of the input image for imitating the body part missing.
arXiv Detail & Related papers (2020-05-18T15:45:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.