DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models
- URL: http://arxiv.org/abs/2410.03061v1
- Date: Fri, 4 Oct 2024 00:53:32 GMT
- Title: DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models
- Authors: Sungnyun Kim, Haofu Liao, Srikar Appalaraju, Peng Tang, Zhuowen Tu, Ravi Kumar Satzoda, R. Manmatha, Vijay Mahadevan, Stefano Soatto
- Abstract summary: This study aims to enhance the generalizability of small VDU models by distilling knowledge from LLMs.
We present a new framework (called DocKD) that enriches the data generation process by integrating external document knowledge.
Experiments show that DocKD produces high-quality document annotations and surpasses the direct knowledge distillation approach.
- Score: 66.91204604417912
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual document understanding (VDU) is a challenging task that involves understanding documents across various modalities (text and image) and layouts (forms, tables, etc.). This study aims to enhance the generalizability of small VDU models by distilling knowledge from LLMs. We identify that directly prompting LLMs often fails to generate informative and useful data. In response, we present a new framework (called DocKD) that enriches the data generation process by integrating external document knowledge. Specifically, we provide an LLM with various document elements, such as key-value pairs, layouts, and descriptions, to elicit open-ended answers. Our experiments show that DocKD produces high-quality document annotations and surpasses the direct knowledge distillation approach that does not leverage external document knowledge. Moreover, student VDU models trained solely with DocKD-generated data are not only comparable to those trained with human-annotated data on in-domain tasks but also significantly outperform them on out-of-domain tasks.
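As a rough illustration of the idea, the sketch below builds a generation prompt from external document elements (key-value pairs, layout, description) and asks an LLM for open-ended QA pairs. The prompt template, helper names, and the stubbed `call_llm` are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of DocKD-style data generation: the prompt exposes external
# document elements so the LLM can produce open-ended QA annotations.
# The template and the `call_llm` stub are illustrative, not the paper's exact setup.
import json


def build_prompt(kv_pairs, layout, description):
    """Serialize external document knowledge into a single generation prompt."""
    return (
        "You are given structured knowledge extracted from a document.\n"
        f"Description: {description}\n"
        f"Key-value pairs: {json.dumps(kv_pairs)}\n"
        f"Layout (element -> region): {json.dumps(layout)}\n"
        "Generate diverse, open-ended question-answer pairs grounded in this "
        "document. Return a JSON list of objects with 'question' and 'answer' keys."
    )


def call_llm(prompt):
    # Placeholder for a real LLM call (e.g., an API client); returns a canned
    # response so the sketch runs end to end.
    return '[{"question": "Who issued the invoice?", "answer": "Acme Corp."}]'


def generate_annotations(kv_pairs, layout, description):
    prompt = build_prompt(kv_pairs, layout, description)
    return json.loads(call_llm(prompt))


if __name__ == "__main__":
    qa = generate_annotations(
        kv_pairs={"Vendor": "Acme Corp.", "Total": "$120.00"},
        layout={"Vendor": "top-left", "Total": "bottom-right"},
        description="A one-page invoice with a line-item table.",
    )
    print(qa)  # candidate training data for the student VDU model
```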
Related papers
- Leveraging Distillation Techniques for Document Understanding: A Case Study with FLAN-T5 [0.0]
We present a novel approach wherein we distill document understanding knowledge from the proprietary LLM ChatGPT into FLAN-T5.
Our findings underscore the potential of distillation techniques in facilitating the deployment of sophisticated language models in real-world scenarios.
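A minimal sketch of what such distillation can look like in practice: the teacher's answers become fine-tuning targets for the student. The teacher call is stubbed and the prompt is a toy example; this is not the paper's pipeline.

```python
# Sequence-level distillation sketch: a teacher LLM produces target answers
# for document prompts, and a FLAN-T5 student is fine-tuned on those pairs.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
student = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
optimizer = torch.optim.AdamW(student.parameters(), lr=5e-5)


def teacher_answer(prompt):
    # Stub for a proprietary teacher (e.g., ChatGPT); returns a canned answer.
    return "The invoice total is $120.00."


prompts = ["Question: What is the invoice total? Context: Total: $120.00"]
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    labels = tokenizer(teacher_answer(prompt), return_tensors="pt").input_ids
    loss = student(**inputs, labels=labels).loss  # standard seq2seq CE loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```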
arXiv Detail & Related papers (2024-09-17T15:37:56Z)
- Instruction-tuned Language Models are Better Knowledge Learners [106.38526595116961]
We propose pre-instruction-tuning (PIT), a method that instruction-tunes on questions prior to training on documents.
Extensive experiments and ablation studies demonstrate that pre-instruction-tuning significantly enhances the ability of LLMs to absorb knowledge from new documents.
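PIT amounts to a reordering of training phases; the schematic below sketches that ordering with stubbed training loops. The dataset names and the `train` helper are illustrative, not the paper's code.

```python
# Pre-instruction-tuning (PIT) schedule sketch: instruction-tune on QA pairs
# first, then continue training on the raw documents the questions are about.
# Only the phase ordering is the point; training internals are stubbed.

def train(model, dataset, objective):
    # Placeholder training loop; a real implementation would run gradient steps.
    print(f"training {model} on {dataset} with {objective} objective")
    return model


model = "base-llm"
# Phase 1: instruction-tune on questions (and answers) BEFORE document exposure.
model = train(model, "qa_pairs_about_new_docs", "instruction-tuning")
# Phase 2: continued pre-training on the new documents themselves.
model = train(model, "new_documents", "language-modeling")
```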
arXiv Detail & Related papers (2024-02-20T09:20:32Z)
- InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions [30.609533589284634]
InstructDoc is the first large-scale collection of 30 publicly available visual document understanding datasets.
InstructDr connects document images, image encoders, and large language models (LLMs) through a trainable bridging module.
Experiments demonstrate that InstructDr can effectively adapt to new VDU datasets, tasks, and domains via given instructions.
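As a rough illustration of what a trainable bridging module can look like, here is a minimal sketch in which a linear projector maps frozen image-encoder features into the LLM's embedding space. The dimensions and the single-layer design are assumptions, not InstructDr's actual architecture.

```python
# Bridging-module sketch: project frozen image-encoder features into the
# LLM embedding space so document images can condition a frozen LLM.
import torch
import torch.nn as nn


class Bridge(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)  # the only trainable part

    def forward(self, image_features):
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(image_features)


bridge = Bridge()
fake_patches = torch.randn(1, 196, 1024)  # stand-in for image-encoder output
soft_prompt = bridge(fake_patches)        # prepended to the LLM's text embeddings
print(soft_prompt.shape)                  # torch.Size([1, 196, 4096])
```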
arXiv Detail & Related papers (2024-01-24T09:09:37Z)
- Privacy-Aware Document Visual Question Answering [44.82362488593259]
This work highlights privacy issues in state-of-the-art multimodal LLMs used for DocVQA.
We propose a large-scale DocVQA dataset comprising invoice documents and associated questions and answers.
We demonstrate that non-private models tend to memorise, a behaviour that can expose private information.
arXiv Detail & Related papers (2023-12-15T06:30:55Z)
- LMDX: Language Model-based Document Information Extraction and Localization [23.656970495804963]
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP).
Their application to extracting information from visually rich documents has not yet been successful.
The main obstacles to adopting LLMs for this task include the absence of layout encoding within LLMs.
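One common workaround for the missing layout encoding, and roughly the direction LMDX takes, is to serialize OCR text with quantized coordinates directly into the prompt. The sketch below illustrates that idea; the coordinate format is an assumption, not the paper's exact scheme.

```python
# Layout-as-text sketch: each OCR word is serialized with coarse, quantized
# coordinates so a text-only LLM can reason about spatial structure.

def serialize_layout(ocr_words, grid=100):
    """ocr_words: list of (text, x, y) with x, y normalized to [0, 1]."""
    lines = []
    for text, x, y in ocr_words:
        qx, qy = int(x * grid), int(y * grid)  # quantize to a coarse grid
        lines.append(f"{text} {qx}|{qy}")
    return "\n".join(lines)


ocr = [("Invoice", 0.08, 0.05), ("Total:", 0.70, 0.90), ("$120.00", 0.85, 0.90)]
prompt = "Document:\n" + serialize_layout(ocr) + "\nExtract the total amount."
print(prompt)
```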
arXiv Detail & Related papers (2023-09-19T22:32:56Z)
- mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding [55.4806974284156]
Document understanding refers to automatically extracting, analyzing, and comprehending information from digital documents, such as web pages.
Existing Multimodal Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow, OCR-free text recognition.
arXiv Detail & Related papers (2023-07-04T11:28:07Z)
- DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale, weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z)
- Layout-Aware Information Extraction for Document-Grounded Dialogue: Dataset, Method and Demonstration [75.47708732473586]
We propose a layout-aware document-level Information Extraction dataset, LIE, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents.
LIE contains 62k annotations of three extraction tasks from 4,061 pages in product and official documents.
Empirical results show that layout is critical for VRD-based extraction, and a system demonstration verifies that the extracted knowledge can help locate the answers that users care about.
arXiv Detail & Related papers (2022-07-14T07:59:45Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)