DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models
- URL: http://arxiv.org/abs/2410.03061v1
- Date: Fri, 4 Oct 2024 00:53:32 GMT
- Title: DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models
- Authors: Sungnyun Kim, Haofu Liao, Srikar Appalaraju, Peng Tang, Zhuowen Tu, Ravi Kumar Satzoda, R. Manmatha, Vijay Mahadevan, Stefano Soatto
- Abstract summary: This study aims to enhance the generalizability of small VDU models by distilling knowledge from LLMs.
We present a new framework (called DocKD) that enriches the data generation process by integrating external document knowledge.
Experiments show that DocKD produces high-quality document annotations and surpasses the direct knowledge distillation approach.
- Score: 66.91204604417912
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual document understanding (VDU) is a challenging task that involves understanding documents across various modalities (text and image) and layouts (forms, tables, etc.). This study aims to enhance the generalizability of small VDU models by distilling knowledge from LLMs. We identify that directly prompting LLMs often fails to generate informative and useful data. In response, we present a new framework (called DocKD) that enriches the data generation process by integrating external document knowledge. Specifically, we provide an LLM with various document elements, such as key-value pairs, layouts, and descriptions, to elicit open-ended answers. Our experiments show that DocKD produces high-quality document annotations and surpasses the direct knowledge distillation approach that does not leverage external document knowledge. Moreover, student VDU models trained solely with DocKD-generated data are not only comparable to those trained with human-annotated data on in-domain tasks but also significantly outperform them on out-of-domain tasks.
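As a rough illustration of the idea, the sketch below builds a generation prompt from external document elements (key-value pairs, layout, description) and asks an LLM for open-ended QA pairs. The prompt template, helper names, and the stubbed `call_llm` are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of DocKD-style data generation: the prompt exposes external
# document elements so the LLM can produce open-ended QA annotations.
# The template and the `call_llm` stub are illustrative, not the paper's exact setup.
import json


def build_prompt(kv_pairs, layout, description):
    """Serialize external document knowledge into a single generation prompt."""
    return (
        "You are given structured knowledge extracted from a document.\n"
        f"Description: {description}\n"
        f"Key-value pairs: {json.dumps(kv_pairs)}\n"
        f"Layout (element -> region): {json.dumps(layout)}\n"
        "Generate diverse, open-ended question-answer pairs grounded in this "
        "document. Return a JSON list of objects with 'question' and 'answer' keys."
    )


def call_llm(prompt):
    # Placeholder for a real LLM call (e.g., an API client); returns a canned
    # response so the sketch runs end to end.
    return '[{"question": "Who issued the invoice?", "answer": "Acme Corp."}]'


def generate_annotations(kv_pairs, layout, description):
    prompt = build_prompt(kv_pairs, layout, description)
    return json.loads(call_llm(prompt))


if __name__ == "__main__":
    qa = generate_annotations(
        kv_pairs={"Vendor": "Acme Corp.", "Total": "$120.00"},
        layout={"Vendor": "top-left", "Total": "bottom-right"},
        description="A one-page invoice with a line-item table.",
    )
    print(qa)  # candidate training data for the student VDU model
```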
Related papers
- Leveraging Distillation Techniques for Document Understanding: A Case Study with FLAN-T5 [0.0]
We present a novel approach wherein we distill document understanding knowledge from the proprietary LLM ChatGPT into FLAN-T5.
Our findings underscore the potential of distillation techniques in facilitating the deployment of sophisticated language models in real-world scenarios.
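A minimal sketch of what such distillation can look like in practice: the teacher's answers become fine-tuning targets for the student. The teacher call is stubbed and the prompt is a toy example; this is not the paper's pipeline.

```python
# Sequence-level distillation sketch: a teacher LLM produces target answers
# for document prompts, and a FLAN-T5 student is fine-tuned on those pairs.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
student = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
optimizer = torch.optim.AdamW(student.parameters(), lr=5e-5)


def teacher_answer(prompt):
    # Stub for a proprietary teacher (e.g., ChatGPT); returns a canned answer.
    return "The invoice total is $120.00."


prompts = ["Question: What is the invoice total? Context: Total: $120.00"]
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    labels = tokenizer(teacher_answer(prompt), return_tensors="pt").input_ids
    loss = student(**inputs, labels=labels).loss  # standard seq2seq CE loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```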
arXiv Detail & Related papers (2024-09-17T15:37:56Z)
- Instruction-tuned Language Models are Better Knowledge Learners [106.38526595116961]
We propose pre-instruction-tuning (PIT), a method that instruction-tunes on questions prior to training on documents.
Extensive experiments and ablation studies demonstrate that pre-instruction-tuning significantly enhances the ability of LLMs to absorb knowledge from new documents.
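PIT amounts to a reordering of training phases; the schematic below sketches that ordering with stubbed training loops. The dataset names and the `train` helper are illustrative, not the paper's code.

```python
# Pre-instruction-tuning (PIT) schedule sketch: instruction-tune on QA pairs
# first, then continue training on the raw documents the questions are about.
# Only the phase ordering is the point; training internals are stubbed.

def train(model, dataset, objective):
    # Placeholder training loop; a real implementation would run gradient steps.
    print(f"training {model} on {dataset} with {objective} objective")
    return model


model = "base-llm"
# Phase 1: instruction-tune on questions (and answers) BEFORE document exposure.
model = train(model, "qa_pairs_about_new_docs", "instruction-tuning")
# Phase 2: continued pre-training on the new documents themselves.
model = train(model, "new_documents", "language-modeling")
```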
arXiv Detail & Related papers (2024-02-20T09:20:32Z)
- InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions [30.609533589284634]
InstructDoc is the first large-scale collection of 30 publicly available visual document understanding datasets.
InstructDr connects document images, image encoders, and large language models (LLMs) through a trainable bridging module.
Experiments demonstrate that InstructDr can effectively adapt to new VDU datasets, tasks, and domains via given instructions.
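As a rough illustration of what a trainable bridging module can look like, here is a minimal sketch in which a linear projector maps frozen image-encoder features into the LLM's embedding space. The dimensions and the single-layer design are assumptions, not InstructDr's actual architecture.

```python
# Bridging-module sketch: project frozen image-encoder features into the
# LLM embedding space so document images can condition a frozen LLM.
import torch
import torch.nn as nn


class Bridge(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)  # the only trainable part

    def forward(self, image_features):
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(image_features)


bridge = Bridge()
fake_patches = torch.randn(1, 196, 1024)  # stand-in for image-encoder output
soft_prompt = bridge(fake_patches)        # prepended to the LLM's text embeddings
print(soft_prompt.shape)                  # torch.Size([1, 196, 4096])
```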
arXiv Detail & Related papers (2024-01-24T09:09:37Z)
- Privacy-Aware Document Visual Question Answering [44.82362488593259]
This work highlights privacy issues in state-of-the-art multimodal LLMs used for DocVQA.
We propose a large-scale DocVQA dataset comprising invoice documents and associated questions and answers.
We demonstrate that non-private models tend to memorise, a behaviour that can expose private information.
arXiv Detail & Related papers (2023-12-15T06:30:55Z)
- LMDX: Language Model-based Document Information Extraction and Localization [23.656970495804963]
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP).
Their application to extracting information from visually rich documents has not yet been successful.
The main obstacles to adopting LLMs for this task include the absence of layout encoding within LLMs.
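One common workaround for the missing layout encoding, and roughly the direction LMDX takes, is to serialize OCR text with quantized coordinates directly into the prompt. The sketch below illustrates that idea; the coordinate format is an assumption, not the paper's exact scheme.

```python
# Layout-as-text sketch: each OCR word is serialized with coarse, quantized
# coordinates so a text-only LLM can reason about spatial structure.

def serialize_layout(ocr_words, grid=100):
    """ocr_words: list of (text, x, y) with x, y normalized to [0, 1]."""
    lines = []
    for text, x, y in ocr_words:
        qx, qy = int(x * grid), int(y * grid)  # quantize to a coarse grid
        lines.append(f"{text} {qx}|{qy}")
    return "\n".join(lines)


ocr = [("Invoice", 0.08, 0.05), ("Total:", 0.70, 0.90), ("$120.00", 0.85, 0.90)]
prompt = "Document:\n" + serialize_layout(ocr) + "\nExtract the total amount."
print(prompt)
```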
arXiv Detail & Related papers (2023-09-19T22:32:56Z)
- mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding [55.4806974284156]
Document understanding refers to automatically extracting, analyzing, and comprehending information from digital documents, such as web pages.
Existing Multimodal Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow, OCR-free text recognition.
arXiv Detail & Related papers (2023-07-04T11:28:07Z)
- DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale, weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z)
- Layout-Aware Information Extraction for Document-Grounded Dialogue: Dataset, Method and Demonstration [75.47708732473586]
We propose a layout-aware document-level Information Extraction dataset, LIE, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents.
LIE contains 62k annotations of three extraction tasks from 4,061 pages in product and official documents.
Empirical results show that layout is critical for VRD-based extraction, and a system demonstration verifies that the extracted knowledge can help locate the answers that users care about.
arXiv Detail & Related papers (2022-07-14T07:59:45Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)