PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling
- URL: http://arxiv.org/abs/2410.05970v1
- Date: Tue, 8 Oct 2024 12:17:42 GMT
- Title: PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling
- Authors: Xudong Xie, Liang Yin, Hao Yan, Yang Liu, Jing Ding, Minghui Liao, Yuliang Liu, Wei Chen, Xiang Bai
- Abstract summary: Document understanding is the challenging task of processing and comprehending large amounts of textual and visual information.
Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task.
We introduce PDF-WuKong, a multimodal large language model (MLLM) which is designed to enhance multimodal question-answering (QA) for long PDF documents.
- Score: 63.93112754821312
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Document understanding is the challenging task of processing and comprehending large amounts of textual and visual information. Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task. However, existing methods typically focus on either plain text or a limited number of document images, and struggle to handle long PDF documents with interleaved text and images, especially academic papers. In this paper, we introduce PDF-WuKong, a multimodal large language model (MLLM) designed to enhance multimodal question answering (QA) over long PDF documents. PDF-WuKong incorporates a sparse sampler that operates on both text and image representations, significantly improving the efficiency and capability of the MLLM. The sparse sampler is integrated with the MLLM's image encoder and selects the paragraphs or diagrams most pertinent to the user's query for processing by the language model. To effectively train and evaluate our model, we construct PaperPDF, a dataset consisting of a broad collection of academic papers sourced from arXiv; multiple strategies are proposed to automatically generate 1M QA pairs along with their corresponding evidence sources. Experimental results demonstrate the superiority and high efficiency of our approach over other models on the task of long multimodal PDF understanding, surpassing proprietary products by an average of 8.6% on F1. Our code and dataset will be released at https://github.com/yh-hust/PDF-Wukong.
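The abstract does not spell out the sampler's mechanics, but the behavior it describes can be illustrated with a minimal sketch: embed the query and every paragraph or diagram into a shared space, then keep only the top-k most similar chunks for the language model. The function and parameter names below are illustrative assumptions, not PDF-WuKong's actual API.

```python
# Minimal sketch of query-conditioned sparse sampling over a long PDF.
# Assumes a shared embedding space for the query and for text/image chunks;
# the encoders producing these embeddings are left abstract.
import numpy as np

def sparse_sample(query_emb: np.ndarray,
                  chunk_embs: np.ndarray,
                  k: int = 8) -> np.ndarray:
    """Return indices of the k chunks most similar to the query.

    query_emb:  (d,)   L2-normalized query embedding
    chunk_embs: (n, d) L2-normalized embeddings of paragraphs/diagrams
    """
    scores = chunk_embs @ query_emb   # cosine similarity (vectors normalized)
    return np.argsort(-scores)[:k]    # top-k most relevant chunk indices

# Usage: feed only the selected paragraphs/diagrams to the language model,
# instead of the full interleaved document.
rng = np.random.default_rng(0)
q = rng.normal(size=128); q /= np.linalg.norm(q)
chunks = rng.normal(size=(500, 128))
chunks /= np.linalg.norm(chunks, axis=1, keepdims=True)
print(sparse_sample(q, chunks, k=8))
```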
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
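A minimal sketch of one plausible reading of multi-granularity token compression, assuming simple average pooling over a square token grid; the pooling choice and granularities are assumptions, not the paper's actual MME.

```python
# Sketch of Matryoshka-style compression of visual tokens: the same token
# grid is pooled to nested, coarser budgets so retrieval can trade accuracy
# for cost. Pooling and factors are assumptions, not the paper's MME.
import numpy as np

def pool_tokens(tokens: np.ndarray, factor: int) -> np.ndarray:
    """Average-pool a (h, w, d) token grid by `factor` along each axis."""
    h, w, d = tokens.shape
    t = tokens.reshape(h // factor, factor, w // factor, factor, d)
    return t.mean(axis=(1, 3)).reshape(-1, d)

tokens = np.random.default_rng(0).normal(size=(16, 16, 64))  # 256 tokens
for factor in (1, 2, 4):            # 256 -> 64 -> 16 visual tokens
    coarse = pool_tokens(tokens, factor)
    print(factor, coarse.shape)     # nested granularities of one image
```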
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework [75.95430061891828]
We introduce M-LongDoc, a benchmark of 851 samples, and an automated framework to evaluate the performance of large multimodal models.
We propose a retrieval-aware tuning approach for efficient and effective multimodal document reading.
arXiv Detail & Related papers (2024-11-09T13:30:38Z)
- Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding [41.43688559565315]
We present a novel OCR-free document understanding framework based on pretrained Multimodal Large Language Models (MLLMs).
Our approach employs multi-scale visual features to handle various font sizes within document images.
We introduce a novel instruction tuning task that strengthens the model's text-reading capability by training it to predict the relative positions of input text.
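A minimal sketch of how such a relative-position sample might be constructed from OCR bounding boxes; the box format, labels, and prompt template are assumptions, not the paper's actual task definition.

```python
# Sketch of a relative-position instruction-tuning example: given two OCR'd
# text snippets and their boxes, ask the model where one lies relative to
# the other. Boxes are (x0, y0, x1, y1), origin at the top-left of the page.
def relative_position(box_a, box_b):
    if box_a[3] <= box_b[1]:
        return "above"
    if box_a[1] >= box_b[3]:
        return "below"
    return "beside"

a = ("Figure 3:", (40, 100, 120, 115))
b = ("Table 2",   (40, 300, 110, 315))
sample = {
    "instruction": f"Where does '{a[0]}' appear relative to '{b[0]}'?",
    "answer": relative_position(a[1], b[1]),   # -> "above"
}
print(sample)
```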
arXiv Detail & Related papers (2024-11-08T00:58:12Z)
- Hierarchical Multi-modal Transformer for Cross-modal Long Document Classification [74.45521856327001]
Classifying long documents that contain hierarchically structured text and embedded images is a new problem.
We propose a novel approach called Hierarchical Multi-modal Transformer (HMT) for cross-modal long document classification.
Our approach uses a multi-modal transformer and a dynamic multi-scale multi-modal transformer to model the complex relationships between image features and section- and sentence-level text features.
arXiv Detail & Related papers (2024-07-14T07:12:25Z)
- From Text to Pixel: Advancing Long-Context Understanding in MLLMs [70.78454154014989]
We introduce SEEKER, a multimodal large language model designed to tackle long-context understanding.
SEEKER aims to optimize the compact encoding of long text by compressing the text sequence into the visual pixel space via images.
Our experiments on six long-context multimodal tasks demonstrate that SEEKER can leverage fewer image tokens to convey the same amount of textual information compared with the OCR-based approach.
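A minimal sketch of the text-to-pixel idea, assuming Pillow for rendering; the canvas size and layout are illustrative, not SEEKER's actual pipeline. Rendering a long passage into one image lets a vision encoder consume it as a fixed number of image tokens rather than many text tokens.

```python
# Sketch: render long text into an image so a vision encoder can consume it
# as image tokens. Canvas size, font, and wrapping are assumptions.
from PIL import Image, ImageDraw
import textwrap

def text_to_image(text: str, size: int = 448, margin: int = 8) -> Image.Image:
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    y = margin
    for line in textwrap.wrap(text, width=70):
        draw.text((margin, y), line, fill="black")
        y += 12                  # default bitmap font is ~11 px tall
        if y > size - margin:
            break                # truncate text that overflows the canvas
    return img

img = text_to_image("A long document passage ... " * 40)
img.save("page.png")             # feed to the vision encoder as one image
```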
arXiv Detail & Related papers (2024-05-23T06:17:23Z)
- TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models [9.232693392690702]
TextHawk is a document-oriented Multimodal Large Language Model (MLLM).
It explores efficient fine-grained perception through four dedicated components.
We conduct extensive experiments on both general and document-oriented MLLM benchmarks, and show that TextHawk outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2024-04-14T09:48:37Z)
- LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document Understanding [0.0]
This paper proposes LayoutLLM, a more flexible document analysis method for understanding imaged documents.
Existing methods enhance document comprehension by incorporating awareness of images, text, and layout structure during pre-training.
Our experiments demonstrate improvement over the baseline model in various document analysis tasks.
arXiv Detail & Related papers (2024-03-21T09:25:24Z)
- mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding [55.4806974284156]
Document understanding refers to automatically extracting, analyzing, and comprehending information from digital documents such as web pages.
Existing Multimodal Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow, OCR-free text recognition.
arXiv Detail & Related papers (2023-07-04T11:28:07Z)