Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding
- URL: http://arxiv.org/abs/2411.05254v1
- Date: Fri, 08 Nov 2024 00:58:12 GMT
- Title: Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding
- Authors: Jaeyoo Park, Jin Young Choi, Jeonghyung Park, Bohyung Han,
- Abstract summary: We present a novel OCR-free document understanding framework based on pretrained Multimodal Large Language Models (MLLMs)
Our approach employs multi-scale visual features to handle various font sizes within document images.
We introduce a novel instruction tuning task, which facilitates the model's text-reading capability by learning to predict the relative positions of input text.
- Score: 41.43688559565315
- License:
- Abstract: We present a novel OCR-free document understanding framework based on pretrained Multimodal Large Language Models (MLLMs). Our approach employs multi-scale visual features to effectively handle various font sizes within document images. To address the increasing costs of considering the multi-scale visual inputs for MLLMs, we propose the Hierarchical Visual Feature Aggregation (HVFA) module, designed to reduce the number of input tokens to LLMs. Leveraging a feature pyramid with cross-attentive pooling, our approach effectively manages the trade-off between information loss and efficiency without being affected by varying document image sizes. Furthermore, we introduce a novel instruction tuning task, which facilitates the model's text-reading capability by learning to predict the relative positions of input text, eventually minimizing the risk of truncated text caused by the limited capacity of LLMs. Comprehensive experiments validate the effectiveness of our approach, demonstrating superior performance in various document understanding tasks.
Related papers
- Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy [37.471419716572086]
There is a significant gap in instruction-following capabilities between Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs)
We propose Visual-Modality Token Compression (VMTC) and Cross-Modality Attention Inhibition (CMAI) strategies to alleviate this gap.
arXiv Detail & Related papers (2024-11-23T05:03:32Z) - Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z) - From Text to Pixel: Advancing Long-Context Understanding in MLLMs [70.78454154014989]
We introduce SEEKER, a multimodal large language model designed to tackle this issue.
SEEKER aims to optimize the compact encoding of long text by compressing the text sequence into the visual pixel space via images.
Our experiments on six long-context multimodal tasks demonstrate that SEEKER can leverage fewer image tokens to convey the same amount of textual information compared with the OCR-based approach.
arXiv Detail & Related papers (2024-05-23T06:17:23Z) - TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models [9.232693392690702]
TextHawk is a document-oriented Multimodal Large Language Model (MLLM)
It is designed to explore efficient fine-grained perception by designing four dedicated components.
We conduct extensive experiments on both general and document-oriented MLLM benchmarks, and show that TextHawk outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2024-04-14T09:48:37Z) - HRVDA: High-Resolution Visual Document Assistant [32.51417315241559]
We propose a High-Resolution Visual Document Assistant (HRVDA) to bridge the gap between MLLMs and visual document understanding.
HRVDA employs a content filtering mechanism and an instruction filtering module to filter out the content-agnostic visual tokens and instruction-agnostic visual tokens.
Our model achieves state-of-the-art performance across multiple document understanding datasets.
arXiv Detail & Related papers (2024-04-10T11:10:50Z) - Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z) - Enhancing Visual Document Understanding with Contrastive Learning in
Large Visual-Language Models [56.76307866160105]
We propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo)
DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and align them to the visual features generated by the vision encoder of Large Visual-Language Models (LVLMs)
We demonstrate that the proposed DoCo serves as a plug-and-play pre-training method, which can be employed in the pre-training of various LVLMs without inducing any increase in computational complexity during the inference process.
arXiv Detail & Related papers (2024-02-29T10:17:27Z) - Incorporating Visual Experts to Resolve the Information Loss in
Multimodal Large Language Models [121.83413400686139]
This paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism.
We introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLMs training and inference pipeline.
arXiv Detail & Related papers (2024-01-06T02:02:34Z) - Controllable Multi-document Summarization: Coverage & Coherence
Intuitive Policy with Large Language Model Based Rewards [42.171703872560286]
Controllability is a matter of concern when it comes to text generation tasks with long inputs, such as multi-document summarization.
We train a controllable content extraction scheme to extract the text that will be refined by an LLM.
Our approach yields competitive results in the evaluation using ROUGE metrics and outperforms potential baselines in coherence.
arXiv Detail & Related papers (2023-10-05T11:29:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.