Towards Improving Document Understanding: An Exploration on
Text-Grounding via MLLMs
- URL: http://arxiv.org/abs/2311.13194v2
- Date: Fri, 15 Dec 2023 08:12:32 GMT
- Title: Towards Improving Document Understanding: An Exploration on
Text-Grounding via MLLMs
- Authors: Yonghui Wang, Wengang Zhou, Hao Feng, Keyi Zhou, Houqiang Li
- Abstract summary: We present a text-grounding document understanding model, termed TGDoc, which enhances MLLMs with the ability to discern the spatial positioning of text within images.
We formulate instruction tuning tasks including text detection, recognition, and spotting to facilitate the cohesive alignment between the visual encoder and large language model.
Our method achieves state-of-the-art performance across multiple text-rich benchmarks, validating its effectiveness.
- Score: 96.54224331778195
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the field of document understanding, significant advances have been made
in the fine-tuning of Multimodal Large Language Models (MLLMs) with
instruction-following data. Nevertheless, the potential of text-grounding
capability within text-rich scenarios remains underexplored. In this paper, we
present a text-grounding document understanding model, termed TGDoc, which
addresses this deficiency by enhancing MLLMs with the ability to discern the
spatial positioning of text within images. Empirical evidence suggests that
text-grounding improves the model's interpretation of textual content, thereby
elevating its proficiency in comprehending text-rich images. Specifically, we
compile a dataset containing 99K PowerPoint presentations sourced from the
internet. We formulate instruction tuning tasks including text detection,
recognition, and spotting to facilitate the cohesive alignment between the
visual encoder and large language model. Moreover, we curate a collection of
text-rich images and prompt the text-only GPT-4 to generate 12K high-quality
conversations, featuring textual locations within text-rich scenarios. By
integrating text location data into the instructions, TGDoc is adept at
discerning text locations during the visual question process. Extensive
experiments demonstrate that our method achieves state-of-the-art performance
across multiple text-rich benchmarks, validating its effectiveness.
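The abstract describes instruction tuning tasks (text detection, recognition, spotting) whose responses embed text locations, but it does not specify the prompt templates or coordinate notation. The sketch below is only an illustrative assumption of how such grounded instruction records might be assembled: the function names, prompt wording, and the <x1,y1,x2,y2> normalized-box format are hypothetical, not the paper's published specification.
```python
# Illustrative sketch only: one possible way to package text-grounding
# instruction data (detection / recognition / spotting) into
# conversation-style records with normalized bounding boxes.
# All names, prompts, and the coordinate format are assumptions.

def normalize_box(box, width, height):
    """Map a pixel-space box (x1, y1, x2, y2) to [0, 1] coordinates."""
    x1, y1, x2, y2 = box
    return (round(x1 / width, 3), round(y1 / height, 3),
            round(x2 / width, 3), round(y2 / height, 3))

def make_grounding_sample(image_path, words_with_boxes, width, height,
                          task="spotting"):
    """Build one instruction-tuning record pairing a query with grounded answers."""
    prompts = {
        "detection":   "List the bounding boxes of all text in the image.",
        "recognition": "Read the text inside each given region.",
        "spotting":    "Find all text in the image and give each word with its location.",
    }
    answer_parts = []
    for word, box in words_with_boxes:
        x1, y1, x2, y2 = normalize_box(box, width, height)
        if task == "detection":
            answer_parts.append(f"<{x1},{y1},{x2},{y2}>")
        else:
            answer_parts.append(f"{word} <{x1},{y1},{x2},{y2}>")
    return {
        "image": image_path,
        "conversations": [
            {"from": "human", "value": prompts[task]},
            {"from": "gpt",   "value": "; ".join(answer_parts)},
        ],
    }

# Example with hypothetical data: one slide image containing two words.
sample = make_grounding_sample(
    "slides/deck_001/page_03.png",
    [("Quarterly", (120, 40, 360, 90)), ("Report", (380, 40, 560, 90))],
    width=1280, height=720,
)
print(sample["conversations"][1]["value"])
```
Embedding the location string directly in the answer text, as sketched here, is one way location data could be "integrated into the instructions" so the model learns to emit coordinates during visual question answering.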
Related papers
- Towards Unified Multi-granularity Text Detection with Interactive Attention [56.79437272168507]
"Detect Any Text" is an advanced paradigm that unifies scene text detection, layout analysis, and document page detection into a cohesive, end-to-end model.
A pivotal innovation in DAT is the across-granularity interactive attention module, which significantly enhances the representation learning of text instances.
Tests demonstrate that DAT achieves state-of-the-art performances across a variety of text-related benchmarks.
arXiv Detail & Related papers (2024-05-30T07:25:23Z) - TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding [91.30065932213758]
Large Multimodal Models (LMMs) have sparked a surge in research aimed at harnessing their remarkable reasoning abilities.
We propose TextCoT, a novel Chain-of-Thought framework for text-rich image understanding.
Our method is free of extra training, offering immediate plug-and-play functionality.
arXiv Detail & Related papers (2024-04-15T13:54:35Z) - ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models [92.60282074937305]
We introduce ConTextual, a novel dataset featuring human-crafted instructions that require context-sensitive reasoning for text-rich images.
We conduct experiments to assess the performance of 14 foundation models and establish a human performance baseline.
We observe a significant performance gap of 30.8% between GPT-4V and human performance.
arXiv Detail & Related papers (2024-01-24T09:07:11Z) - TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z) - HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval [13.061063817876336]
We propose a novel Hierarchical Graph Alignment Network (HGAN) for image-text retrieval.
First, to capture comprehensive multimodal features, we construct feature graphs for the image and text modalities, respectively.
Then, a multi-granularity shared space is established with a designed Multi-granularity Feature Aggregation and Rearrangement (MFAR) module.
Finally, the ultimate image and text features are further refined through three-level similarity functions to achieve the hierarchical alignment.
arXiv Detail & Related papers (2022-12-16T05:08:52Z) - Language Matters: A Weakly Supervised Pre-training Approach for Scene
Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves F-score by +2.5% and +4.8%, respectively, when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)