Towards Improving Document Understanding: An Exploration on
Text-Grounding via MLLMs
- URL: http://arxiv.org/abs/2311.13194v2
- Date: Fri, 15 Dec 2023 08:12:32 GMT
- Title: Towards Improving Document Understanding: An Exploration on
Text-Grounding via MLLMs
- Authors: Yonghui Wang, Wengang Zhou, Hao Feng, Keyi Zhou, Houqiang Li
- Abstract summary: We present a text-grounding document understanding model, termed TGDoc, which enhances MLLMs with the ability to discern the spatial positioning of text within images.
We formulate instruction tuning tasks including text detection, recognition, and spotting to facilitate the cohesive alignment between the visual encoder and large language model.
Our method achieves state-of-the-art performance across multiple text-rich benchmarks, validating its effectiveness.
- Score: 96.54224331778195
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the field of document understanding, significant advances have been made
in the fine-tuning of Multimodal Large Language Models (MLLMs) with
instruction-following data. Nevertheless, the potential of text-grounding
capability within text-rich scenarios remains underexplored. In this paper, we
present a text-grounding document understanding model, termed TGDoc, which
addresses this deficiency by enhancing MLLMs with the ability to discern the
spatial positioning of text within images. Empirical evidence suggests that
text-grounding improves the model's interpretation of textual content, thereby
elevating its proficiency in comprehending text-rich images. Specifically, we
compile a dataset containing 99K PowerPoint presentations sourced from the
internet. We formulate instruction tuning tasks including text detection,
recognition, and spotting to facilitate the cohesive alignment between the
visual encoder and large language model. Moreover, we curate a collection of
text-rich images and prompt the text-only GPT-4 to generate 12K high-quality
conversations, featuring textual locations within text-rich scenarios. By
integrating text location data into the instructions, TGDoc is adept at
discerning text locations during the visual question answering process. Extensive
experiments demonstrate that our method achieves state-of-the-art performance
across multiple text-rich benchmarks, validating its effectiveness.
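As a rough illustration of how such text-grounding instruction data might be assembled from OCR-style annotations, here is a minimal Python sketch. The prompt templates, the normalized (x1, y1, x2, y2) coordinate format, and the helper names below are assumptions for illustration only; the abstract states that text locations are embedded in the instructions but does not give the exact format.

```python
# Illustrative sketch only: builds instruction-tuning samples for text detection,
# recognition, and spotting from OCR-style annotations. The prompt wording and
# coordinate format are assumptions, not TGDoc's actual data format.

def normalize_box(box, width, height, ndigits=3):
    """Scale an absolute (x1, y1, x2, y2) box to the [0, 1] range."""
    x1, y1, x2, y2 = box
    return (round(x1 / width, ndigits), round(y1 / height, ndigits),
            round(x2 / width, ndigits), round(y2 / height, ndigits))

def build_grounding_samples(image_id, ocr_words, width, height):
    """ocr_words: list of (text, (x1, y1, x2, y2)) pairs in pixel coordinates."""
    samples = []
    for text, box in ocr_words:
        loc = normalize_box(box, width, height)
        # Text detection: where is a given string?
        samples.append({
            "image": image_id,
            "instruction": f"Where is the text \"{text}\" located in the image?",
            "response": f"The text \"{text}\" is located at {loc}.",
        })
        # Text recognition: what is written in a given region?
        samples.append({
            "image": image_id,
            "instruction": f"What text is written inside the region {loc}?",
            "response": text,
        })
    # Text spotting: detect and read all text in one pass.
    spotted = "; ".join(f"\"{t}\" at {normalize_box(b, width, height)}"
                        for t, b in ocr_words)
    samples.append({
        "image": image_id,
        "instruction": "Find all text in the image and give each string with its location.",
        "response": spotted,
    })
    return samples

# Example usage with a single annotated word:
print(build_grounding_samples("slide_001.png", [("Revenue", (120, 40, 360, 90))], 1280, 720))
```

Normalizing coordinates to a fixed range is one common way to keep location tokens consistent across image resolutions when they are written into plain-text instructions.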
Related papers
- DoPTA: Improving Document Layout Analysis using Patch-Text Alignment [3.3181276611945267]
We present a novel image-text alignment technique specially designed for leveraging the textual information in document images to improve performance on visual tasks.
Our document encoder model DoPTA, trained with this technique, demonstrates strong performance on a wide range of document image understanding tasks without requiring OCR during inference.
DoPTA also sets new state-of-the-art results on D4LA and FUNSD, two challenging document visual analysis benchmarks.
arXiv Detail & Related papers (2024-12-17T13:26:31Z)
- Towards Unified Multi-granularity Text Detection with Interactive Attention [56.79437272168507]
"Detect Any Text" is an advanced paradigm that unifies scene text detection, layout analysis, and document page detection into a cohesive, end-to-end model.
A pivotal innovation in DAT is the across-granularity interactive attention module, which significantly enhances the representation learning of text instances.
Tests demonstrate that DAT achieves state-of-the-art performance across a variety of text-related benchmarks.
arXiv Detail & Related papers (2024-05-30T07:25:23Z)
- TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding [91.30065932213758]
Large Multimodal Models (LMMs) have sparked a surge in research aimed at harnessing their remarkable reasoning abilities.
We propose TextCoT, a novel Chain-of-Thought framework for text-rich image understanding.
Our method is free of extra training, offering immediate plug-and-play functionality.
arXiv Detail & Related papers (2024-04-15T13:54:35Z)
- ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models [92.60282074937305]
We introduce ConTextual, a novel dataset featuring human-crafted instructions that require context-sensitive reasoning for text-rich images.
We conduct experiments to assess the performance of 14 foundation models and establish a human performance baseline.
We observe a significant 30.8% performance gap between GPT-4V and humans.
arXiv Detail & Related papers (2024-01-24T09:07:11Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
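As a hedged sketch of this multi-branch pattern, the following shows learnable queries attending to shared image features before separate classification, segmentation, and recognition heads. This is not TextFormer's actual architecture: the class name, backbone, head designs, query count, and vocabulary size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiTaskSpotterSketch(nn.Module):
    """Shared image features feed a query-based decoder with three heads:
    classification, segmentation, and recognition (all sizes are assumptions)."""
    def __init__(self, dim=256, num_queries=25, vocab=97, max_len=25):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, dim, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.cls_head = nn.Linear(dim, 2)                # text / background
        self.rec_head = nn.Linear(dim, max_len * vocab)  # per-query transcription
        self.max_len, self.vocab = max_len, vocab

    def forward(self, images):                            # (B, 3, H, W)
        feats = self.backbone(images)                      # (B, C, h, w)
        B, C, h, w = feats.shape
        mem = feats.flatten(2).transpose(1, 2)              # (B, h*w, C)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)     # (B, Q, C)
        q = self.decoder(q, mem)                            # queries attend to image features
        cls_logits = self.cls_head(q)                        # (B, Q, 2)
        masks = torch.einsum("bqc,bchw->bqhw", q, feats)     # (B, Q, h, w) segmentation logits
        rec_logits = self.rec_head(q).view(B, -1, self.max_len, self.vocab)
        return cls_logits, masks, rec_logits

model = MultiTaskSpotterSketch()
cls_logits, masks, rec_logits = model(torch.randn(2, 3, 256, 256))
print(cls_logits.shape, masks.shape, rec_logits.shape)
```

The three heads share the same query embeddings, which is one way a spotter can let its detection, segmentation, and recognition branches train jointly over shared features.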
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
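The last entry above describes a network built from an image encoder and a character-aware text encoder. Below is a minimal dual-encoder sketch of that structure; the class names, layer sizes, character-embedding scheme, and the CLIP-style contrastive objective are assumptions for illustration, since the snippet does not describe the actual weakly supervised pre-training objective.

```python
# Minimal sketch, assuming a dual-encoder (image + character-aware text) setup
# aligned with a generic contrastive loss. Not the paper's actual design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(128, dim)

    def forward(self, images):                    # (B, 3, H, W)
        feats = self.backbone(images).flatten(1)   # (B, 128)
        return F.normalize(self.proj(feats), dim=-1)

class CharAwareTextEncoder(nn.Module):
    """Encodes a word as a sequence of character embeddings."""
    def __init__(self, vocab_size=128, dim=256):
        super().__init__()
        self.char_embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, char_ids):                   # (B, L) character indices
        emb = self.char_embed(char_ids)
        _, h = self.rnn(emb)                       # final hidden state (1, B, dim)
        return F.normalize(self.proj(h.squeeze(0)), dim=-1)

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss pairing the i-th image crop with the i-th word."""
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Example: align 4 random image crops with 4 dummy character sequences.
img_enc, txt_enc = ImageEncoder(), CharAwareTextEncoder()
images = torch.randn(4, 3, 64, 256)
chars = torch.randint(0, 128, (4, 12))
loss = clip_style_loss(img_enc(images), txt_enc(chars))
loss.backward()
```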
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.