UReader: Universal OCR-free Visually-situated Language Understanding
with Multimodal Large Language Model
- URL: http://arxiv.org/abs/2310.05126v1
- Date: Sun, 8 Oct 2023 11:33:09 GMT
- Title: UReader: Universal OCR-free Visually-situated Language Understanding
with Multimodal Large Language Model
- Authors: Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu,
Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, Qin Jin, Liang He, Xin Alex
Lin, Fei Huang
- Abstract summary: We propose UReader, a first exploration of universal OCR-free visually-situated language understanding based on the Multimodal Large Language Model (MLLM).
By leveraging the shallow text recognition ability of the MLLM, we fine-tuned only 1.2% of the parameters.
Our single model achieves state-of-the-art OCR-free performance on 8 out of 10 visually-situated language understanding tasks.
- Score: 108.85584502396182
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text is ubiquitous in our visual world, conveying crucial information, such
as in documents, websites, and everyday photographs. In this work, we propose
UReader, a first exploration of universal OCR-free visually-situated language
understanding based on the Multimodal Large Language Model (MLLM). By
leveraging the shallow text recognition ability of the MLLM, we fine-tuned only
1.2% of the parameters, and the training cost is much lower than that of previous
work following domain-specific pretraining and finetuning paradigms. Concretely,
UReader is jointly finetuned on a wide range of Visually-situated Language
Understanding tasks via a unified instruction format. To enhance the visual
text and semantic understanding, we further apply two auxiliary tasks with the
same format, namely text reading and key points generation tasks. We design a
shape-adaptive cropping module before the encoder-decoder architecture of MLLM
to leverage the frozen low-resolution vision encoder for processing
high-resolution images. Without downstream finetuning, our single model
achieves state-of-the-art OCR-free performance on 8 out of 10 visually-situated
language understanding tasks, across 5 domains: documents, tables, charts,
natural images, and webpage screenshots. Code and instruction-tuning datasets
will be released.
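The shape-adaptive cropping step described above can be sketched as follows. This is a minimal interpretation of the idea from the abstract, not the authors' released code: the 224x224 encoder input size, the candidate grid set, and the aspect-ratio scoring rule are all illustrative assumptions.

```python
# Sketch of shape-adaptive cropping: pick a grid of encoder-sized cells whose
# layout best matches the image's aspect ratio, then compute the crop boxes.
# ENC (the frozen encoder's fixed input resolution) and GRIDS are assumptions.

ENC = 224  # assumed fixed input resolution of the frozen vision encoder

# Candidate (rows, cols) grids; here, all layouts with at most 6 cells.
GRIDS = [(r, c) for r in range(1, 7) for c in range(1, 7) if r * c <= 6]

def select_grid(h, w):
    """Pick the grid whose cols/rows ratio is closest to the image's w/h."""
    img_ratio = w / h
    return min(GRIDS, key=lambda g: abs((g[1] / g[0]) - img_ratio))

def crop_boxes(h, w):
    """Return the resize target (th, tw) and (left, top, right, bottom) boxes."""
    rows, cols = select_grid(h, w)
    th, tw = rows * ENC, cols * ENC  # the image is resized to this size first
    boxes = [(c * ENC, r * ENC, (c + 1) * ENC, (r + 1) * ENC)
             for r in range(rows) for c in range(cols)]
    return (th, tw), boxes

# Example: a wide 800x2400 page maps to a 1x3 grid of 224x224 crops.
(th, tw), boxes = crop_boxes(800, 2400)
```

Each crop is then fed to the frozen low-resolution encoder independently; the abstract also implies a globally resized view of the full image, which this sketch omits for brevity.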
Related papers
- VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval [10.603148564713518]
We present a new embedding model VISTA for universal multi-modal retrieval.
We introduce a flexible architecture which extends a powerful text encoder with image understanding capabilities.
We also develop two data generation strategies, which bring high-quality composed image-text pairs to facilitate the training of the embedding model.
arXiv Detail & Related papers (2024-06-06T17:37:47Z)
- StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception, Comprehension, and Beyond [68.0107158115377]
We have crafted an efficient vision-language model, StrucTexTv3, tailored to tackle various intelligent tasks for text-rich images.
We enhance the perception and comprehension abilities of StrucTexTv3 through instruction learning.
Our method achieved SOTA results in text-rich image perception tasks, and significantly improved performance in comprehension tasks.
arXiv Detail & Related papers (2024-05-31T16:55:04Z)
- mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding [100.17063271791528]
We propose the Unified Structure Learning to boost the performance of MLLMs.
Our model DocOwl 1.5 achieves state-of-the-art performance on 10 visual document understanding benchmarks.
arXiv Detail & Related papers (2024-03-19T16:48:40Z)
- MLLMs-Augmented Visual-Language Representation Learning [70.5293060238008]
We demonstrate that Multi-modal Large Language Models (MLLMs) can enhance visual-language representation learning.
Our approach is simple, utilizing MLLMs to extend multiple diverse captions for each image.
We propose "text shearing" to maintain the quality and availability of extended captions.
arXiv Detail & Related papers (2023-11-30T18:05:52Z)
- mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding [55.4806974284156]
Document understanding refers to automatically extracting, analyzing, and comprehending information from digital documents, such as web pages.
Existing Multimodal Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition.
arXiv Detail & Related papers (2023-07-04T11:28:07Z)
- Towards Models that Can See and Read [12.078407046266982]
Visual Question Answering (VQA) and Image Captioning (CAP) have analogous scene-text versions that require reasoning from the text in the image.
We propose UniTNT, a Unified Text-Non-Text approach, which grants scene-text understanding capabilities to existing multimodal models.
We show that scene-text understanding capabilities can boost vision-language models' performance on general VQA and CAP by up to 2.69% and 0.6 CIDEr, respectively.
arXiv Detail & Related papers (2023-01-18T09:36:41Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.