LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image
Understanding
- URL: http://arxiv.org/abs/2306.17107v2
- Date: Fri, 2 Feb 2024 19:44:14 GMT
- Title: LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image
Understanding
- Authors: Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi
Yang, Tong Sun
- Abstract summary: This work enhances the current visual instruction tuning pipeline with text-rich images.
We first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset.
We prompt text-only GPT-4 with recognized texts and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images.
- Score: 85.39419609430453
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction tuning unlocks the superior capability of Large Language Models
(LLM) to interact with humans. Furthermore, recent instruction-following
datasets include images as visual inputs, collecting responses for image-based
instructions. However, visual instruction-tuned models cannot comprehend
textual details within images well. This work enhances the current visual
instruction tuning pipeline with text-rich images (e.g., movie posters, book
covers, etc.). Specifically, we first use publicly available OCR tools to
collect results on 422K text-rich images from the LAION dataset. Moreover, we
prompt text-only GPT-4 with recognized texts and image captions to generate 16K
conversations, each containing question-answer pairs for text-rich images. By
combining our collected data with previous multi-modal instruction-following
data, our model, LLaVAR, substantially improves the LLaVA model's capability on
text-based VQA datasets (up to 20% accuracy improvement) while achieving an
accuracy of 91.42% on ScienceQA. The GPT-4-based instruction-following
evaluation also demonstrates the improvement of our model on both natural
images and text-rich images. Through qualitative analysis, LLaVAR shows
promising interaction (e.g., reasoning, writing, and elaboration) skills with
humans based on the latest real-world online content that combines text and
images. We make our code/data/models publicly available at
https://llavar.github.io/.
Related papers
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text (IITC)
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devised a subtask, Image-Text Association (ITA)
arXiv Detail & Related papers (2024-06-14T17:59:40Z) - DOCCI: Descriptions of Connected and Contrasting Images [58.377060316967864]
Descriptions of Connected and Contrasting Images (DOCCI) is a dataset with long, human-annotated English descriptions for 15k images.
We instruct human annotators to create comprehensive descriptions for each image.
We show that DOCCI is a useful testbed for text-to-image generation.
arXiv Detail & Related papers (2024-04-30T17:56:24Z) - ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models [92.60282074937305]
We introduce ConTextual, a novel dataset featuring human-crafted instructions that require context-sensitive reasoning for text-rich images.
We conduct experiments to assess the performance of 14 foundation models and establish a human performance baseline.
We observe a significant performance gap of 30.8% between GPT-4V and human performance.
arXiv Detail & Related papers (2024-01-24T09:07:11Z) - Learning from Mistakes: Iterative Prompt Relabeling for Text-to-Image Diffusion Model Training [33.51524424536508]
Iterative Prompt Relabeling (IPR) is a novel algorithm that aligns images to text through iterative image sampling and prompt relabeling with feedback.
We conduct thorough experiments on SDv2 and SDXL, testing their capability to follow instructions on spatial relations.
arXiv Detail & Related papers (2023-12-23T11:10:43Z) - Language Quantized AutoEncoders: Towards Unsupervised Text-Image
Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z) - ContextCLIP: Contextual Alignment of Image-Text pairs on CLIP visual
representations [4.588028371034406]
We propose ContextCLIP, a contextual and contrastive learning framework for the contextual alignment of image-text pairs.
Our framework was observed to improve the image-text alignment by aligning text and image representations contextually in the joint embedding space.
ContextCLIP showed good qualitative performance for text-to-image retrieval tasks and enhanced classification accuracy.
arXiv Detail & Related papers (2022-11-14T05:17:51Z) - Language Matters: A Weakly Supervised Pre-training Approach for Scene
Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves F-score by +2.5% and +4.8% while transferring its weights to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.