BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual
Questions
- URL: http://arxiv.org/abs/2308.09936v3
- Date: Mon, 18 Dec 2023 04:33:17 GMT
- Title: BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual
Questions
- Authors: Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, Zhuowen Tu
- Abstract summary: Vision Language Models (VLMs) cannot accurately interpret images infused with text.
The present study introduces BLIVA: an augmented version of InstructBLIP with Visual Assistant.
Our model significantly enhances performance on text-rich VQA benchmarks as well as on general (not particularly text-rich) VQA benchmarks.
- Score: 41.825273034537204
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Language Models (VLMs), which extend Large Language Models (LLM) by
incorporating visual understanding capability, have demonstrated significant
advancements in addressing open-ended visual question-answering (VQA) tasks.
However, these models cannot accurately interpret images infused with text, a
common occurrence in real-world scenarios. Standard procedures for extracting
information from images often involve learning a fixed set of query embeddings.
These embeddings are designed to encapsulate image contexts and are later used
as soft prompt inputs in LLMs. Yet, this process is limited by the fixed
number of query tokens, potentially curtailing the recognition of scenes with
text-rich context. To address this limitation, the present study introduces
BLIVA: an augmented version of InstructBLIP with Visual Assistant. BLIVA
incorporates the query embeddings from InstructBLIP and also directly projects
encoded patch embeddings into the LLM, a technique inspired by LLaVA. This
approach helps the model capture intricate details potentially missed during
the query decoding process. Empirical evidence demonstrates that our model,
BLIVA, significantly enhances performance on text-rich VQA benchmarks (up to
17.76% on the OCR-VQA benchmark) and on general (not particularly text-rich)
VQA benchmarks (up to 7.9% on the Visual Spatial Reasoning benchmark), and
achieves a 17.72% overall improvement on a comprehensive multimodal LLM
benchmark (MME) compared to our baseline InstructBLIP. BLIVA demonstrates
significant
capability in decoding real-world images, irrespective of text presence. To
demonstrate the broad industry applications enabled by BLIVA, we evaluate the
model using a new dataset comprising YouTube thumbnails paired with
question-answer sets across 11 diverse categories. Our code and models are
freely accessible at https://github.com/mlpc-ucsd/BLIVA.
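As a rough illustration of the architecture the abstract describes, the sketch below combines the two visual branches, learned query embeddings and directly projected patch embeddings, into a single soft prompt. It is a minimal sketch under stated assumptions, not the released implementation (see the repository above); the module names, dimensions, and the simplified cross-attention stub standing in for the Q-Former are all illustrative.

```python
# Minimal sketch of the two-branch visual prompt described in the abstract.
# Module names and dimensions are illustrative, not the released BLIVA code.
import torch
import torch.nn as nn

class TwoBranchVisualPrompt(nn.Module):
    def __init__(self, vision_dim=1408, qformer_dim=768, llm_dim=4096,
                 num_query_tokens=32):
        super().__init__()
        # Branch 1: a fixed set of learned query embeddings (the InstructBLIP
        # Q-Former is abstracted here as a single cross-attention stub).
        self.query_tokens = nn.Parameter(torch.zeros(1, num_query_tokens, qformer_dim))
        self.qformer = nn.MultiheadAttention(qformer_dim, num_heads=8, batch_first=True)
        self.vision_to_qformer = nn.Linear(vision_dim, qformer_dim)
        self.query_proj = nn.Linear(qformer_dim, llm_dim)
        # Branch 2: LLaVA-style direct projection of every encoded patch.
        self.patch_proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_embeds):
        # patch_embeds: (batch, num_patches, vision_dim) from a frozen vision encoder.
        kv = self.vision_to_qformer(patch_embeds)
        q = self.query_tokens.expand(patch_embeds.size(0), -1, -1)
        query_out, _ = self.qformer(q, kv, kv)           # (B, 32, qformer_dim)
        query_prompt = self.query_proj(query_out)        # (B, 32, llm_dim)
        patch_prompt = self.patch_proj(patch_embeds)     # (B, num_patches, llm_dim)
        # Concatenate both branches as soft prompts for the LLM.
        return torch.cat([query_prompt, patch_prompt], dim=1)
```

In a full system the concatenated prompt would be prepended to the embedded question tokens before the LLM forward pass, which is how the patch branch can supply details missed during query decoding.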
Related papers
- FTII-Bench: A Comprehensive Multimodal Benchmark for Flow Text with Image Insertion [7.322448493179106]
The Flow Text with Image Insertion task requires LVLMs to simultaneously possess outstanding abilities in image comprehension, instruction understanding, and long-text interpretation.
We introduce the Flow Text with Image Insertion Benchmark (FTII-Bench), which includes 318 high-quality Chinese image-text news articles and 307 high-quality English image-text news articles, covering 10 different news domains.
We evaluate 9 open-source and 2 closed-source LVLMs as well as 2 CLIP-based models.
arXiv Detail & Related papers (2024-10-16T13:38:31Z)
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
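To make the general idea concrete (this is only one plausible reading, not the cited paper's actual method), a specialist model's spatially dense output such as a segmentation mask can be rendered onto the image itself so the MLLM receives it visually rather than as text; the function name, drawing style, and interfaces in the sketch below are assumptions.

```python
# Rough sketch of the general idea only: render pixel-level outputs from a
# specialist vision model onto the image so an MLLM can "read" spatially
# dense information that is hard to convey in text.
from PIL import Image, ImageDraw

def overlay_masks(image_path, polygons, labels):
    """polygons: list of [(x, y), ...] contours from, e.g., a segmentation model."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for contour, label in zip(polygons, labels):
        draw.line(contour + [contour[0]], fill=(255, 0, 0), width=3)  # closed outline
        draw.text(contour[0], label, fill=(255, 0, 0))                # region label
    return img  # feed this annotated image to the MLLM instead of the raw one
```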
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
- Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models [10.41857522464292]
We introduce the MultiModal Needle-in-a-haystack (MMNeedle) benchmark to assess the long-context capabilities of MLLMs.
We employ image stitching to further increase the input context length, and develop a protocol to automatically generate labels for sub-image level retrieval.
We evaluate state-of-the-art MLLMs, encompassing both API-based and open-source models.
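A minimal sketch of what grid-style stitching with automatically generated sub-image labels could look like follows; the tile size, grid layout, and label format are assumptions rather than the benchmark's official generation code.

```python
# Illustrative sketch: stitch sub-images into a grid to enlarge the visual
# context, and record the needle's (row, col) as an automatic retrieval label.
from PIL import Image

def stitch_grid(images, rows, cols, needle_index, tile=224):
    """Stitch rows*cols images onto one canvas; label = needle's grid position."""
    canvas = Image.new("RGB", (cols * tile, rows * tile))
    for i, img in enumerate(images[: rows * cols]):
        r, c = divmod(i, cols)
        canvas.paste(img.resize((tile, tile)), (c * tile, r * tile))
    needle_row, needle_col = divmod(needle_index, cols)
    return canvas, {"row": needle_row, "col": needle_col}
```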
arXiv Detail & Related papers (2024-06-17T05:54:06Z)
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
- SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension [62.40482764691584]
We introduce SEED-Bench-2-Plus, a benchmark specifically designed for evaluating text-rich visual comprehension of MLLMs.
Our benchmark comprises 2.3K multiple-choice questions with precise human annotations, spanning three broad categories: Charts, Maps, and Webs.
We conduct a thorough evaluation involving 34 prominent MLLMs and emphasize the current limitations of MLLMs in text-rich visual comprehension.
arXiv Detail & Related papers (2024-04-25T17:39:35Z)
- CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z)
- Filling the Image Information Gap for VQA: Prompting Large Language Models to Proactively Ask Questions [15.262736501208467]
Large Language Models (LLMs) demonstrate impressive reasoning ability and maintain extensive world knowledge.
As images are invisible to LLMs, researchers convert images to text to engage LLMs in the visual question reasoning procedure.
We design a framework that enables LLMs to proactively ask relevant questions to unveil more details in the image.
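One plausible wiring of such an ask-then-answer loop is sketched below with abstract callables; the interfaces llm(prompt) and vqa_model(question), the turn limit, and the prompt wording are illustrative assumptions, not the cited framework.

```python
# One plausible wiring of an "ask-then-answer" loop, sketched with abstract
# callables; the actual framework in the cited paper may differ.
def answer_with_proactive_questions(caption, question, llm, vqa_model, max_turns=3):
    """llm(prompt) -> str and vqa_model(question) -> str are assumed interfaces."""
    context = f"Image caption: {caption}\nQuestion: {question}\n"
    for _ in range(max_turns):
        follow_up = llm(context + "If information is missing, ask ONE question "
                                  "about the image; otherwise reply 'ENOUGH'.")
        if follow_up.strip() == "ENOUGH":
            break
        # A vision model answers the LLM's question, filling the information gap.
        context += f"Q: {follow_up}\nA: {vqa_model(follow_up)}\n"
    return llm(context + "Now answer the original question.")
```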
arXiv Detail & Related papers (2023-11-20T08:23:39Z)
- Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs [49.88461345825586]
This paper proposes a new framework to enhance the fine-grained image understanding abilities of MLLMs.
We present a new method for constructing the instruction tuning dataset at a low cost by leveraging annotations in existing datasets.
We show that our model exhibits a 5.2% accuracy improvement over Qwen-VL and surpasses the accuracy of Kosmos-2 by 24.7%.
arXiv Detail & Related papers (2023-10-01T05:53:15Z)