What Large Language Models Bring to Text-rich VQA?
- URL: http://arxiv.org/abs/2311.07306v1
- Date: Mon, 13 Nov 2023 12:52:29 GMT
- Title: What Large Language Models Bring to Text-rich VQA?
- Authors: Xuejing Liu, Wei Tang, Xinzhe Ni, Jinghui Lu, Rui Zhao, Zechao Li and
Fei Tan
- Abstract summary: Text-rich VQA, namely Visual Question Answering based on text recognition in images, is a cross-modal task that requires both image comprehension and text recognition.
To investigate this, we leverage external OCR models to recognize text in the image and Large Language Models (LLMs) to answer the question given the recognized text.
This pipeline achieves superior performance compared to the majority of existing Multimodal Large Language Models (MLLMs) on four text-rich VQA datasets.
- Score: 38.569505870771025
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Text-rich VQA, namely Visual Question Answering based on text recognition in
images, is a cross-modal task that requires both image comprehension and
text recognition. In this work, we investigate the advantages and bottlenecks
of LLM-based approaches to this problem. To do so, we separate the vision and
language modules: we leverage external OCR models to recognize text in the
image and Large Language Models (LLMs) to answer the question given the
recognized text. The whole framework is training-free, benefiting from the
in-context ability of LLMs. This pipeline achieves superior performance
compared to the majority of existing Multimodal Large Language Models (MLLMs)
on four text-rich VQA datasets. Moreover, our ablation study shows that the
LLM brings stronger comprehension ability and may introduce helpful knowledge
for the VQA problem. The bottleneck for LLMs in addressing text-rich VQA
problems may primarily lie in the visual part. We also combine the OCR module
with MLLMs and find that this combination is effective as well. Notably, not
all MLLMs can comprehend the OCR information, which provides insights into how
to train an MLLM that preserves the abilities of the LLM.
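As a concrete illustration, the decoupled pipeline can be prototyped in a few lines. The sketch below is a minimal, hypothetical rendering of the idea: it assumes `pytesseract` as the external OCR model and a placeholder `llm_complete` wrapper around a chat-style LLM endpoint, neither of which the paper prescribes.

```python
# Minimal sketch of the training-free OCR + LLM pipeline described in the
# abstract. `pytesseract` stands in for the external OCR model, and
# `llm_complete` is a hypothetical wrapper around any LLM completion API.

from PIL import Image
import pytesseract


def llm_complete(prompt: str) -> str:
    """Hypothetical call into an LLM completion endpoint."""
    raise NotImplementedError("plug in your LLM client here")


# Few-shot exemplars exploit the LLM's in-context ability, so no
# component of the pipeline requires training.
IN_CONTEXT_EXAMPLES = (
    "OCR text: TOTAL $12.50  THANK YOU\n"
    "Question: What is the total amount on the receipt?\n"
    "Answer: $12.50\n\n"
)


def text_rich_vqa(image_path: str, question: str) -> str:
    # 1) Vision module: an external OCR model extracts the text.
    ocr_text = pytesseract.image_to_string(Image.open(image_path))

    # 2) Language module: the LLM answers given only the recognized text.
    prompt = (
        IN_CONTEXT_EXAMPLES
        + f"OCR text: {ocr_text.strip()}\n"
        + f"Question: {question}\n"
        + "Answer:"
    )
    return llm_complete(prompt)
```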
Related papers
- MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding.
It aims to localize instances of interest across multiple images based on open-ended text prompts.
We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z)
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
- MATE: Meet At The Embedding -- Connecting Images with Long Texts [37.27283238166393]
Meet At The Embedding (MATE) is a novel approach that combines the capabilities of Large Language Models (LLMs) with Vision Language Models (VLMs).
We replace the text encoder of the VLM with a pretrained LLM-based encoder that excels in understanding long texts.
We propose two new cross-modal retrieval benchmarks to assess the task of connecting images with long texts.
arXiv Detail & Related papers (2024-06-26T14:10:00Z)
- Video Understanding with Large Language Models: A Survey [97.29126722004949]
Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding.
The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity reasoning.
This survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs.
arXiv Detail & Related papers (2023-12-29T01:56:17Z)
- MLLMs-Augmented Visual-Language Representation Learning [70.5293060238008]
We demonstrate that Multi-modal Large Language Models (MLLMs) can enhance visual-language representation learning.
Our approach is simple, utilizing MLLMs to generate multiple diverse captions for each image.
We propose "text shearing" to maintain the quality and availability of extended captions.
arXiv Detail & Related papers (2023-11-30T18:05:52Z)
- Filling the Image Information Gap for VQA: Prompting Large Language Models to Proactively Ask Questions [15.262736501208467]
Large Language Models (LLMs) demonstrate impressive reasoning ability and the maintenance of world knowledge.
As images are invisible to LLMs, researchers convert images to text to engage LLMs in the visual question reasoning procedure.
We design a framework that enables LLMs to proactively ask relevant questions to unveil more details in the image.
arXiv Detail & Related papers (2023-11-20T08:23:39Z)
- MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning [42.68425777473114]
Vision-language models (VLMs) enhanced by large language models (LLMs) have grown exponentially in popularity.
We introduce the Vision-language Model with Multi-Modal In-Context Learning (MMICL), a new approach that allows VLMs to deal with multi-modal inputs efficiently.
Our experiments confirm that MMICL achieves new state-of-the-art zero-shot performance on a wide range of general vision-language tasks.
arXiv Detail & Related papers (2023-09-14T17:59:17Z)
- LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models [21.95962189710859]
We propose a lightweight, end-to-end framework to execute the Spoken Question Answering (SQA) task on the LibriSQA dataset.
By reframing ASR into the SQA format, we further substantiate our framework's capability in handling ASR tasks.
Our empirical findings bolster the LLMs' aptitude for aligning and comprehending multimodal information, paving the way for the development of universal multimodal LLMs.
arXiv Detail & Related papers (2023-08-20T23:47:23Z)
- From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models [111.42052290293965]
Large language models (LLMs) have demonstrated excellent zero-shot generalization to new language tasks.
End-to-end training on vision and language data may bridge the disconnections, but is inflexible and computationally expensive.
We propose Img2Prompt, a plug-and-play module that provides prompts that can bridge the aforementioned modality and task disconnections.
arXiv Detail & Related papers (2022-12-21T08:39:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.