SnapNTell: Enhancing Entity-Centric Visual Question Answering with
Retrieval Augmented Multimodal LLM
- URL: http://arxiv.org/abs/2403.04735v1
- Date: Thu, 7 Mar 2024 18:38:17 GMT
- Title: SnapNTell: Enhancing Entity-Centric Visual Question Answering with
Retrieval Augmented Multimodal LLM
- Authors: Jielin Qiu, Andrea Madotto, Zhaojiang Lin, Paul A. Crook, Yifan Ethan
Xu, Xin Luna Dong, Christos Faloutsos, Lei Li, Babak Damavandi, Seungwhan
Moon
- Abstract summary: This work introduces a novel evaluative benchmark named SnapNTell, specifically tailored for entity-centric VQA.
The dataset is organized into 22 major categories, containing 7,568 unique entities in total.
Our approach markedly outperforms existing methods on the SnapNTell dataset, achieving a 66.5% improvement in the BLEURT score.
- Score: 48.15067480282839
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-extended LLMs (VLLMs) have made significant strides in Visual Question
Answering (VQA). Despite these advancements, VLLMs still encounter substantial
difficulties in handling queries involving long-tail entities, with a tendency
to produce erroneous or hallucinated responses. In this work, we introduce a
novel evaluative benchmark named \textbf{SnapNTell}, specifically tailored for
entity-centric VQA. This task aims to test the models' capabilities in
identifying entities and providing detailed, entity-specific knowledge. We have
developed the \textbf{SnapNTell Dataset}, distinct from traditional VQA
datasets: (1) It encompasses a wide range of categorized entities, each
represented by images and explicitly named in the answers; (2) It features QA
pairs that require extensive knowledge for accurate responses. The dataset is
organized into 22 major categories, containing 7,568 unique entities in total.
For each entity, we curated 10 illustrative images and crafted 10
knowledge-intensive QA pairs. To address this novel task, we devised a
scalable, efficient, and transparent retrieval-augmented multimodal LLM. Our
approach markedly outperforms existing methods on the SnapNTell dataset,
achieving a 66.5\% improvement in the BLEURT score. We will soon make the
dataset and the source code publicly accessible.
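The abstract describes the approach only at a high level: recognize the long-tail entity in the image, retrieve entity-specific knowledge, and generate an answer grounded in that knowledge. The sketch below is a minimal illustration of such an entity-centric retrieval-augmented VQA loop, not the paper's released code; every name in it (SnapNTellExample, KnowledgeStore, recognize_entity, answer_with_rag, the llm.generate interface) is an assumption made for illustration.

```python
from dataclasses import dataclass

# All class/function names below are hypothetical placeholders; the paper's
# actual implementation and APIs have not been released yet.

@dataclass
class SnapNTellExample:
    """One SnapNTell-style item: an entity image plus a knowledge-intensive question.
    (The dataset pairs each of its 7,568 entities with 10 images and 10 QA pairs.)"""
    image_path: str
    question: str

class KnowledgeStore:
    """Toy entity knowledge base: text snippets keyed by entity name."""
    def __init__(self, entries: dict[str, list[str]]):
        self.entries = entries

    def retrieve(self, entity: str, k: int = 3) -> list[str]:
        # Return up to k knowledge snippets for the recognized entity.
        return self.entries.get(entity, [])[:k]

def recognize_entity(image_path: str) -> str:
    """Stub for a visual entity recognizer (e.g., image-to-entity retrieval)."""
    raise NotImplementedError("Plug in an entity-recognition model here.")

def answer_with_rag(example: SnapNTellExample, store: KnowledgeStore, llm) -> str:
    """Entity-centric RAG: recognize the entity, retrieve facts, generate an answer."""
    entity = recognize_entity(example.image_path)
    snippets = store.retrieve(entity)
    prompt = (
        f"Entity shown in the image: {entity}\n"
        "Relevant facts:\n"
        + "\n".join(f"- {s}" for s in snippets)
        + f"\nQuestion: {example.question}\n"
        "Answer, naming the entity explicitly:"
    )
    return llm.generate(prompt)  # `llm` is any text generator exposing .generate()
```

The retrieval step is what makes the pipeline "transparent" in the sense the abstract suggests: the retrieved snippets show which knowledge the answer was conditioned on.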
Related papers
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- Distill Visual Chart Reasoning Ability from LLMs to MLLMs [38.62832112530892]
Solving complex chart Q&A tasks requires advanced visual reasoning abilities in multimodal large language models (MLLMs).
We propose Code-as-Intermediary Translation (CIT), a cost-effective, efficient and easily scalable data synthesis method for distilling visual reasoning abilities from LLMs to MLLMs.
We employ text-based synthesizing techniques to construct chart-plotting code and produce ReachQA, a dataset containing 3k reasoning-intensive charts and 20k Q&A pairs.
arXiv Detail & Related papers (2024-10-24T14:50:42Z)
- IDEAL: Leveraging Infinite and Dynamic Characterizations of Large Language Models for Query-focused Summarization [59.06663981902496]
Query-focused summarization (QFS) aims to produce summaries that answer particular questions of interest, enabling greater user control and personalization.
We investigate two indispensable characteristics that LLM-based QFS models should harness: Lengthy Document Summarization and Efficient Fine-grained Query-LLM Alignment.
These innovations pave the way for broader application and accessibility in the field of QFS technology.
arXiv Detail & Related papers (2024-07-15T07:14:56Z)
- Automatic Question-Answer Generation for Long-Tail Knowledge [65.11554185687258]
We propose an automatic approach to generate specialized QA datasets for tail entities.
We conduct extensive experiments by employing pretrained LLMs on our newly generated long-tail QA datasets.
arXiv Detail & Related papers (2024-03-03T03:06:31Z)
- SPARQL Generation: an analysis on fine-tuning OpenLLaMA for Question Answering over a Life Science Knowledge Graph [0.0]
We evaluate strategies for fine-tuning the OpenLlama LLM for question answering over life science knowledge graphs.
We propose an end-to-end data augmentation approach for extending a set of existing queries over a given knowledge graph.
We also investigate the role of semantic "clues" in the queries, such as meaningful variable names and inline comments.
arXiv Detail & Related papers (2024-02-07T07:24:01Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models [21.95962189710859]
We propose a lightweight, end-to-end framework to execute the Spoken Question Answering (SQA) task on the LibriSQA dataset.
By reformulating ASR into the SQA format, we further substantiate our framework's capability in handling ASR tasks.
Our empirical findings demonstrate the LLMs' aptitude for aligning and comprehending multimodal information, paving the way for the development of universal multimodal LLMs.
arXiv Detail & Related papers (2023-08-20T23:47:23Z)
- Enhancing In-Context Learning with Answer Feedback for Multi-Span Question Answering [9.158919909909146]
In this paper, we propose a novel way of employing labeled data such that it informs the LLM of some undesired outputs; a rough illustrative sketch of such a prompt appears after this list.
Experiments on three multi-span question answering datasets and a keyphrase extraction dataset show that our new prompting strategy consistently improves LLM's in-context learning performance.
arXiv Detail & Related papers (2023-06-07T15:20:24Z)
- End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries.
We introduce a retriever model "ReViz" that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
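The "Enhancing In-Context Learning with Answer Feedback" entry above is summarized in only a single sentence. As a rough, hypothetical illustration of what an answer-feedback prompt could look like (the helper name, prompt wording, and toy data below are assumptions, not that paper's actual format):

```python
def build_feedback_prompt(demos: list[dict], query_context: str, query_question: str) -> str:
    """Assemble an in-context prompt whose demonstrations include both a correct
    answer and an explicitly flagged undesired answer (the 'answer feedback')."""
    parts = []
    for d in demos:
        parts.append(
            f"Context: {d['context']}\n"
            f"Question: {d['question']}\n"
            f"Undesired answer (do not imitate): {d['bad_answer']}\n"
            f"Correct answer: {d['good_answer']}\n"
        )
    parts.append(f"Context: {query_context}\nQuestion: {query_question}\nCorrect answer:")
    return "\n".join(parts)

# Toy usage: a multi-span demonstration where the undesired answer misses one span.
demos = [{
    "context": "Marie Curie won Nobel Prizes in Physics (1903) and Chemistry (1911).",
    "question": "In which fields did Marie Curie win Nobel Prizes?",
    "bad_answer": "Physics",               # incomplete: only one of the two spans
    "good_answer": "Physics; Chemistry",   # both spans, the desired behaviour
}]
print(build_feedback_prompt(
    demos,
    query_context="The Amazon flows through Peru, Colombia, and Brazil.",
    query_question="Which countries does the Amazon flow through?",
))
```

The key design choice such a strategy explores is showing the model what not to produce, rather than only positive demonstrations.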