SparrowVQE: Visual Question Explanation for Course Content Understanding
- URL: http://arxiv.org/abs/2411.07516v1
- Date: Tue, 12 Nov 2024 03:25:33 GMT
- Title: SparrowVQE: Visual Question Explanation for Course Content Understanding
- Authors: Jialu Li, Manish Kumar Thota, Ruslan Gokhman, Radek Holik, Youshan Zhang
- Abstract summary: This paper introduces Visual Question Explanation (VQE), which enhances the ability of VQA to provide detailed explanations.
We trained our model with a three-stage training mechanism consisting of multimodal pre-training, instruction tuning, and domain fine-tuning.
Experimental results demonstrate that our SparrowVQE achieves better performance on our MLVQE dataset and outperforms state-of-the-art methods on five other benchmark VQA datasets.
- Score: 12.926309478839652
- License:
- Abstract: Visual Question Answering (VQA) research seeks to create AI systems that answer natural language questions about images, yet VQA methods often yield overly simplistic and short answers. This paper aims to advance the field by introducing Visual Question Explanation (VQE), which enhances the ability of VQA to provide detailed explanations rather than brief responses and addresses the need for more complex interaction with visual content. We first created the MLVQE dataset from a 14-week streamed video machine learning course, including 885 slide images, 110,407 words of transcripts, and 9,416 designed question-answer (QA) pairs. Next, we proposed SparrowVQE, a novel small multimodal model with 3 billion parameters. We trained our model with a three-stage training mechanism consisting of multimodal pre-training (slide image and transcript feature alignment), instruction tuning (tuning the pre-trained model with transcripts and QA pairs), and domain fine-tuning (fine-tuning with slide images and QA pairs). The resulting SparrowVQE understands and connects visual information encoded by the SigLIP model with transcripts processed by the Phi-2 language model through an MLP adapter. Experimental results demonstrate that SparrowVQE achieves better performance on our MLVQE dataset and outperforms state-of-the-art methods on five other benchmark VQA datasets. The source code is available at \url{https://github.com/YoushanZhang/SparrowVQE}.
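The abstract names SigLIP as the vision encoder, Phi-2 as the language model, and an MLP adapter between them. Below is a minimal PyTorch sketch of how such a pairing can be wired together; the checkpoint names (`google/siglip-base-patch16-224`, `microsoft/phi-2`), the two-layer adapter, and the token-concatenation scheme are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, SiglipVisionModel


class MLPAdapter(nn.Module):
    """Two-layer MLP that maps vision features into the LM embedding space."""

    def __init__(self, vision_dim: int, lm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(vision_feats)


class TinyVQEModel(nn.Module):
    """Sketch of a SigLIP-encoder + MLP-adapter + Phi-2 decoder pipeline."""

    def __init__(self):
        super().__init__()
        self.vision = SiglipVisionModel.from_pretrained("google/siglip-base-patch16-224")
        self.lm = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
        self.adapter = MLPAdapter(
            self.vision.config.hidden_size,  # vision hidden size (768 for base SigLIP)
            self.lm.config.hidden_size,      # Phi-2 hidden size (2560)
        )

    def forward(self, pixel_values, input_ids, attention_mask):
        # Encode the slide image and project its patch tokens into the LM space.
        patch_tokens = self.vision(pixel_values=pixel_values).last_hidden_state
        visual_embeds = self.adapter(patch_tokens)

        # Prepend visual tokens to the text (transcript / question) embeddings,
        # so the LM attends to both when generating an explanation.
        text_embeds = self.lm.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([visual_embeds, text_embeds], dim=1)
        visual_mask = torch.ones(
            visual_embeds.shape[:2], dtype=attention_mask.dtype, device=attention_mask.device
        )
        full_mask = torch.cat([visual_mask, attention_mask], dim=1)
        return self.lm(inputs_embeds=inputs_embeds, attention_mask=full_mask)
```

In a three-stage recipe like the one described above, a common choice is to freeze the vision encoder and LM during pre-training and update only the adapter, then unfreeze (parts of) the LM for instruction tuning and domain fine-tuning; the exact freezing schedule here is an assumption rather than the paper's stated procedure.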
Related papers
- SimpsonsVQA: Enhancing Inquiry-Based Learning with a Tailored Dataset [11.729464930866483]
"SimpsonsVQA" is a novel dataset for VQA derived from The Simpsons TV show.
It is designed not only to address the traditional VQA task but also to identify irrelevant questions related to images.
SimpsonsVQA contains approximately 23K images, 166K QA pairs, and 500K judgments.
arXiv Detail & Related papers (2024-10-30T02:30:40Z) - Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering [7.429984955853609]
We present Q-ViD, a simple approach for video question answering (video QA).
Q-ViD relies on a single instruction-aware open vision-language model (InstructBLIP) to tackle videoQA using frame descriptions.
arXiv Detail & Related papers (2024-02-16T13:59:07Z) - Multiple-Question Multiple-Answer Text-VQA [19.228969692887603]
Multiple-Question Multiple-Answer (MQMA) is a novel approach to text-VQA with encoder-decoder transformer models.
MQMA takes multiple questions and content as input at the encoder and predicts multiple answers at the decoder in an auto-regressive manner.
We propose a novel MQMA denoising pre-training task which is designed to teach the model to align and delineate multiple questions and content with associated answers.
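As a rough illustration of the multiple-question, multiple-answer idea, the snippet below packs several questions and the document content into a single encoder input of a generic encoder-decoder model (T5); the delimiter format and checkpoint are illustrative assumptions rather than the paper's exact setup.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

document = "Total: $42.50  Date: 2023-04-01"  # e.g. OCR'd text from the image
questions = ["What is the total?", "What is the date?"]

# Pack all questions plus the content into one encoder input.
encoder_input = " ".join(f"question {i}: {q}" for i, q in enumerate(questions))
encoder_input += f" context: {document}"

inputs = tokenizer(encoder_input, return_tensors="pt")
# An MQMA-style model is trained so that the decoder emits all answers
# auto-regressively, e.g. "answer 0: $42.50 answer 1: 2023-04-01".
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```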
arXiv Detail & Related papers (2023-11-15T01:00:02Z) - Zero-shot Translation of Attention Patterns in VQA Models to Natural Language [65.94419474119162]
ZS-A2T is a framework that translates the transformer attention of a given model into natural language without requiring any training.
We consider this in the context of Visual Question Answering (VQA).
Our framework does not require any training and allows the drop-in replacement of different guiding sources.
arXiv Detail & Related papers (2023-11-08T22:18:53Z) - Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe yields a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, and gains of 6.41 and 7.94 percentage points on A-OKVQA and VizWiz, respectively.
arXiv Detail & Related papers (2023-10-09T16:57:57Z) - Modular Visual Question Answering via Code Generation [134.59005611826777]
We present a framework that formulates visual question answering as modular code generation.
Our approach requires no additional training and relies on pre-trained language models (LMs), visual models pre-trained on image-caption pairs, and fifty VQA examples used for in-context learning.
Our approach improves accuracy on the COVR dataset by at least 3% and on the GQA dataset by roughly 2% compared to the few-shot baseline that does not employ code generation.
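To make the code-generation formulation concrete, here is a toy sketch in which a "generated" program over simple visual primitives is executed to answer a counting question; the primitive names (`find`, `count`) and the hard-coded program string are stand-ins for what a pre-trained code LM would produce from the question and in-context examples.

```python
def find(image, category: str):
    """Stub visual primitive: return bounding boxes for `category`.
    A real system would call a pre-trained detector here."""
    return [(10, 10, 50, 50)] if category == "dog" else []


def count(boxes) -> int:
    """Stub visual primitive: count detected boxes."""
    return len(boxes)


# In the modular-code-generation setting, this string would be written by a
# pre-trained language model prompted with the question and ~50 in-context examples.
generated_program = """
boxes = find(image, "dog")
answer = count(boxes)
"""

namespace = {"find": find, "count": count, "image": object()}
exec(generated_program, namespace)
print(namespace["answer"])  # -> 1
```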
arXiv Detail & Related papers (2023-06-08T17:45:14Z) - NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario [77.14723238359318]
NuScenesQA is the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs.
We leverage existing 3D detection annotations to generate scene graphs and design question templates manually.
We develop a series of baselines that employ advanced 3D detection and VQA techniques.
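The summary above describes building QA pairs from 3D detection annotations via scene graphs and question templates; the toy sketch below shows that pattern with a hypothetical scene-graph layout and two made-up templates, not the benchmark's actual schema.

```python
# Hypothetical scene graph derived from 3D detection annotations.
scene_graph = {
    "objects": [
        {"category": "car", "moving": True},
        {"category": "car", "moving": False},
        {"category": "pedestrian", "moving": True},
    ]
}

# Question templates paired with functions that compute the ground-truth answer.
templates = [
    ("How many {cat}s are there?",
     lambda g, cat: sum(o["category"] == cat for o in g["objects"])),
    ("Are there any moving {cat}s?",
     lambda g, cat: any(o["category"] == cat and o["moving"] for o in g["objects"])),
]

qa_pairs = []
for cat in ("car", "pedestrian"):
    for question, answer_fn in templates:
        qa_pairs.append((question.format(cat=cat), answer_fn(scene_graph, cat)))

for q, a in qa_pairs:
    print(q, "->", a)
```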
arXiv Detail & Related papers (2023-05-24T07:40:50Z) - Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training [82.30343537942608]
We propose Plug-and-Play VQA (PNP-VQA), a modular framework for zero-shot VQA.
We first generate question-guided informative image captions, and pass the captions to a PLM as context for question answering.
PNP-VQA achieves state-of-the-art results on zero-shot VQAv2 and GQA.
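A minimal caption-then-read pipeline in the spirit of PNP-VQA can be assembled from off-the-shelf models with no training; the specific checkpoints below (a BLIP captioner and a FLAN-T5 reader) and the single generic caption are illustrative simplifications of the question-guided captioning the paper describes.

```python
from PIL import Image
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
reader = pipeline("text2text-generation", model="google/flan-t5-base")

image = Image.open("example.jpg")
question = "What is the person holding?"

# Step 1: caption the image; PNP-VQA generates many question-guided captions,
# while this sketch settles for a single generic one.
captions = [c["generated_text"] for c in captioner(image)]

# Step 2: a pretrained LM answers from the captions alone, with zero training.
prompt = f"Context: {' '.join(captions)}\nQuestion: {question}\nAnswer:"
print(reader(prompt, max_new_tokens=16)[0]["generated_text"])
```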
arXiv Detail & Related papers (2022-10-17T06:29:54Z) - Video Question Answering with Iterative Video-Text Co-Tokenization [77.66445727743508]
We propose a novel multi-stream video encoder for video question answering.
We experimentally evaluate the model on several datasets, including MSRVTT-QA, MSVD-QA, and IVQA.
Our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.
arXiv Detail & Related papers (2022-08-01T15:35:38Z) - From Easy to Hard: Learning Language-guided Curriculum for Visual Question Answering on Remote Sensing Data [27.160303686163164]
Visual question answering (VQA) for remote sensing scenes has great potential in intelligent human-computer interaction systems.
No object annotations are available in RSVQA datasets, which makes it difficult for models to exploit informative region representations.
There are questions with clearly different difficulty levels for each image in the RSVQA task.
A multi-level visual feature learning method is proposed to jointly extract language-guided holistic and regional image features.
arXiv Detail & Related papers (2022-05-06T11:37:00Z) - Self-Supervised VQA: Answering Visual Questions using Images and Captions [38.05223339919346]
VQA models assume the availability of datasets with human-annotated Image-Question-Answer (I-Q-A) triplets for training.
We study whether models can be trained without any human-annotated Q-A pairs, but only with images and associated text captions.
arXiv Detail & Related papers (2020-12-04T01:22:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.