Related papers: Analysis on Image Set Visual Question Answering

Analysis on Image Set Visual Question Answering

URL: http://arxiv.org/abs/2104.00107v1
Date: Wed, 31 Mar 2021 20:47:32 GMT
Title: Analysis on Image Set Visual Question Answering
Authors: Abhinav Khattar, Aviral Joshi, Har Simrat Singh, Pulkit Goel, Rohit Prakash Barnwal
Abstract summary: We tackle the challenge of Visual Question Answering in multi-image setting. Traditional VQA tasks have focused on a single-image setting where the target answer is generated from a single image. In this report, we work with 4 approaches in a bid to improve the performance on the task.
Score: 0.3359875577705538
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We tackle the challenge of Visual Question Answering in multi-image setting for the ISVQA dataset. Traditional VQA tasks have focused on a single-image setting where the target answer is generated from a single image. Image set VQA, however, comprises of a set of images and requires finding connection between images, relate the objects across images based on these connections and generate a unified answer. In this report, we work with 4 approaches in a bid to improve the performance on the task. We analyse and compare our results with three baseline models - LXMERT, HME-VideoQA and VisualBERT - and show that our approaches can provide a slight improvement over the baselines. In specific, we try to improve on the spatial awareness of the model and help the model identify color using enhanced pre-training, reduce language dependence using adversarial regularization, and improve counting using regression loss and graph based deduplication. We further delve into an in-depth analysis on the language bias in the ISVQA dataset and show how models trained on ISVQA implicitly learn to associate language more strongly with the final answer.

Related papers

RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning [63.599057862999]
RefChartQA is a novel benchmark that integrates Chart Question Answering (ChartQA) with visual grounding. Our experiments demonstrate that incorporating spatial awareness via grounding improves response accuracy by over 15%.
arXiv Detail & Related papers (2025-03-29T15:50:08Z)
Multimodal Hypothetical Summary for Retrieval-based Multi-image Question Answering [14.63910474388089]
"Retrieve-then-answer" pipelines often suffer from cascading errors because the training objective of QA fails to optimize the retrieval stage. We propose a novel method to effectively introduce and reference retrieved information into the QA. Our approach achieves a 3.7% absolute improvement over state-of-the-art methods on RETVQA and a 14.5% improvement over CLIP.
arXiv Detail & Related papers (2024-12-19T14:17:09Z)
Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image (STIC), which emphasizes a self-training approach specifically for image comprehension. First, the model self-constructs a preference for image descriptions using unlabeled images. To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval [92.13664084464514]
The task of composed image retrieval (CIR) aims to retrieve images based on the query image and the text describing the users' intent. Existing methods have made great progress with the advanced large vision-language (VL) model in CIR task, however, they generally suffer from two main issues: lack of labeled triplets for model training and difficulty of deployment on resource-restricted environments. We propose Image2Sentence based Asymmetric zero-shot composed image retrieval (ISA), which takes advantage of the VL model and only relies on unlabeled images for composition learning.
arXiv Detail & Related papers (2024-03-03T07:58:03Z)
VQAttack: Transferable Adversarial Attacks on Visual Question Answering via Pre-trained Models [58.21452697997078]
We propose a novel VQAttack model, which can generate both image and text perturbations with the designed modules. Experimental results on two VQA datasets with five validated models demonstrate the effectiveness of the proposed VQAttack.
arXiv Detail & Related papers (2024-02-16T21:17:42Z)
Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model. We show that RepARe can result in a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, 6.41%, and 7.94% points increase on A-OKVQA, and VizWiz respectively.
arXiv Detail & Related papers (2023-10-09T16:57:57Z)
Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering [7.640416680391081]
In this paper, we explore effective prompting techniques to enhance zero- and few-shot Visual Question Answering (VQA) performance. We identify that specific templates significantly influence VQA outcomes, underscoring the need for strategic template selection. To mitigate the challenges associated with evaluating free-form open-ended VQA responses, we introduce a straightforward LLM-guided pre-processing technique.
arXiv Detail & Related papers (2023-06-16T17:47:57Z)
Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images! [103.09776737512077]
SelTDA (Self-Taught Data Augmentation) is a strategy for finetuning large vision language models on small-scale VQA datasets. It generates question-answer pseudolabels directly conditioned on an image, allowing us to pseudolabel unlabeled images. We describe a series of experiments showing that our self-taught data augmentation increases robustness to adversarially searched questions.
arXiv Detail & Related papers (2023-06-06T18:00:47Z)
Multilingual Augmentation for Robust Visual Question Answering in Remote Sensing Images [19.99615698375829]
We propose a contrastive learning strategy for training robust RSVQA models against diverse question templates and words. Experimental results demonstrate that the proposed augmented dataset is effective in improving the robustness of the RSVQA model.
arXiv Detail & Related papers (2023-04-07T21:06:58Z)
COIN: Counterfactual Image Generation for VQA Interpretation [5.994412766684842]
We introduce an interpretability approach for VQA models by generating counterfactual images. In addition to interpreting the result of VQA models on single images, the obtained results and the discussion provides an extensive explanation of VQA models' behaviour.
arXiv Detail & Related papers (2022-01-10T13:51:35Z)
How to find a good image-text embedding for remote sensing visual question answering? [41.0510495281302]
Visual question answering (VQA) has been introduced to remote sensing to make information extraction from overhead imagery more accessible to everyone. We study three different fusion methodologies in the context of VQA for remote sensing and analyse the gains in accuracy with respect to the model complexity.
arXiv Detail & Related papers (2021-09-24T09:48:28Z)
Learning from Lexical Perturbations for Consistent Visual Question Answering [78.21912474223926]
Existing Visual Question Answering (VQA) models are often fragile and sensitive to input variations. We propose a novel approach to address this issue based on modular networks, which creates two questions related by linguistic perturbations. We also present VQA Perturbed Pairings (VQA P2), a new, low-cost benchmark and augmentation pipeline to create controllable linguistic variations.
arXiv Detail & Related papers (2020-11-26T17:38:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.