Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering
- URL: http://arxiv.org/abs/2404.12020v2
- Date: Mon, 20 May 2024 00:45:35 GMT
- Title: Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering
- Authors: Jie Ma, Min Hu, Pinghui Wang, Wangchun Sun, Lingyun Song, Hongbin Pei, Jun Liu, Youtian Du,
- Abstract summary: Audio-Visual Question Answering (AVQA) is a complex multi-modal reasoning task.
Current approaches are prone to overlearning dataset biases, resulting in poor robustness.
We propose a novel dataset, textitMUSIC-AVQA-R, crafted in two steps: rephrasing questions within the test split of a public dataset and introducing distribution shifts to split questions.
- Score: 25.577314828249897
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-Visual Question Answering (AVQA) is a complex multi-modal reasoning task, demanding intelligent systems to accurately respond to natural language queries based on audio-video input pairs. Nevertheless, prevalent AVQA approaches are prone to overlearning dataset biases, resulting in poor robustness. Furthermore, current datasets may not provide a precise diagnostic for these methods. To tackle these challenges, firstly, we propose a novel dataset, \textit{MUSIC-AVQA-R}, crafted in two steps: rephrasing questions within the test split of a public dataset (\textit{MUSIC-AVQA}) and subsequently introducing distribution shifts to split questions. The former leads to a large, diverse test space, while the latter results in a comprehensive robustness evaluation on rare, frequent, and overall questions. Secondly, we propose a robust architecture that utilizes a multifaceted cycle collaborative debiasing strategy to overcome bias learning. Experimental results show that this architecture achieves state-of-the-art performance on both datasets, especially obtaining a significant improvement of 9.32\% on the proposed dataset. Extensive ablation experiments are conducted on these two datasets to validate the effectiveness of the debiasing strategy. Additionally, we highlight the limited robustness of existing multi-modal QA methods through the evaluation on our dataset.
Related papers
- SEMQA: Semi-Extractive Multi-Source Question Answering [94.04430035121136]
We introduce a new QA task for answering multi-answer questions by summarizing multiple diverse sources in a semi-extractive fashion.
We create the first dataset of this kind, QuoteSum, with human-written semi-extractive answers to natural and generated questions.
arXiv Detail & Related papers (2023-11-08T18:46:32Z) - Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and
Beyond [93.96982273042296]
Vision-language (VL) understanding tasks evaluate models' comprehension of complex visual scenes through multiple-choice questions.
We have identified two dataset biases that models can exploit as shortcuts to resolve various VL tasks correctly without proper understanding.
We propose Adversarial Data Synthesis (ADS) to generate synthetic training and debiased evaluation data.
We then introduce Intra-sample Counterfactual Training (ICT) to assist models in utilizing the synthesized training data, particularly the counterfactual data, via focusing on intra-sample differentiation.
arXiv Detail & Related papers (2023-10-23T08:09:42Z) - UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z) - Tackling Data Bias in MUSIC-AVQA: Crafting a Balanced Dataset for
Unbiased Question-Answering [10.00386025149584]
We review each question type from the original dataset, selecting those with pronounced answer biases.
We gather complementary videos and questions, ensuring that no answers have outstanding skewed distribution.
We present a novel baseline model that delves deeper into the audio-visual-text interrelation.
arXiv Detail & Related papers (2023-10-10T01:22:41Z) - Read, Look or Listen? What's Needed for Solving a Multimodal Dataset [7.0430001782867]
We propose a two-step method to analyze multimodal datasets, which leverages a small seed of human annotation to map each multimodal instance to the modalities required to process it.
We apply our approach to TVQA, a video question-answering dataset, and discover that most questions can be answered using a single modality, without a substantial bias towards any specific modality.
We analyze the MERLOT Reserve, finding that it struggles with image-based questions compared to text and audio, but also with auditory speaker identification.
arXiv Detail & Related papers (2023-07-06T08:02:45Z) - Mixed-modality Representation Learning and Pre-training for Joint
Table-and-Text Retrieval in OpenQA [85.17249272519626]
An optimized OpenQA Table-Text Retriever (OTTeR) is proposed.
We conduct retrieval-centric mixed-modality synthetic pre-training.
OTTeR substantially improves the performance of table-and-text retrieval on the OTT-QA dataset.
arXiv Detail & Related papers (2022-10-11T07:04:39Z) - Modern Question Answering Datasets and Benchmarks: A Survey [5.026863544662493]
Question Answering (QA) is one of the most important natural language processing (NLP) tasks.
It aims using NLP technologies to generate a corresponding answer to a given question based on the massive unstructured corpus.
In this paper, we investigate influential QA datasets that have been released in the era of deep learning.
arXiv Detail & Related papers (2022-06-30T05:53:56Z) - The Multimodal Sentiment Analysis in Car Reviews (MuSe-CaR) Dataset:
Collection, Insights and Improvements [14.707930573950787]
We present MuSe-CaR, a first of its kind multimodal dataset.
The data is publicly available as it recently served as the testing bed for the 1st Multimodal Sentiment Analysis Challenge.
arXiv Detail & Related papers (2021-01-15T10:40:37Z) - Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z) - Improving Multi-Turn Response Selection Models with Complementary
Last-Utterance Selection by Instance Weighting [84.9716460244444]
We consider utilizing the underlying correlation in the data resource itself to derive different kinds of supervision signals.
We conduct extensive experiments in two public datasets and obtain significant improvement in both datasets.
arXiv Detail & Related papers (2020-02-18T06:29:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.