Question-Aware Gaussian Experts for Audio-Visual Question Answering
- URL: http://arxiv.org/abs/2503.04459v3
- Date: Wed, 11 Jun 2025 12:30:39 GMT
- Title: Question-Aware Gaussian Experts for Audio-Visual Question Answering
- Authors: Hongyeob Kim, Inyoung Jung, Dayoon Suh, Youjia Zhang, Sangmin Lee, Sungeun Hong
- Abstract summary: Audio-Visual Question Answering (AVQA) requires question-based multimodal reasoning and precise temporal grounding. This paper proposes QA-TIGER, a novel framework that explicitly incorporates question information and models continuous temporal dynamics.
- Score: 8.377705744753047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-Visual Question Answering (AVQA) requires not only question-based multimodal reasoning but also precise temporal grounding to capture subtle dynamics for accurate prediction. However, existing methods mainly use question information implicitly, limiting focus on question-specific details. Furthermore, most studies rely on uniform frame sampling, which can miss key question-relevant frames. Although recent Top-K frame selection methods aim to address this, their discrete nature still overlooks fine-grained temporal details. This paper proposes QA-TIGER, a novel framework that explicitly incorporates question information and models continuous temporal dynamics. Our key idea is to use Gaussian-based modeling to adaptively focus on both consecutive and non-consecutive frames based on the question, while explicitly injecting question information and applying progressive refinement. We leverage a Mixture of Experts (MoE) to flexibly implement multiple Gaussian models, activating temporal experts specifically tailored to the question. Extensive experiments on multiple AVQA benchmarks show that QA-TIGER consistently achieves state-of-the-art performance. Code is available at https://aim-skku.github.io/QA-TIGER/
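To make the key idea concrete, here is a minimal NumPy sketch of question-conditioned Gaussian temporal experts mixed by a router. It is an illustration under assumed shapes and interfaces, not the authors' implementation; the projection matrices (W_mu, W_sigma, W_router) and all dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, D, E = 60, 256, 4  # frames, feature dim, number of temporal experts

frames = rng.normal(size=(T, D))   # per-frame (audio or visual) features
question = rng.normal(size=(D,))   # pooled question embedding

# Hypothetical projections: each expert maps the question to a Gaussian
# center mu in [0, 1] and width sigma > 0 over normalized frame time.
W_mu = rng.normal(size=(E, D)) / np.sqrt(D)
W_sigma = rng.normal(size=(E, D)) / np.sqrt(D)
W_router = rng.normal(size=(E, D)) / np.sqrt(D)

mu = 1 / (1 + np.exp(-(W_mu @ question)))             # (E,) centers in [0, 1]
sigma = np.logaddexp(0.0, W_sigma @ question) + 1e-3  # (E,) softplus widths
gate = softmax(W_router @ question)                   # (E,) question-aware expert weights

t = np.linspace(0.0, 1.0, T)  # normalized frame times
# Each expert is a Gaussian over frames, normalized to sum to 1,
# so it can emphasize consecutive frames around its center.
experts = np.exp(-0.5 * ((t[None, :] - mu[:, None]) / sigma[:, None]) ** 2)
experts /= experts.sum(axis=1, keepdims=True)         # (E, T)

attn = gate @ experts   # (T,) mixture of Gaussians: question-specific temporal weights
pooled = attn @ frames  # (D,) temporally aggregated, question-aware features
```

Because the mixture can place mass around several centers, it can attend to non-consecutive question-relevant spans as well; in the paper this weighting is applied per modality and refined progressively, which the sketch omits.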
Related papers
- QuAnTS: Question Answering on Time Series [50.91478742616324]
We propose QuAnTS, a novel dataset for Question Answering on Time Series data. We pose a wide variety of questions and answers about human motion in the form of tracked skeleton trajectories. We verify through extensive experiments that the large-scale QuAnTS dataset is well-formed and comprehensive.
arXiv Detail & Related papers (2025-11-07T10:07:03Z)
- Multi-hop Question Answering under Temporal Knowledge Editing [9.356343796845662]
Multi-hop question answering (MQA) under knowledge editing (KE) has garnered significant attention in the era of large language models.
Existing models for MQA under KE exhibit poor performance when dealing with questions containing explicit temporal contexts.
We propose TEMPoral knowLEdge augmented Multi-hop Question Answering (TEMPLE-MQA) to address this limitation.
arXiv Detail & Related papers (2024-03-30T23:22:51Z)
- Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering [11.244643114253773]
Video Question Answering (VideoQA) aims to answer natural language questions based on the information observed in videos.
We propose a novel weakly supervised framework that enforces the LMMs to reason out the answers with question-critical moments as visual inputs (see the toy sketch after this entry).
arXiv Detail & Related papers (2024-01-19T14:21:46Z)
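As a toy illustration of such Gaussian grounding (an assumed interface, not the paper's actual method): if a grounding head predicts a Gaussian center and width over normalized video time from the question, the question-critical moments can be taken as the frames with the highest Gaussian weight.

```python
import numpy as np

def gaussian_frame_selection(center, width, num_frames=32, k=8):
    """Return (sorted) indices of the k frames with the highest weight
    under a Gaussian over normalized time; center and width would come
    from a question-conditioned grounding head in practice."""
    t = np.linspace(0.0, 1.0, num_frames)
    w = np.exp(-0.5 * ((t - center) / width) ** 2)
    return np.sort(np.argsort(w)[-k:])

# e.g. a head predicted center=0.7, width=0.1 for a given question
print(gaussian_frame_selection(0.7, 0.1))
```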
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Locate before Answering: Answer Guided Question Localization for Video Question Answering [70.38700123685143]
LocAns integrates a question locator and an answer predictor into an end-to-end model.
It achieves state-of-the-art performance on two modern long-term VideoQA datasets.
arXiv Detail & Related papers (2022-10-05T08:19:16Z)
- Learning to Answer Questions in Dynamic Audio-Visual Scenarios [81.19017026999218]
We focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos.
Our dataset contains more than 45K question-answer pairs spanning different modalities and question types.
Our results demonstrate that AVQA benefits from multisensory perception, and our model outperforms recent A-, V-, and AVQA approaches.
arXiv Detail & Related papers (2022-03-26T13:03:42Z)
- NAAQA: A Neural Architecture for Acoustic Question Answering [8.364707318181193]
The goal of the Acoustic Question Answering (AQA) task is to answer a free-form text question about the content of an acoustic scene.
We propose a new benchmark for AQA, namely CLEAR2, that emphasizes the specific challenges of acoustic inputs.
We also introduce NAAQA, a neural architecture that leverages specific properties of acoustic inputs.
arXiv Detail & Related papers (2021-06-11T03:05:48Z)
- NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions [80.60423934589515]
We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark.
We set up multi-choice and open-ended QA tasks targeting causal action reasoning, temporal action reasoning, and common scene comprehension.
We find that top-performing methods excel at shallow scene descriptions but are weak in causal and temporal action reasoning.
arXiv Detail & Related papers (2021-05-18T04:56:46Z)
- Few-Shot Question Answering by Pretraining Span Selection [58.31911597824848]
We explore the more realistic few-shot setting, where only a few hundred training examples are available.
We show that standard span selection models perform poorly, highlighting the fact that current pretraining objectives are far removed from question answering.
Our findings indicate that careful design of pretraining schemes and model architecture can have a dramatic effect on performance in the few-shot setting.
arXiv Detail & Related papers (2021-01-02T11:58:44Z)
- ClarQ: A large-scale and diverse dataset for Clarification Question Generation [67.1162903046619]
We devise a novel bootstrapping framework that assists in the creation of a diverse, large-scale dataset of clarification questions based on post-comments extracted from StackExchange.
We quantitatively demonstrate the utility of the newly created dataset by applying it to the downstream task of question-answering.
We release this dataset in order to foster research into the field of clarification question generation with the larger goal of enhancing dialog and question answering systems.
arXiv Detail & Related papers (2020-06-10T17:56:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.