Boosting Audio Visual Question Answering via Key Semantic-Aware Cues
- URL: http://arxiv.org/abs/2407.20693v1
- Date: Tue, 30 Jul 2024 09:41:37 GMT
- Title: Boosting Audio Visual Question Answering via Key Semantic-Aware Cues
- Authors: Guangyao Li, Henghui Du, Di Hu
- Abstract summary: The Audio Visual Question Answering (AVQA) task aims to answer questions related to various visual objects, sounds, and their interactions in videos.
We propose a Temporal-Spatial Perception Model (TSPM), which aims to empower the model to perceive key visual and auditory cues related to the questions.
- Score: 8.526720031181027
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The Audio Visual Question Answering (AVQA) task aims to answer questions related to various visual objects, sounds, and their interactions in videos. Such naturally multimodal videos contain rich and complex dynamic audio-visual components, with only a portion of them closely related to the given questions. Hence, effectively perceiving audio-visual cues relevant to the given questions is crucial for correctly answering them. In this paper, we propose a Temporal-Spatial Perception Model (TSPM), which aims to empower the model to perceive key visual and auditory cues related to the questions. Specifically, considering the challenge of aligning non-declarative questions and visual representations into the same semantic space using visual-language pretrained models, we construct declarative sentence prompts derived from the question template, to assist the temporal perception module in better identifying critical segments relevant to the questions. Subsequently, a spatial perception module is designed to merge visual tokens from selected segments to highlight key latent targets, followed by cross-modal interaction with audio to perceive potential sound-aware areas. Finally, the significant temporal-spatial cues from these modules are integrated to answer the question. Extensive experiments on multiple AVQA benchmarks demonstrate that our framework excels not only in understanding audio-visual scenes but also in answering complex questions effectively. Code is available at https://github.com/GeWu-Lab/TSPM.
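The pipeline the abstract describes (declarative prompts for temporal segment selection, then token merging plus audio-guided attention for spatial cues) can be summarized in a short sketch. The module names, feature dimensions, cosine-similarity top-k rule, and toy answer head below are illustrative assumptions, not the authors' implementation; the official code lives at the repository linked above.

```python
# Minimal sketch of a TSPM-style temporal/spatial perception pipeline (PyTorch).
# All names, dimensions, and the top-k selection rule are assumptions for
# illustration; see https://github.com/GeWu-Lab/TSPM for the actual model.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalPerception(nn.Module):
    """Keep the top-k video segments most similar to the declarative prompt."""
    def __init__(self, top_k: int = 10):
        super().__init__()
        self.top_k = top_k

    def forward(self, seg_feats: torch.Tensor, prompt_feat: torch.Tensor) -> torch.Tensor:
        # seg_feats: [B, T, D] per-segment visual features
        # prompt_feat: [B, D] embedding of the declarative sentence prompt
        sim = F.cosine_similarity(seg_feats, prompt_feat.unsqueeze(1), dim=-1)  # [B, T]
        idx = sim.topk(self.top_k, dim=1).indices                               # [B, k]
        batch = torch.arange(seg_feats.size(0)).unsqueeze(1)                    # [B, 1]
        return seg_feats[batch, idx]                                            # [B, k, D]


class SpatialPerception(nn.Module):
    """Let audio features attend over merged visual tokens to find sound-aware areas."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis_tokens: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        # vis_tokens: [B, N, D] visual tokens merged from the selected segments
        # audio_feats: [B, k, D] audio features of the same segments (used as queries)
        fused, _ = self.attn(audio_feats, vis_tokens, vis_tokens)
        return fused                                                            # [B, k, D]


if __name__ == "__main__":
    B, T, N, D = 2, 60, 196, 512
    temporal, spatial = TemporalPerception(top_k=10), SpatialPerception(dim=D)
    key_segs = temporal(torch.randn(B, T, D), torch.randn(B, D))        # [2, 10, 512]
    sound_cues = spatial(torch.randn(B, N, D), torch.randn(B, 10, D))   # [2, 10, 512]
    # Toy answer head: pool the temporal-spatial cues and classify
    # (42 candidate answers chosen arbitrarily for the demo).
    logits = nn.Linear(D, 42)(torch.cat([key_segs, sound_cues], dim=1).mean(dim=1))
    print(logits.shape)  # torch.Size([2, 42])
```

In the full model the selected temporal segments and sound-aware spatial cues would be fused with the question representation before answer prediction; the toy linear head above only marks where that integration step sits.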
Related papers
- SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering [53.00674706030977]
We introduce the Source-aware Semantic Representation Network (SaSR-Net), a novel model designed for Audio-Visual Question Answering (AVQA).
SaSR-Net utilizes source-wise learnable tokens to efficiently capture and align audio-visual elements with the corresponding question.
Experiments on the Music-AVQA and AVQA-Yang datasets show that SaSR-Net outperforms state-of-the-art AVQA methods.
arXiv Detail & Related papers (2024-11-07T18:12:49Z) - KNN Transformer with Pyramid Prompts for Few-Shot Learning [52.735070934075736]
Few-Shot Learning aims to recognize new classes with limited labeled data.
Recent studies have attempted to address the challenge of rare samples with textual prompts to modulate visual features.
arXiv Detail & Related papers (2024-10-14T07:39:30Z) - Answering Diverse Questions via Text Attached with Key Audio-Visual Clues [24.347420432207283]
We propose a framework for performing mutual correlation distillation (MCD) to aid question inference.
We evaluate the proposed method on two publicly available datasets containing multiple question-and-answer pairs.
arXiv Detail & Related papers (2024-03-11T12:51:37Z) - Object-aware Adaptive-Positivity Learning for Audio-Visual Question Answering [27.763940453394902]
This paper focuses on the Audio-Visual Question Answering (AVQA) task that aims to answer questions derived from untrimmed audible videos.
To generate accurate answers, an AVQA model is expected to find the most informative audio-visual clues relevant to the given questions.
arXiv Detail & Related papers (2023-12-20T07:36:38Z) - Progressive Spatio-temporal Perception for Audio-Visual Question Answering [9.727492401851478]
The Audio-Visual Question Answering (AVQA) task aims to answer questions about different visual objects, sounds, and their associations in videos.
We propose a Progressive Spatio-Temporal Perception Network (PSTP-Net), which contains three modules that progressively identify key spatio-temporal regions relevant to the questions.
arXiv Detail & Related papers (2023-08-10T08:29:36Z) - Visual Causal Scene Refinement for Video Question Answering [117.08431221482638]
We present a causal analysis of VideoQA and propose a framework for cross-modal causal reasoning, named Visual Causal Scene Refinement (VCSR).
Our VCSR involves two essential modules that refine consecutive video frames guided by the question semantics to obtain more representative segment features for causal front-door intervention.
Experiments on the NExT-QA, Causal-VidQA, and MSRVTT-QA datasets demonstrate the superiority of our VCSR in discovering visual causal scenes and achieving robust video question answering.
arXiv Detail & Related papers (2023-05-07T09:05:19Z) - Locate before Answering: Answer Guided Question Localization for Video Question Answering [70.38700123685143]
LocAns integrates a question locator and an answer predictor into an end-to-end model.
It achieves state-of-the-art performance on two modern long-term VideoQA datasets.
arXiv Detail & Related papers (2022-10-05T08:19:16Z) - Learning to Answer Questions in Dynamic Audio-Visual Scenarios [81.19017026999218]
We focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos.
Our dataset contains more than 45K question-answer pairs spanning different modalities and question types.
Our results demonstrate that AVQA benefits from multisensory perception and that our model outperforms recent audio-only, visual-only, and audio-visual question answering approaches.
arXiv Detail & Related papers (2022-03-26T13:03:42Z) - Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates that pass the more relevant information on to the classifier (a minimal gating sketch follows this list).
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
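To make the frame-selection gating described in the last entry concrete, here is a minimal sketch in which a learned sigmoid gate scores each question/caption-fused frame feature and down-weights irrelevant frames before pooling and classification. The class name, dimensions, and answer count are assumptions for illustration, not the paper's implementation.

```python
# Minimal frame-selection gating sketch (PyTorch); names and sizes are illustrative.
import torch
import torch.nn as nn


class FrameSelectionGate(nn.Module):
    def __init__(self, dim: int = 768, num_answers: int = 5):
        super().__init__()
        self.score = nn.Linear(dim, 1)                 # per-frame relevance score
        self.classifier = nn.Linear(dim, num_answers)  # e.g. 5 multiple-choice answers

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: [B, T, D] frame representations already fused with the
        # question (and dense captions, in the paper's setting)
        gate = torch.sigmoid(self.score(frame_feats))        # [B, T, 1]
        pooled = (gate * frame_feats).sum(1) / gate.sum(1)   # gated average pooling
        return self.classifier(pooled)                       # [B, num_answers]


if __name__ == "__main__":
    logits = FrameSelectionGate()(torch.randn(2, 32, 768))
    print(logits.shape)  # torch.Size([2, 5])
```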