Answering Diverse Questions via Text Attached with Key Audio-Visual
Clues
- URL: http://arxiv.org/abs/2403.06679v1
- Date: Mon, 11 Mar 2024 12:51:37 GMT
- Title: Answering Diverse Questions via Text Attached with Key Audio-Visual
Clues
- Authors: Qilang Ye and Zitong Yu and Xin Liu
- Abstract summary: We propose a framework for performing mutual correlation distillation (MCD) to aid question inference.
We evaluate the proposed method on two publicly available datasets containing multiple question-and-answer pairs.
- Score: 24.347420432207283
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio-visual question answering (AVQA) requires reference to video content
and auditory information, followed by correlating the question to predict the
most precise answer. Although mining deeper layers of audio-visual information
to interact with questions facilitates the multimodal fusion process, the
redundancy of audio-visual parameters tends to reduce the generalization of the
inference engine to multiple question-answer pairs in a single video. Indeed,
the natural heterogeneity between the audio-visual and textual modalities makes
perfect fusion challenging. To prevent high-level audio-visual semantics from
weakening the network's adaptability to diverse question types, we propose a
framework that performs mutual correlation distillation (MCD) to aid question
inference. MCD comprises three main steps: 1) a residual structure built on
self-attention enhances the soft audio-visual associations, after which shared
aggregators hierarchically capture key local audio-visual features relevant to
the question context and couple them, as clues, with the specific question
vectors; 2) knowledge distillation aligns audio-visual-text pairs in a shared
latent space to narrow the cross-modal semantic gap; 3) the audio-visual
dependencies are decoupled by discarding the decision-level integrations. We
evaluate the proposed method on two publicly available datasets containing
multiple question-and-answer pairs, i.e., Music-AVQA and AVQA. Experiments show
that our method outperforms other state-of-the-art methods, and one interesting
finding is that removing deep audio-visual features during inference can
effectively mitigate overfitting. The source code is released at
http://github.com/rikeilong/MCD-forAVQA.
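The abstract describes MCD in prose only. The following is a minimal, hypothetical PyTorch sketch of how the three steps could fit together; all module names, dimensions, and the cosine-based distillation loss are illustrative assumptions and are not taken from the authors' released code linked above.

```python
# Hypothetical sketch of the three MCD steps described in the abstract.
# Names, dimensions, and losses are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualSelfAttention(nn.Module):
    """Step 1a: self-attention with a residual connection to enhance
    the soft audio-visual associations."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x)
        return self.norm(x + out)  # residual keeps the original signal

class SharedAggregator(nn.Module):
    """Step 1b: a question-conditioned aggregator shared across modalities;
    it pools the local features most relevant to the question context."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feats: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, D), question: (B, D)
        w = self.score(torch.tanh(feats + question.unsqueeze(1)))  # (B, T, 1)
        w = torch.softmax(w, dim=1)
        return (w * feats).sum(dim=1)  # (B, D) question-relevant "clue"

class MCDSketch(nn.Module):
    def __init__(self, dim: int = 512, num_answers: int = 42):  # sizes are placeholders
        super().__init__()
        self.audio_enh = ResidualSelfAttention(dim)
        self.visual_enh = ResidualSelfAttention(dim)
        self.aggregator = SharedAggregator(dim)   # shared by audio and video
        self.proj = nn.Linear(dim, dim)           # shared latent space for distillation
        self.classifier = nn.Linear(2 * dim, num_answers)

    def forward(self, audio, visual, question, train: bool = True):
        # Step 1: enhance soft associations, then attach key clues to the question text.
        a = self.audio_enh(audio)    # (B, Ta, D)
        v = self.visual_enh(visual)  # (B, Tv, D)
        clue = self.aggregator(a, question) + self.aggregator(v, question)
        logits = self.classifier(torch.cat([question, clue], dim=-1))

        if not train:
            # Step 3: decision-level audio-visual integration is discarded at
            # inference; only the clue-augmented text branch predicts.
            return logits

        # Step 2: distillation that pulls audio/visual embeddings toward the
        # question embedding in a shared latent space (cosine alignment here).
        za = F.normalize(self.proj(a.mean(dim=1)), dim=-1)
        zv = F.normalize(self.proj(v.mean(dim=1)), dim=-1)
        zq = F.normalize(self.proj(question), dim=-1)
        kd_loss = (1 - (za * zq).sum(-1)).mean() + (1 - (zv * zq).sum(-1)).mean()
        return logits, kd_loss
```

Calling the model with train=False mirrors step 3: the distillation branch and the decision-level audio-visual integration are dropped at inference, which the paper reports helps mitigate overfitting.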
Related papers
- SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering [53.00674706030977]
We introduce the Source-aware Semantic Representation Network (SaSR-Net), a novel model designed for Audio-Visual Question Answering (AVQA).
SaSR-Net utilizes source-wise learnable tokens to efficiently capture and align audio-visual elements with the corresponding question.
Experiments on the Music-AVQA and AVQA-Yang datasets show that SaSR-Net outperforms state-of-the-art AVQA methods.
arXiv Detail & Related papers (2024-11-07T18:12:49Z) - Boosting Audio Visual Question Answering via Key Semantic-Aware Cues [8.526720031181027]
The Audio Visual Question Answering (AVQA) task aims to answer questions related to various visual objects, sounds, and their interactions in videos.
We propose a Temporal-Spatial Perception Model (TSPM), which aims to empower the model to perceive key visual and auditory cues related to the questions.
arXiv Detail & Related papers (2024-07-30T09:41:37Z) - CAT: Enhancing Multimodal Large Language Model to Answer Questions in
Dynamic Audio-Visual Scenarios [69.94398424864595]
This paper focuses on the challenge of answering questions in scenarios composed of rich and complex dynamic audio-visual components.
We introduce the CAT, which enhances Multimodal Large Language Models (MLLMs) in three ways.
CAT is trained on a mixed multimodal dataset, allowing direct application in audio-visual scenarios.
arXiv Detail & Related papers (2024-03-07T16:31:02Z) - Object-aware Adaptive-Positivity Learning for Audio-Visual Question
Answering [27.763940453394902]
This paper focuses on the Audio-Visual Question Answering (AVQA) task that aims to answer questions derived from untrimmed audible videos.
To generate accurate answers, an AVQA model is expected to find the most informative audio-visual clues relevant to the given questions.
arXiv Detail & Related papers (2023-12-20T07:36:38Z) - QDFormer: Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition [47.103732403296654]
The multi-source semantic space can be represented as the Cartesian product of single-source sub-spaces.
We introduce a global-to-local quantization mechanism, which distills knowledge from stable global (clip-level) features into local (frame-level) ones.
Experiments demonstrate that our semantically decomposed audio representation significantly improves AVS performance.
arXiv Detail & Related papers (2023-09-29T20:48:44Z) - Multi-Scale Attention for Audio Question Answering [9.254814692650523]
Audio question answering (AQA) acts as a widely used proxy task for exploring scene understanding.
Existing methods mostly extend structures from the visual question answering task to the audio domain in a simple pattern.
We present a Multi-scale Window Attention Fusion Model (MWAFM) consisting of an asynchronous hybrid attention module and a multi-scale window attention module.
arXiv Detail & Related papers (2023-05-29T10:06:58Z) - Learning to Answer Questions in Dynamic Audio-Visual Scenarios [81.19017026999218]
We focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos.
Our dataset contains more than 45K question-answer pairs spanning over different modalities and question types.
Our results demonstrate that AVQA benefits from multisensory perception and our model outperforms recent A-SIC, V-SIC, and AVQA approaches.
arXiv Detail & Related papers (2022-03-26T13:03:42Z) - MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs are within two feature spaces.
We propose Multi-Granularity Alignment architecture for Visual Question Answering task (MGA-VQA)
Our model splits alignment into different levels to achieve learning better correlations without needing additional data and annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z) - AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches exploit only the visual information while neglecting the audio information.
We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
arXiv Detail & Related papers (2021-05-17T08:36:10Z)