Progressive Spatio-temporal Perception for Audio-Visual Question
Answering
- URL: http://arxiv.org/abs/2308.05421v1
- Date: Thu, 10 Aug 2023 08:29:36 GMT
- Title: Progressive Spatio-temporal Perception for Audio-Visual Question
Answering
- Authors: Guangyao Li, Wenxuan Hou, Di Hu
- Abstract summary: The Audio-Visual Question Answering (AVQA) task aims to answer questions about different visual objects, sounds, and their associations in videos.
We propose a Progressive Spatio-Temporal Perception Network (PSTP-Net), which contains three modules that progressively identify key spatio-temporal regions with respect to the question.
- Score: 9.727492401851478
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Audio-Visual Question Answering (AVQA) task aims to answer questions about
different visual objects, sounds, and their associations in videos. Such
naturally multi-modal videos are composed of rich and complex dynamic
audio-visual components, most of which can be unrelated to the given
question or even act as interference when answering about the content of interest.
Conversely, focusing only on the question-aware audio-visual content removes
this interference and enables the model to answer more efficiently. In
this paper, we propose a Progressive Spatio-Temporal Perception Network
(PSTP-Net), which contains three modules that progressively identify key
spatio-temporal regions w.r.t. questions. Specifically, a temporal segment
selection module is first introduced to select the audio-visual segments most
relevant to the given question. Then, a spatial region selection module
is used to choose the regions most relevant to the question
from the selected temporal segments. To further refine the selected
features, an audio-guided visual attention module is employed to perceive the
association between audio and the selected spatial regions. Finally, the
spatio-temporal features from these modules are integrated to answer the
question. Extensive experimental results on the public MUSIC-AVQA and AVQA
datasets provide compelling evidence of the effectiveness and efficiency of
PSTP-Net. Code is available at: https://github.com/GeWu-Lab/PSTP-Net
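To make the three-stage pipeline described in the abstract concrete, below is a minimal PyTorch sketch of a progressive question-guided selection model in the spirit of PSTP-Net. All module names, feature dimensions, top-k values, and the fusion/answer head here are illustrative assumptions rather than the authors' implementation; refer to the official repository above for the actual code.

```python
# Minimal sketch of progressive question-guided spatio-temporal selection,
# loosely following the three modules described in the PSTP-Net abstract.
# Dimensions, top-k values, and the answer head are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def topk_by_similarity(feats, query, k):
    """Rank candidates by cosine similarity to the question and keep top-k.
    feats: (B, N, D), query: (B, D) -> (B, k, D)."""
    sim = F.cosine_similarity(feats, query.unsqueeze(1), dim=-1)      # (B, N)
    idx = sim.topk(k, dim=1).indices                                  # (B, k)
    return torch.gather(feats, 1,
                        idx.unsqueeze(-1).expand(-1, -1, feats.size(-1)))


class ProgressivePerception(nn.Module):
    def __init__(self, dim=512, n_heads=8, k_segments=4, k_regions=10,
                 n_answers=42):  # answer vocabulary size is illustrative
        super().__init__()
        self.k_segments, self.k_regions = k_segments, k_regions
        # Audio-guided visual attention: audio queries the selected regions.
        self.audio_guided_attn = nn.MultiheadAttention(dim, n_heads,
                                                       batch_first=True)
        self.fusion = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())
        self.classifier = nn.Linear(dim, n_answers)

    def forward(self, audio_seg, visual_seg, region_feats, question):
        """
        audio_seg:    (B, T, D)     segment-level audio features
        visual_seg:   (B, T, D)     segment-level visual features
        region_feats: (B, T, R, D)  patch/region features per segment
        question:     (B, D)        pooled question embedding
        """
        B, T, R, D = region_feats.shape

        # 1) Temporal segment selection: keep question-relevant segments.
        av_seg = audio_seg + visual_seg                               # (B, T, D)
        sim = F.cosine_similarity(av_seg, question.unsqueeze(1), dim=-1)
        seg_idx = sim.topk(self.k_segments, dim=1).indices            # (B, k_seg)
        sel_audio = torch.gather(audio_seg, 1,
                                 seg_idx.unsqueeze(-1).expand(-1, -1, D))
        sel_regions = torch.gather(region_feats, 1,
                                   seg_idx[..., None, None].expand(-1, -1, R, D))

        # 2) Spatial region selection within the chosen segments.
        flat_regions = sel_regions.reshape(B, -1, D)                  # (B, k_seg*R, D)
        sel_regions = topk_by_similarity(flat_regions, question, self.k_regions)

        # 3) Audio-guided visual attention over the selected regions.
        attended, _ = self.audio_guided_attn(sel_audio, sel_regions, sel_regions)

        # 4) Integrate audio, attended visual, and question features; predict answer.
        fused = self.fusion(torch.cat([sel_audio.mean(1),
                                       attended.mean(1),
                                       question], dim=-1))
        return self.classifier(fused)
```

The point mirrored in this sketch is that each stage shrinks the candidate set (segments, then regions) before the audio-guided attention runs, which is where the paper's efficiency claim comes from.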
Related papers
- SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering [53.00674706030977]
We introduce the Source-aware Semantic Representation Network (SaSR-Net), a novel model designed for Audio-Visual Question Answering (AVQA).
SaSR-Net utilizes source-wise learnable tokens to efficiently capture and align audio-visual elements with the corresponding question.
Experiments on the Music-AVQA and AVQA-Yang datasets show that SaSR-Net outperforms state-of-the-art AVQA methods.
arXiv Detail & Related papers (2024-11-07T18:12:49Z)
- Boosting Audio Visual Question Answering via Key Semantic-Aware Cues [8.526720031181027]
The Audio Visual Question Answering (AVQA) task aims to answer questions related to various visual objects, sounds, and their interactions in videos.
We propose a Temporal-Spatial Perception Model (TSPM), which aims to empower the model to perceive key visual and auditory cues related to the questions.
arXiv Detail & Related papers (2024-07-30T09:41:37Z)
- CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering [6.719652962434731]
This paper proposes a new CLIP-powered target-aware single-stream (TASS) network for audio-visual question answering (AVQA).
It consists of two key components: the target-aware spatial grounding module (TSG+) and the single-stream joint temporal grounding module (JTG).
arXiv Detail & Related papers (2024-05-13T03:25:15Z)
- Answering Diverse Questions via Text Attached with Key Audio-Visual Clues [24.347420432207283]
We propose a framework for performing mutual correlation distillation (MCD) to aid question inference.
We evaluate the proposed method on two publicly available datasets containing multiple question-and-answer pairs.
arXiv Detail & Related papers (2024-03-11T12:51:37Z)
- CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios [69.94398424864595]
This paper focuses on the challenge of answering questions in scenarios composed of rich and complex dynamic audio-visual components.
We introduce the CAT, which enhances Multimodal Large Language Models (MLLMs) in three ways.
CAT is trained on a mixed multimodal dataset, allowing direct application in audio-visual scenarios.
arXiv Detail & Related papers (2024-03-07T16:31:02Z)
- Target-Aware Spatio-Temporal Reasoning via Answering Questions in Dynamics Audio-Visual Scenarios [7.938379811969159]
This paper proposes a new target-aware joint-temporal grounding network for audio-visual question answering (AVQA).
It consists of two key components: the target-aware spatial grounding module (TSG) and the single-stream joint audio-visual temporal grounding module (JTG).
The JTG incorporates audio-visual fusion and question-aware temporal grounding into one module with a simpler single-stream architecture.
arXiv Detail & Related papers (2023-05-21T08:21:36Z)
- Locate before Answering: Answer Guided Question Localization for Video Question Answering [70.38700123685143]
LocAns integrates a question locator and an answer predictor into an end-to-end model.
It achieves state-of-the-art performance on two modern long-term VideoQA datasets.
arXiv Detail & Related papers (2022-10-05T08:19:16Z)
- Learning to Answer Questions in Dynamic Audio-Visual Scenarios [81.19017026999218]
We focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos.
Our dataset contains more than 45K question-answer pairs spanning over different modalities and question types.
Our results demonstrate that AVQA benefits from multisensory perception and our model outperforms recent A-SIC, V-SIC, and AVQA approaches.
arXiv Detail & Related papers (2022-03-26T13:03:42Z)
- Relation-aware Video Reading Comprehension for Temporal Language Grounding [67.5613853693704]
Temporal language grounding in videos aims to localize the temporal span relevant to the given query sentence.
This paper formulates temporal language grounding as video reading comprehension and proposes a Relation-aware Network (RaNet) to address it.
arXiv Detail & Related papers (2021-10-12T03:10:21Z)
- Co-Saliency Spatio-Temporal Interaction Network for Person Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatial-temporal interaction modules within CSTNet are proposed, which exploit the spatial and temporal long-range context interdependencies of such features and their spatial-temporal information correlation.
arXiv Detail & Related papers (2020-04-10T10:23:58Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.