Target-Aware Spatio-Temporal Reasoning via Answering Questions in
Dynamics Audio-Visual Scenarios
- URL: http://arxiv.org/abs/2305.12397v2
- Date: Fri, 8 Dec 2023 08:44:54 GMT
- Title: Target-Aware Spatio-Temporal Reasoning via Answering Questions in
Dynamics Audio-Visual Scenarios
- Authors: Yuanyuan Jiang and Jianqin Yin
- Abstract summary: This paper proposes a new target-aware joint spatio-temporal grounding network for audio-visual question answering (AVQA).
It consists of two key components: the target-aware spatial grounding module (TSG) and the single-stream joint audio-visual temporal grounding module (JTG).
The JTG incorporates audio-visual fusion and question-aware temporal grounding into one module with a simpler single-stream architecture.
- Score: 7.938379811969159
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-visual question answering (AVQA) is a challenging task that requires
multistep spatio-temporal reasoning over multimodal contexts. Recent works rely
on elaborate target-agnostic parsing of audio-visual scenes for spatial
grounding while mistreating audio and video as separate entities for temporal
grounding. This paper proposes a new target-aware joint spatio-temporal
grounding network for AVQA. It consists of two key components: the target-aware
spatial grounding module (TSG) and the single-stream joint audio-visual
temporal grounding module (JTG). The TSG can focus on audio-visual cues
relevant to the query subject by utilizing explicit semantics from the
question. Unlike previous two-stream temporal grounding modules that required
an additional audio-visual fusion module, JTG incorporates audio-visual fusion
and question-aware temporal grounding into one module with a simpler
single-stream architecture. The temporal synchronization between audio and
video in the JTG is facilitated by our proposed cross-modal synchrony loss
(CSL). Extensive experiments verified the effectiveness of our proposed method
over existing state-of-the-art methods.
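Since the abstract only names the modules, a minimal PyTorch sketch may help make two of the ideas concrete: question-guided attention over visual patch features (the TSG intuition) and a synchrony term that pulls temporally aligned audio and visual features together (the CSL intuition). The tensor shapes, the dot-product attention, and the InfoNCE-style formulation below are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the authors' code) of the two ideas described in the abstract:
# (1) target-aware spatial grounding: attend over visual patch features using a
#     question embedding as the query, and (2) a cross-modal synchrony term that
#     encourages audio and visual features of the same time step to agree.
import torch
import torch.nn.functional as F


def target_aware_spatial_grounding(patches, question):
    """patches: [B, T, P, D] visual patch features; question: [B, D] query embedding."""
    q = question[:, None, None, :]                              # [B, 1, 1, D]
    scores = (patches * q).sum(-1) / patches.shape[-1] ** 0.5   # [B, T, P] scaled dot-product
    attn = scores.softmax(dim=-1)                               # attention over patches
    return (attn[..., None] * patches).sum(dim=2)               # [B, T, D] target-aware visual features


def cross_modal_synchrony_loss(audio, visual, tau=0.07):
    """audio, visual: [B, T, D]; pull temporally aligned pairs together (InfoNCE-style assumption)."""
    a = F.normalize(audio.flatten(0, 1), dim=-1)                # [B*T, D]
    v = F.normalize(visual.flatten(0, 1), dim=-1)               # [B*T, D]
    logits = a @ v.t() / tau                                    # similarity of every audio/visual step pair
    target = torch.arange(logits.shape[0], device=logits.device)
    return 0.5 * (F.cross_entropy(logits, target) + F.cross_entropy(logits.t(), target))


# Toy usage: 2 clips, 10 time steps, 14 patches, 512-dim features.
patches = torch.randn(2, 10, 14, 512)
question = torch.randn(2, 512)
audio = torch.randn(2, 10, 512)
visual = target_aware_spatial_grounding(patches, question)
loss = cross_modal_synchrony_loss(audio, visual)
```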
Related papers
- Boosting Audio Visual Question Answering via Key Semantic-Aware Cues [8.526720031181027]
The Audio Visual Question Answering (AVQA) task aims to answer questions related to various visual objects, sounds, and their interactions in videos.
We propose a Temporal-Spatial Perception Model (TSPM), which aims to empower the model to perceive key visual and auditory cues related to the questions.
arXiv Detail & Related papers (2024-07-30T09:41:37Z)
- CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering [6.719652962434731]
This paper proposes a new CLIP-powered target-aware single-stream (TASS) network for audio-visual question answering (AVQA).
It consists of two key components: the target-aware spatial grounding module (TSG+) and the single-stream joint temporal grounding module (JTG).
arXiv Detail & Related papers (2024-05-13T03:25:15Z)
- EchoTrack: Auditory Referring Multi-Object Tracking for Autonomous Driving [64.58258341591929]
Auditory Referring Multi-Object Tracking (AR-MOT) is a challenging problem in autonomous driving.
We put forward EchoTrack, an end-to-end AR-MOT framework with dual-stream vision transformers.
We establish the first set of large-scale AR-MOT benchmarks.
arXiv Detail & Related papers (2024-02-28T12:50:16Z)
- Progressive Spatio-temporal Perception for Audio-Visual Question Answering [9.727492401851478]
The Audio-Visual Question Answering (AVQA) task aims to answer questions about different visual objects, sounds, and their associations in videos.
We propose a Progressive Spatio-Temporal Perception Network (PSTP-Net), which contains three modules that progressively identify key spatio-temporal regions.
arXiv Detail & Related papers (2023-08-10T08:29:36Z)
- Locate before Answering: Answer Guided Question Localization for Video Question Answering [70.38700123685143]
LocAns integrates a question locator and an answer predictor into an end-to-end model.
It achieves state-of-the-art performance on two modern long-term VideoQA datasets.
arXiv Detail & Related papers (2022-10-05T08:19:16Z)
- Rethinking Audio-visual Synchronization for Active Speaker Detection [62.95962896690992]
Existing research on active speaker detection (ASD) does not agree on the definition of active speakers.
We propose a cross-modal contrastive learning strategy and apply positional encoding in attention modules for supervised ASD models to leverage the synchronization cue.
Experimental results suggest that our model can successfully detect unsynchronized speaking as not speaking, addressing the limitation of current models.
arXiv Detail & Related papers (2022-06-21T14:19:06Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce interleaved graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos [91.29436920371003]
We propose a simple yet effective architecture named Spatial-Temporal HOI Detection (ST-HOI).
We use temporal information such as human and object trajectories, correctly-localized visual features, and spatial-temporal masking pose features.
We construct a new video HOI benchmark dubbed VidHOI where our proposed approach serves as a solid baseline.
arXiv Detail & Related papers (2021-05-25T07:54:35Z)
- Semantic Audio-Visual Navigation [93.12180578267186]
We introduce semantic audio-visual navigation, where objects in the environment make sounds consistent with their semantic meaning.
We propose a transformer-based model to tackle this new semantic AudioGoal task.
Our method strongly outperforms existing audio-visual navigation methods by learning to associate semantic, acoustic, and visual cues.
arXiv Detail & Related papers (2020-12-21T18:59:04Z)