SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering
- URL: http://arxiv.org/abs/2411.04933v3
- Date: Mon, 11 Nov 2024 02:41:31 GMT
- Authors: Tianyu Yang, Yiyang Nan, Lisen Dai, Zhenwen Liang, Yapeng Tian, Xiangliang Zhang
- Abstract summary: We introduce the Source-aware Semantic Representation Network (SaSR-Net), a novel model designed for Audio-Visual Question Answering (AVQA).
SaSR-Net utilizes source-wise learnable tokens to efficiently capture and align audio-visual elements with the corresponding question.
Experiments on the Music-AVQA and AVQA-Yang datasets show that SaSR-Net outperforms state-of-the-art AVQA methods.
- Abstract: Audio-Visual Question Answering (AVQA) is a challenging task that involves answering questions based on both auditory and visual information in videos. A significant challenge is interpreting complex multi-modal scenes, which include both visual objects and sound sources, and connecting them to the given question. In this paper, we introduce the Source-aware Semantic Representation Network (SaSR-Net), a novel model designed for AVQA. SaSR-Net utilizes source-wise learnable tokens to efficiently capture and align audio-visual elements with the corresponding question. It streamlines the fusion of audio and visual information using spatial and temporal attention mechanisms to identify answers in multi-modal scenes. Extensive experiments on the Music-AVQA and AVQA-Yang datasets show that SaSR-Net outperforms state-of-the-art AVQA methods.
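The abstract describes the core mechanism, source-wise learnable tokens that align audio-visual elements with the question, only at a high level. As a rough illustration, the PyTorch sketch below shows one plausible shape for this idea: a small set of per-source query vectors cross-attends to audio and visual features and fuses the results. All module names, dimensions, and the fusion step here are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SourceAwareTokens(nn.Module):
    """Minimal sketch of source-wise learnable tokens.

    Hypothetical: module names, dimensions, and the fusion step are
    assumptions, not SaSR-Net's actual architecture.
    """

    def __init__(self, num_sources: int = 8, dim: int = 512, heads: int = 8):
        super().__init__()
        # One learnable token per assumed sound source.
        self.tokens = nn.Parameter(torch.randn(num_sources, dim))
        # Cross-attention: tokens act as queries over each modality.
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, audio_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats:  (B, T_a, dim) frame-level audio embeddings
        # visual_feats: (B, T_v, dim) frame/patch-level visual embeddings
        B = audio_feats.size(0)
        q = self.tokens.unsqueeze(0).expand(B, -1, -1)   # (B, S, dim)
        a, _ = self.audio_attn(q, audio_feats, audio_feats)
        v, _ = self.visual_attn(q, visual_feats, visual_feats)
        # Each token now carries a source-specific audio-visual summary.
        return self.fuse(torch.cat([a, v], dim=-1))      # (B, S, dim)

if __name__ == "__main__":
    model = SourceAwareTokens()
    audio = torch.randn(2, 60, 512)   # 2 clips, 60 audio frames
    video = torch.randn(2, 60, 512)   # 2 clips, 60 visual frames
    print(model(audio, video).shape)  # torch.Size([2, 8, 512])
```

The intuition, as far as the abstract states it, is that keeping one token per sound source gives downstream question-answering layers a per-source fused representation to ground question words against, rather than a single pooled audio-visual vector.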