Related papers: Learning Trimodal Relation for AVQA with Missing Modality

Learning Trimodal Relation for AVQA with Missing Modality

URL: http://arxiv.org/abs/2407.16171v1
Date: Tue, 23 Jul 2024 04:35:56 GMT
Title: Learning Trimodal Relation for AVQA with Missing Modality
Authors: Kyu Ri Park, Hong Joo Lee, Jung Uk Kim,
Abstract summary: We propose a framework that ensures robust Audio-Visual Question Answering (AVQA) performance even when a modality is missing. Our method can provide accurate answers by effectively utilizing available information even when input modalities are missing.
Score: 13.705369273831055
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent Audio-Visual Question Answering (AVQA) methods rely on complete visual and audio input to answer questions accurately. However, in real-world scenarios, issues such as device malfunctions and data transmission errors frequently result in missing audio or visual modality. In such cases, existing AVQA methods suffer significant performance degradation. In this paper, we propose a framework that ensures robust AVQA performance even when a modality is missing. First, we propose a Relation-aware Missing Modal (RMM) generator with Relation-aware Missing Modal Recalling (RMMR) loss to enhance the ability of the generator to recall missing modal information by understanding the relationships and context among the available modalities. Second, we design an Audio-Visual Relation-aware (AVR) diffusion model with Audio-Visual Enhancing (AVE) loss to further enhance audio-visual features by leveraging the relationships and shared cues between the audio-visual modalities. As a result, our method can provide accurate answers by effectively utilizing available information even when input modalities are missing. We believe our method holds potential applications not only in AVQA research but also in various multi-modal scenarios.

Related papers

Query-Guided Spatial-Temporal-Frequency Interaction for Music Audio-Visual Question Answering [13.757806950813995]
Audio--Visual Question Answering (AVQA) is a challenging multimodal task that requires jointly reasoning over audio, visual, and textual information in a given video to answer natural language questions.<n>We propose a novel Query-guided Spatial--Temporal--Frequency interaction method to enhance audio--visual understanding.<n>Our proposed method achieves significant performance improvements over existing Audio QA, Visual QA, Video QA, and AVQA approaches.
arXiv Detail & Related papers (2026-01-27T17:24:32Z)
ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering [54.72902502486611]
ReAG is a Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages.<n>ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence.
arXiv Detail & Related papers (2025-11-27T19:01:02Z)
Audio Does Matter: Importance-Aware Multi-Granularity Fusion for Video Moment Retrieval [33.114796739109075]
Video Moment Retrieval (VMR) aims to retrieve a specific moment semantically related to a given query.<n>Most existing VMR methods solely focus on the visual and textual modalities while neglecting the complementary but important audio modality.<n>We propose a novel Importance-aware Multi-Granularity fusion model (IMG), which learns to dynamically and selectively aggregate the audio-vision-text contexts for VMR.
arXiv Detail & Related papers (2025-08-06T09:58:43Z)
MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks [67.31276358668424]
We introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer.<n> AVHaystacks is an audio-visual benchmark comprising 3100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding task.<n>We propose a model-agnostic, multi-agent framework to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores in QA task on our proposed AVHaystack
arXiv Detail & Related papers (2025-06-08T06:34:29Z)
AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z)
SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering [53.00674706030977]
We introduce the Source-aware Semantic Representation Network (SaSR-Net), a novel model designed for Audio-Visual Question Answering (AVQA) SaSR-Net utilizes source-wise learnable tokens to efficiently capture and align audio-visual elements with the corresponding question. Experiments on the Music-AVQA and AVQA-Yang datasets show that SaSR-Net outperforms state-of-the-art AVQA methods.
arXiv Detail & Related papers (2024-11-07T18:12:49Z)
Listen Then See: Video Alignment with Speaker Attention [0.0]
Socially Intelligent Question Answering (SIQA) requires context understanding, temporal reasoning, and the integration of multimodal information. We introduce a cross-modal alignment and subsequent representation fusion approach that achieves state-of-the-art results. Our approach exhibits an improved ability to leverage the video modality by using the audio modality as a bridge with the language modality.
arXiv Detail & Related papers (2024-04-21T04:55:13Z)
Answering Diverse Questions via Text Attached with Key Audio-Visual Clues [24.347420432207283]
We propose a framework for performing mutual correlation distillation (MCD) to aid question inference. We evaluate the proposed method on two publicly available datasets containing multiple question-and-answer pairs.
arXiv Detail & Related papers (2024-03-11T12:51:37Z)
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios [69.94398424864595]
This paper focuses on the challenge of answering questions in scenarios composed of rich and complex dynamic audio-visual components. We introduce the CAT, which enhances Multimodal Large Language Models (MLLMs) in three ways. CAT is trained on a mixed multimodal dataset, allowing direct application in audio-visual scenarios.
arXiv Detail & Related papers (2024-03-07T16:31:02Z)
A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition [53.800937914403654]
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames. While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input. We propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality.
arXiv Detail & Related papers (2024-03-07T06:06:55Z)
Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning. We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
Learning to Answer Questions in Dynamic Audio-Visual Scenarios [81.19017026999218]
We focus on the Audio-Visual Questioning (AVQA) task, which aims to answer questions regarding different visual objects sounds, and their associations in videos. Our dataset contains more than 45K question-answer pairs spanning over different modalities and question types. Our results demonstrate that AVQA benefits from multisensory perception and our model outperforms recent A-SIC, V-SIC, and AVQA approaches.
arXiv Detail & Related papers (2022-03-26T13:03:42Z)
Self-attention fusion for audiovisual emotion recognition with incomplete data [103.70855797025689]
We consider the problem of multimodal data analysis with a use case of audiovisual emotion recognition. We propose an architecture capable of learning from raw data and describe three variants of it with distinct modality fusion mechanisms.
arXiv Detail & Related papers (2022-01-26T18:04:29Z)
A Multi-View Approach To Audio-Visual Speaker Verification [38.9710777250597]
In this study, we explore audio-visual approaches to speaker verification. We report the lowest AV equal error rate (EER) of 0.7% on the VoxCeleb1 dataset. This new approach achieves 28% EER on VoxCeleb1 in the challenging testing condition of cross-modal verification.
arXiv Detail & Related papers (2021-02-11T22:29:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.