Multimodal Imbalance-Aware Gradient Modulation for Weakly-supervised
Audio-Visual Video Parsing
- URL: http://arxiv.org/abs/2307.02041v1
- Date: Wed, 5 Jul 2023 05:55:10 GMT
- Title: Multimodal Imbalance-Aware Gradient Modulation for Weakly-supervised
Audio-Visual Video Parsing
- Authors: Jie Fu, Junyu Gao, Changsheng Xu
- Abstract summary: Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize the temporal extents of audio, visual and audio-visual event instances and to identify the corresponding event categories with only video-level category labels for training.
- Score: 107.031903351176
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize the
temporal extents of audio, visual and audio-visual event instances as well as
identify the corresponding event categories with only video-level category
labels for training. Most previous methods pay much attention to refining the
supervision for each modality or to extracting fruitful cross-modality
information for more reliable feature learning, but none of them considers the
imbalanced feature learning between the two modalities in this task. In this
paper, to balance the feature learning processes of the different modalities,
we explore a dynamic gradient modulation (DGM) mechanism, in which a novel and
effective metric function measures the degree of imbalance between audio and
visual feature learning. Furthermore, a principled analysis indicates that a
confounded multimodal decision calculation hampers the precise measurement of
this imbalance, which in turn weakens the effectiveness of the DGM mechanism.
To address this issue, we design a modality-separated decision unit (MSDU)
that measures the imbalanced feature learning between the audio and visual
modalities more precisely. Comprehensive experiments on public benchmarks
demonstrate the effectiveness of the proposed method.
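The abstract specifies neither the exact metric function of the DGM mechanism nor the internal structure of the MSDU, so the following Python (PyTorch) sketch only illustrates the general idea under stated assumptions: per-modality classifiers produce separated predictions, the imbalance is measured as the ratio of the modalities' mean confidences on the video-level labels, and the gradients of the currently dominant modality's encoder are scaled down. All names (ModalitySeparatedHead, imbalance_ratio, modulate_gradients) and the tanh-based scaling rule are hypothetical, not taken from the paper.

    import torch
    import torch.nn as nn

    class ModalitySeparatedHead(nn.Module):
        # Hypothetical MSDU-style head: one classifier per modality, so each
        # modality's confidence can be read off without cross-modal confusion.
        def __init__(self, dim, num_classes):
            super().__init__()
            self.audio_cls = nn.Linear(dim, num_classes)
            self.visual_cls = nn.Linear(dim, num_classes)

        def forward(self, audio_feat, visual_feat):
            return self.audio_cls(audio_feat), self.visual_cls(visual_feat)

    def imbalance_ratio(audio_logits, visual_logits, video_labels, eps=1e-8):
        # Assumed imbalance metric: mean confidence of each modality on the
        # multi-hot video-level labels (the only supervision in WS-AVVP).
        a_conf = (torch.sigmoid(audio_logits) * video_labels).sum(dim=-1).mean()
        v_conf = (torch.sigmoid(visual_logits) * video_labels).sum(dim=-1).mean()
        return a_conf / (v_conf + eps)  # rho > 1: audio is learning faster

    def modulate_gradients(audio_encoder, visual_encoder, rho, alpha=0.5):
        # Scale down the gradients of the currently dominant modality so the
        # weaker one can catch up; intended to run after loss.backward().
        if rho > 1.0:
            scale, module = 1.0 - torch.tanh(alpha * (rho - 1.0)), audio_encoder
        else:
            scale, module = 1.0 - torch.tanh(alpha * (1.0 / rho - 1.0)), visual_encoder
        for p in module.parameters():
            if p.grad is not None:
                p.grad.mul_(scale)

In a training loop, modulate_gradients would be called between loss.backward() and optimizer.step(), so that only the gradient magnitudes, not the loss itself, are modulated.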
Related papers
- Learning to Unify Audio, Visual and Text for Audio-Enhanced Multilingual Visual Answer Localization [4.062872727927056]
The goal of Multilingual Visual Answer Localization (MVAL) is to locate a video segment that answers a given multilingual question.
Existing methods either focus solely on the visual modality or integrate visual and subtitle modalities.
We propose a unified Audio-Visual-Textual Span localization (AVTSL) method that incorporates audio modality to augment both visual and textual representations.
arXiv Detail & Related papers (2024-11-05T06:49:14Z)
- A contrastive-learning approach for auditory attention detection [11.28441753596964]
We propose a method based on self-supervised learning to minimize the difference between the latent representations of an attended speech signal and the corresponding EEG signal.
We compare our results with previously published methods and achieve state-of-the-art performance on the validation set.
arXiv Detail & Related papers (2024-10-24T03:13:53Z)
- On-the-fly Modulation for Balanced Multimodal Learning [53.616094855778954]
Multimodal learning is expected to boost model performance by integrating information from different modalities.
The widely-used joint training strategy leads to imbalanced and under-optimized uni-modal representations.
We propose On-the-fly Prediction Modulation (OPM) and On-the-fly Gradient Modulation (OGM) strategies to modulate the optimization of each modality.
arXiv Detail & Related papers (2024-10-15T13:15:50Z)
- An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition [49.45660055499103]
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training.
Previous research has focused on aligning sequences' visual and semantic spatial distributions.
We introduce a new loss function sampling method to obtain a tight and robust representation.
arXiv Detail & Related papers (2024-06-02T06:53:01Z)
- Modeling Output-Level Task Relatedness in Multi-Task Learning with Feedback Mechanism [7.479892725446205]
Multi-task learning (MTL) is a paradigm that simultaneously learns multiple tasks by sharing information at different levels.
We introduce a posteriori information into the model, considering that different tasks may produce correlated outputs with mutual influences.
We achieve this by incorporating a feedback mechanism into MTL models, where the output of one task serves as a hidden feature for another task.
arXiv Detail & Related papers (2024-04-01T03:27:34Z)
- A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition [53.800937914403654]
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames.
While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input.
We propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality.
arXiv Detail & Related papers (2024-03-07T06:06:55Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification.
We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned.
Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z)
- Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking [18.225204270240734]
We propose a novel Multi-modal Perception Tracker (MPT) for speaker tracking using both audio and visual modalities.
MPT achieves 98.6% and 78.3% tracking accuracy on the standard and occluded datasets, respectively.
arXiv Detail & Related papers (2021-12-14T14:14:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.