Multimodal Imbalance-Aware Gradient Modulation for Weakly-supervised
Audio-Visual Video Parsing
- URL: http://arxiv.org/abs/2307.02041v1
- Date: Wed, 5 Jul 2023 05:55:10 GMT
- Title: Multimodal Imbalance-Aware Gradient Modulation for Weakly-supervised
Audio-Visual Video Parsing
- Authors: Jie Fu, Junyu Gao, Changsheng Xu
- Abstract summary: Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize the temporal extents of audio, visual and audio-visual event instances and to identify the corresponding event categories with only video-level category labels for training.
- Score: 107.031903351176
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize the
temporal extents of audio, visual and audio-visual event instances as well as
identify the corresponding event categories with only video-level category
labels for training. Most previous methods pay much attention to refining the
supervision for each modality or to extracting fruitful cross-modality
information for more reliable feature learning, but none of them considers the
imbalanced feature learning between the two modalities in this task. In this
paper, to balance the feature learning processes of the different modalities,
we explore a dynamic gradient modulation (DGM) mechanism, in which a novel and
effective metric function measures the degree of imbalance between audio and
visual feature learning. Furthermore, a principled analysis indicates that a
confounded multimodal decision calculation hampers the precise measurement of
this imbalance, which in turn weakens the effectiveness of the DGM mechanism.
To address this issue, we design a modality-separated decision unit (MSDU)
that measures the imbalanced feature learning between the audio and visual
modalities more precisely. Comprehensive experiments on public benchmarks
demonstrate the effectiveness of the proposed method.
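The abstract specifies neither the exact metric function of the DGM mechanism nor the internal structure of the MSDU, so the following Python (PyTorch) sketch only illustrates the general idea under stated assumptions: per-modality classifiers produce separated predictions, the imbalance is measured as the ratio of the modalities' mean confidences on the video-level labels, and the gradients of the currently dominant modality's encoder are scaled down. All names (ModalitySeparatedHead, imbalance_ratio, modulate_gradients) and the tanh-based scaling rule are hypothetical, not taken from the paper.

    import torch
    import torch.nn as nn

    class ModalitySeparatedHead(nn.Module):
        # Hypothetical MSDU-style head: one classifier per modality, so each
        # modality's confidence can be read off without cross-modal confusion.
        def __init__(self, dim, num_classes):
            super().__init__()
            self.audio_cls = nn.Linear(dim, num_classes)
            self.visual_cls = nn.Linear(dim, num_classes)

        def forward(self, audio_feat, visual_feat):
            return self.audio_cls(audio_feat), self.visual_cls(visual_feat)

    def imbalance_ratio(audio_logits, visual_logits, video_labels, eps=1e-8):
        # Assumed imbalance metric: mean confidence of each modality on the
        # multi-hot video-level labels (the only supervision in WS-AVVP).
        a_conf = (torch.sigmoid(audio_logits) * video_labels).sum(dim=-1).mean()
        v_conf = (torch.sigmoid(visual_logits) * video_labels).sum(dim=-1).mean()
        return a_conf / (v_conf + eps)  # rho > 1: audio is learning faster

    def modulate_gradients(audio_encoder, visual_encoder, rho, alpha=0.5):
        # Scale down the gradients of the currently dominant modality so the
        # weaker one can catch up; intended to run after loss.backward().
        if rho > 1.0:
            scale, module = 1.0 - torch.tanh(alpha * (rho - 1.0)), audio_encoder
        else:
            scale, module = 1.0 - torch.tanh(alpha * (1.0 / rho - 1.0)), visual_encoder
        for p in module.parameters():
            if p.grad is not None:
                p.grad.mul_(scale)

In a training loop, modulate_gradients would be called between loss.backward() and optimizer.step(), so that only the gradient magnitudes, not the loss itself, are modulated.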
Related papers
- Learning to Unify Audio, Visual and Text for Audio-Enhanced Multilingual Visual Answer Localization [4.062872727927056]
The goal of Multilingual Visual Answer Localization (MVAL) is to locate a video segment that answers a given multilingual question.
Existing methods either focus solely on the visual modality or integrate visual and subtitle modalities.
We propose a unified Audio-Visual-Textual Span localization (AVTSL) method that incorporates audio modality to augment both visual and textual representations.
arXiv Detail & Related papers (2024-11-05T06:49:14Z)
- A contrastive-learning approach for auditory attention detection [11.28441753596964]
We propose a method based on self-supervised learning to minimize the difference between the latent representations of an attended speech signal and the corresponding EEG signal.
We compare our results with previously published methods and achieve state-of-the-art performance on the validation set.
arXiv Detail & Related papers (2024-10-24T03:13:53Z)
- On-the-fly Modulation for Balanced Multimodal Learning [53.616094855778954]
Multimodal learning is expected to boost model performance by integrating information from different modalities.
The widely-used joint training strategy leads to imbalanced and under-optimized uni-modal representations.
We propose On-the-fly Prediction Modulation (OPM) and On-the-fly Gradient Modulation (OGM) strategies to modulate the optimization of each modality.
arXiv Detail & Related papers (2024-10-15T13:15:50Z)
- An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition [49.45660055499103]
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training.
Previous research has focused on aligning sequences' visual and semantic spatial distributions.
We introduce a new loss function sampling method to obtain a tight and robust representation.
arXiv Detail & Related papers (2024-06-02T06:53:01Z)
- Modeling Output-Level Task Relatedness in Multi-Task Learning with Feedback Mechanism [7.479892725446205]
Multi-task learning (MTL) is a paradigm that simultaneously learns multiple tasks by sharing information at different levels.
We introduce a posteriori information into the model, considering that different tasks may produce correlated outputs with mutual influences.
We achieve this by incorporating a feedback mechanism into MTL models, where the output of one task serves as a hidden feature for another task.
arXiv Detail & Related papers (2024-04-01T03:27:34Z)
- A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition [53.800937914403654]
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames.
While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input.
We propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality.
arXiv Detail & Related papers (2024-03-07T06:06:55Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification.
We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned.
Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z)
- Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking [18.225204270240734]
We propose a novel Multi-modal Perception Tracker (MPT) for speaker tracking using both audio and visual modalities.
MPT achieves 98.6% and 78.3% tracking accuracy on the standard and occluded datasets, respectively.
arXiv Detail & Related papers (2021-12-14T14:14:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.