Multimodal Imbalance-Aware Gradient Modulation for Weakly-supervised
Audio-Visual Video Parsing
- URL: http://arxiv.org/abs/2307.02041v1
- Date: Wed, 5 Jul 2023 05:55:10 GMT
- Title: Multimodal Imbalance-Aware Gradient Modulation for Weakly-supervised
Audio-Visual Video Parsing
- Authors: Jie Fu, Junyu Gao, Changsheng Xu
- Abstract summary: Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize the temporal extents of audio, visual and audio-visual event instances,
and to identify the corresponding event categories, using only video-level category labels for training.
- Score: 107.031903351176
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize the
temporal extents of audio, visual and audio-visual event instances as well as
identify the corresponding event categories with only video-level category
labels for training. Most previous methods focus on refining the supervision for
each modality or on extracting fruitful cross-modal information for more reliable
feature learning; none of them, however, addresses the imbalanced feature learning
between different modalities in this task. In this paper, to
balance the feature learning processes of different modalities, a dynamic
gradient modulation (DGM) mechanism is explored, where a novel and effective
metric function is designed to measure the imbalanced feature learning between
audio and visual modalities. Furthermore, a principled analysis indicates that
computing decisions from confounded multimodal features hampers the precise
measurement of imbalanced multimodal feature learning, which in turn weakens the
effectiveness of our DGM mechanism. To cope with this issue, a modality-separated
decision unit (MSDU) is designed to measure the imbalanced feature learning
between audio and visual modalities more precisely. Comprehensive experiments are
conducted on public benchmarks and the corresponding experimental results
demonstrate the effectiveness of our proposed method.
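As an illustration only: the abstract above does not spell out the DGM formulation, so the following PyTorch-style sketch assumes that the imbalance metric is a ratio of per-modality confidences on the video-level labels (computed from modality-separated logits, the role the MSDU plays) and that modulation is applied by scaling the gradients of the dominant modality's branch. All names (modality_confidence, modulation_coefficients, scale_gradients, alpha) are hypothetical and not taken from the paper.

    # Minimal sketch of imbalance-aware gradient modulation; the metric and the
    # tanh-based coefficients below are assumptions, not the paper's exact method.
    import math
    import torch
    import torch.nn as nn

    def modality_confidence(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Mean predicted probability of the ground-truth categories for one modality
        # (video-level, multi-label); a proxy for how well that modality is learning.
        probs = torch.sigmoid(logits)  # (batch, num_classes)
        return (probs * labels).sum() / labels.sum().clamp(min=1.0)

    def modulation_coefficients(conf_audio: torch.Tensor, conf_visual: torch.Tensor,
                                alpha: float = 1.0):
        # Down-weight the gradients of the currently dominant modality;
        # the weaker modality keeps its full gradients (coefficient 1.0).
        ratio_a = (conf_audio / (conf_visual + 1e-8)).item()
        ratio_v = (conf_visual / (conf_audio + 1e-8)).item()
        coeff_a = 1.0 if ratio_a <= 1.0 else 1.0 - math.tanh(alpha * (ratio_a - 1.0))
        coeff_v = 1.0 if ratio_v <= 1.0 else 1.0 - math.tanh(alpha * (ratio_v - 1.0))
        return coeff_a, coeff_v

    def scale_gradients(branch: nn.Module, coeff: float) -> None:
        # Scale the gradients already accumulated in one modality branch;
        # call after loss.backward() and before optimizer.step().
        for p in branch.parameters():
            if p.grad is not None:
                p.grad.mul_(coeff)

In a training loop, this sketch would sit between loss.backward() and optimizer.step(): compute conf_audio and conf_visual from detached modality-separated logits, derive the two coefficients, and call scale_gradients on the audio and visual branches respectively.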
Related papers
- Robust Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning [24.671771440617288]
We propose a new Robust Disentangled Counterfactual Learning (RDCL) approach for physical audiovisual commonsense reasoning.
The main challenge is how to imitate the reasoning ability of humans, even under the scenario of missing modalities.
Our proposed method is a plug-and-play module that can be incorporated into any baseline including VLMs.
arXiv Detail & Related papers (2025-02-18T01:49:45Z)
- Discrepancy-Aware Attention Network for Enhanced Audio-Visual Zero-Shot Learning [1.8175282137722093]
We propose a Discrepancy-Aware Attention Network (DAAN) for Enhanced Audio-Visual ZSL.
Our approach introduces a Quality-Discrepancy Mitigation Attention (QDMA) unit to minimize redundant information in the high-quality modality.
Experiments demonstrate that DAAN achieves state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2024-12-16T12:35:56Z)
- Learning to Unify Audio, Visual and Text for Audio-Enhanced Multilingual Visual Answer Localization [4.062872727927056]
The goal of Multilingual Visual Answer Localization (MVAL) is to locate a video segment that answers a given multilingual question.
Existing methods either focus solely on visual modality or integrate visual and subtitle modalities.
We propose a unified Audio-Visual-Textual Span localization (AVTSL) method that incorporates audio modality to augment both visual and textual representations.
arXiv Detail & Related papers (2024-11-05T06:49:14Z)
- On-the-fly Modulation for Balanced Multimodal Learning [53.616094855778954]
Multimodal learning is expected to boost model performance by integrating information from different modalities.
The widely-used joint training strategy leads to imbalanced and under-optimized uni-modal representations.
We propose On-the-fly Prediction Modulation (OPM) and On-the-fly Gradient Modulation (OGM) strategies to modulate the optimization of each modality.
arXiv Detail & Related papers (2024-10-15T13:15:50Z)
- Audio-visual cross-modality knowledge transfer for machine learning-based in-situ monitoring in laser additive manufacturing [2.592307869002029]
This paper introduces a cross-modality knowledge transfer (CMKT) methodology for LAM in-situ monitoring.
Three CMKT methods are proposed: semantic alignment, fully supervised mapping, and semi-supervised mapping.
In a case study for LAM in-situ defect detection, the proposed CMKT methods were compared with multimodal audio-visual fusion.
arXiv Detail & Related papers (2024-08-09T19:06:38Z)
- An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition [49.45660055499103]
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training.
Previous research has focused on aligning sequences' visual and semantic spatial distributions.
We introduce a new loss function sampling method to obtain a tight and robust representation.
arXiv Detail & Related papers (2024-06-02T06:53:01Z)
- A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition [53.800937914403654]
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames.
While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input.
We propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality.
arXiv Detail & Related papers (2024-03-07T06:06:55Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification.
We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned.
Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z)
- Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking [18.225204270240734]
We propose a novel Multi-modal Perception Tracker (MPT) for speaker tracking using both audio and visual modalities.
MPT achieves 98.6% and 78.3% tracking accuracy on the standard and occluded datasets, respectively.
arXiv Detail & Related papers (2021-12-14T14:14:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.