Related papers: Dynamic Inter-Class Confusion-Aware Encoder for Audio-Visual Fusion in Human Activity Recognition

Dynamic Inter-Class Confusion-Aware Encoder for Audio-Visual Fusion in Human Activity Recognition

URL: http://arxiv.org/abs/2507.09323v1
Date: Sat, 12 Jul 2025 15:45:36 GMT
Title: Dynamic Inter-Class Confusion-Aware Encoder for Audio-Visual Fusion in Human Activity Recognition
Authors: Kaixuan Cong, Yifan Wang, Rongkun Xue, Yuyang Jiang, Yiming Feng, Jing Yang,
Abstract summary: This paper proposes the Dynamic Inter-Class Confusion-Aware (DICCAE), an encoder that aligns audio-video representations at a fine-grained, category-level.<n>DICCAE addresses category confusion by dynamically adjusting the confusion loss based on inter-class confusion degrees.<n>We also introduce a novel training framework that incorporates both audio and video modalities, as well as their fusion.
Score: 6.814894552541548
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Humans do not understand individual events in isolation; rather, they generalize concepts within classes and compare them to others. Existing audio-video pre-training paradigms only focus on the alignment of the overall audio-video modalities, without considering the reinforcement of distinguishing easily confused classes through cognitive induction and contrast during training. This paper proposes the Dynamic Inter-Class Confusion-Aware Encoder (DICCAE), an encoder that aligns audio-video representations at a fine-grained, category-level. DICCAE addresses category confusion by dynamically adjusting the confusion loss based on inter-class confusion degrees, thereby enhancing the model's ability to distinguish between similar activities. To further extend the application of DICCAE, we also introduce a novel training framework that incorporates both audio and video modalities, as well as their fusion. To mitigate the scarcity of audio-video data in the human activity recognition task, we propose a cluster-guided audio-video self-supervised pre-training strategy for DICCAE. DICCAE achieves near state-of-the-art performance on the VGGSound dataset, with a top-1 accuracy of 65.5%. We further evaluate its feature representation quality through extensive ablation studies, validating the necessity of each module.

Related papers

Implicit Counterfactual Learning for Audio-Visual Segmentation [50.69377287012591]
We propose the implicit counterfactual framework (ICF) to achieve unbiased cross-modal understanding.<n>Due to the lack of semantics, heterogeneous representations may lead to erroneous matches.<n>We introduce the multi-granularity implicit text (MIT) involving video-, segment- and frame-level as the bridge to establish the modality-shared space.
arXiv Detail & Related papers (2025-07-28T11:46:35Z)
AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection.<n>Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z)
LINK: Adaptive Modality Interaction for Audio-Visual Video Parsing [26.2873961811614]
We introduce a Learning Interaction method for Non-aligned Knowledge (Link)<n>Link equilibrate the contributions of distinct modalities by dynamically adjusting their input during event prediction.<n>We leverage the semantic information of pseudo-labels as a priori knowledge to mitigate noise from other modalities.
arXiv Detail & Related papers (2024-12-30T11:23:15Z)
Label-anticipated Event Disentanglement for Audio-Visual Video Parsing [61.08434062821899]
We introduce a new decoding paradigm, underlinelabel sunderlineemunderlineantic-based underlineprojection (LEAP) LEAP works by iteratively projecting encoded latent features of audio/visual segments onto semantically independent label embeddings. To facilitate the LEAP paradigm, we propose a semantic-aware optimization strategy, which includes a novel audio-visual semantic similarity loss function.
arXiv Detail & Related papers (2024-07-11T01:57:08Z)
Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges. This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations. We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z)
CM-PIE: Cross-modal perception for interactive-enhanced audio-visual video parsing [23.85763377992709]
We propose a novel interactive-enhanced cross-modal perception method(CM-PIE), which can learn fine-grained features by applying a segment-based attention module. We show that our model offers improved parsing performance on the Look, Listen, and Parse dataset.
arXiv Detail & Related papers (2023-10-11T14:15:25Z)
Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer [22.846623384472377]
We introduce the encoder-prompt-decoder paradigm to decode localization from the fused audio-visual feature. Specifically, we first propose to construct Semantic-aware Audio Prompt (SAP) to help the visual foundation model focus on sounding objects. We develop a Correlation Adapter (ColA) to keep minimal training efforts as well as maintain adequate knowledge of the visual foundation model.
arXiv Detail & Related papers (2023-09-13T05:43:35Z)
Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework. First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes. Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
Self-supervised Contrastive Learning for Audio-Visual Action Recognition [7.188231323934023]
The underlying correlation between audio and visual modalities can be utilized to learn supervised information for unlabeled videos. We propose an end-to-end self-supervised framework named Audio-Visual Contrastive Learning (A), to learn discriminative audio-visual representations for action recognition.
arXiv Detail & Related papers (2022-04-28T10:01:36Z)
Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform. We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio) Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.