Integrating Audio Narrations to Strengthen Domain Generalization in Multimodal First-Person Action Recognition
- URL: http://arxiv.org/abs/2409.09611v1
- Date: Sun, 15 Sep 2024 04:43:00 GMT
- Title: Integrating Audio Narrations to Strengthen Domain Generalization in Multimodal First-Person Action Recognition
- Authors: Cagri Gungor, Adriana Kovashka
- Abstract summary: First-person activity recognition is rapidly growing due to the widespread use of wearable cameras.
We propose a framework that improves domain generalization by integrating motion, audio, and appearance features.
Our approach achieves state-of-the-art performance on the ARGO1M dataset, effectively generalizing across unseen scenarios and locations.
- Score: 28.49695567630899
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: First-person activity recognition is rapidly growing due to the widespread use of wearable cameras but faces challenges from domain shifts across different environments, such as varying objects or background scenes. We propose a multimodal framework that improves domain generalization by integrating motion, audio, and appearance features. Key contributions include analyzing the resilience of audio and motion features to domain shifts, using audio narrations for enhanced audio-text alignment, and applying consistency ratings between audio and visual narrations to optimize the impact of audio in recognition during training. Our approach achieves state-of-the-art performance on the ARGO1M dataset, effectively generalizing across unseen scenarios and locations.
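The training idea described in the abstract (scaling the audio stream's contribution by a per-clip consistency rating between audio and visual narrations) can be illustrated with a minimal PyTorch sketch. All module names, feature dimensions, and the simple late-fusion head below are hypothetical and simplified; this is not the authors' implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class MultimodalActionClassifier(nn.Module):
    """Hypothetical late-fusion classifier over appearance, motion, and audio features."""

    def __init__(self, feat_dim: int = 512, num_classes: int = 60):
        super().__init__()
        self.appearance_head = nn.Linear(feat_dim, num_classes)
        self.motion_head = nn.Linear(feat_dim, num_classes)
        self.audio_head = nn.Linear(feat_dim, num_classes)

    def forward(self, appearance, motion, audio):
        # Per-modality logits; the fused prediction simply averages the three streams.
        logits_app = self.appearance_head(appearance)
        logits_mot = self.motion_head(motion)
        logits_aud = self.audio_head(audio)
        fused = (logits_app + logits_mot + logits_aud) / 3.0
        return fused, logits_aud

def training_loss(model, appearance, motion, audio, labels, consistency):
    """Cross-entropy on the fused prediction plus an audio-only term whose
    per-sample weight is the audio/visual narration consistency rating in [0, 1]."""
    fused, logits_aud = model(appearance, motion, audio)
    loss_fused = F.cross_entropy(fused, labels)
    loss_audio = F.cross_entropy(logits_aud, labels, reduction="none")
    loss_audio = (consistency * loss_audio).mean()  # low-consistency clips contribute less
    return loss_fused + loss_audio
```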
Related papers
- Cooperative Dual Attention for Audio-Visual Speech Enhancement with Facial Cues [80.53407593586411]
We focus on leveraging facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE).
We propose a Dual Attention Cooperative Framework, DualAVSE, to ignore speech-unrelated information, capture speech-related information with facial cues, and dynamically integrate it with the audio signal for AVSE.
arXiv Detail & Related papers (2023-11-24T04:30:31Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
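A rough sketch of an audio-guided cross-modal attention block of the kind the CMFE summary above describes, with audio frames querying visual (lip) frames. Module names, dimensions, and the residual/feed-forward layout are assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class AudioGuidedFusionLayer(nn.Module):
    """Illustrative audio-guided cross-modal attention block (not the paper's CMFE)."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, audio_feats, visual_feats):
        # Audio frames act as queries; visual (lip) frames provide keys and values.
        attended, _ = self.cross_attn(audio_feats, visual_feats, visual_feats)
        x = self.norm1(audio_feats + attended)
        return self.norm2(x + self.ffn(x))

# Example: 100 audio frames attending to 25 video frames of the same utterance.
fused = AudioGuidedFusionLayer()(torch.randn(1, 100, 256), torch.randn(1, 25, 256))
```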
- Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition [21.477900473255264]
We propose a cross-modal global interaction and local alignment (GILA) approach for audio-visual speech recognition (AVSR).
Specifically, we design a global interaction model to capture the A-V complementary relationship on modality level, as well as a local alignment approach to model the A-V temporal consistency on frame level.
Our GILA outperforms the supervised learning state-of-the-art on public benchmarks LRS3 and LRS2.
arXiv Detail & Related papers (2023-05-16T06:41:25Z)
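The GILA summary above distinguishes a modality-level global interaction from a frame-level local alignment. Below is a minimal sketch of that split, assuming mean-pooled global summaries and a cosine-based frame alignment loss; both are illustrative stand-ins rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def local_alignment_loss(audio_seq, visual_seq):
    """Frame-level A-V consistency: pull temporally corresponding frames together.
    Assumes both sequences have already been resampled to the same length T."""
    a = F.normalize(audio_seq, dim=-1)
    v = F.normalize(visual_seq, dim=-1)
    return (1.0 - (a * v).sum(dim=-1)).mean()  # 1 - cosine similarity per frame

class GlobalInteraction(nn.Module):
    """Modality-level interaction: each stream is conditioned on a pooled
    summary of the other modality (a simple stand-in for a global A-V model)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.audio_gate = nn.Linear(2 * dim, dim)
        self.visual_gate = nn.Linear(2 * dim, dim)

    def forward(self, audio_seq, visual_seq):  # both (B, T, D)
        a_global = audio_seq.mean(dim=1, keepdim=True)   # (B, 1, D)
        v_global = visual_seq.mean(dim=1, keepdim=True)
        audio_out = audio_seq + torch.tanh(
            self.audio_gate(torch.cat([audio_seq, v_global.expand_as(audio_seq)], dim=-1)))
        visual_out = visual_seq + torch.tanh(
            self.visual_gate(torch.cat([visual_seq, a_global.expand_as(visual_seq)], dim=-1)))
        return audio_out, visual_out
```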
- Accommodating Audio Modality in CLIP for Multimodal Processing [48.83906067348211]
We extend the Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing.
Specifically, we apply inter-modal and intra-modal contrastive learning to explore the correlation between audio and other modalities.
Our proposed CLIP4VLA model is validated in different downstream tasks including video retrieval and video captioning.
arXiv Detail & Related papers (2023-03-12T06:57:01Z)
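The inter-modal contrastive learning mentioned for CLIP4VLA is commonly instantiated as a symmetric InfoNCE objective over a batch of paired embeddings. A minimal sketch follows, assuming paired audio and text embeddings of equal dimensionality; the exact losses and pairings used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def info_nce(audio_emb, text_emb, temperature: float = 0.07):
    """Symmetric contrastive loss between paired audio and text embeddings;
    positives sit on the diagonal of the batch similarity matrix."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                       # (B, B) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Example with random embeddings standing in for CLIP-style encoder outputs.
loss = info_nce(torch.randn(8, 512), torch.randn(8, 512))
```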
- AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization [14.103742565510387]
We introduce AVE-CLIP, a novel framework that integrates the AudioCLIP pre-trained on large-scale audio-visual data with a multi-window temporal transformer.
Our method achieves state-of-the-art performance on the publicly available AVE dataset with 5.9% mean accuracy improvement.
arXiv Detail & Related papers (2022-10-11T00:15:45Z)
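A multi-window temporal transformer of the kind the AVE-CLIP summary above mentions can be approximated by encoding the same sequence within several temporal window lengths and fusing the outputs. The sketch below makes that concrete under assumed window sizes and dimensions; it is not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiWindowTemporalEncoder(nn.Module):
    """Illustrative multi-window temporal modeling: the sequence is encoded
    inside several window lengths and the per-window outputs are fused."""

    def __init__(self, dim: int = 256, window_sizes=(2, 4, 8), num_heads: int = 4):
        super().__init__()
        self.window_sizes = window_sizes
        self.encoders = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
            for _ in window_sizes
        ])
        self.fuse = nn.Linear(dim * len(window_sizes), dim)

    def forward(self, x):  # x: (batch, time, dim) fused audio-visual features
        b, t, d = x.shape
        outs = []
        for w, enc in zip(self.window_sizes, self.encoders):
            pad = (-t) % w
            xp = F.pad(x, (0, 0, 0, pad))                      # pad time axis to a multiple of w
            chunks = xp.reshape(b * (xp.shape[1] // w), w, d)  # non-overlapping windows of length w
            y = enc(chunks).reshape(b, xp.shape[1], d)[:, :t]  # encode per window, restore layout
            outs.append(y)
        return self.fuse(torch.cat(outs, dim=-1))

# Example: 10-frame clip features.
out = MultiWindowTemporalEncoder()(torch.randn(2, 10, 256))  # (2, 10, 256)
```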
- Audio-Adaptive Activity Recognition Across Video Domains [112.46638682143065]
We leverage activity sounds for domain adaptation as they have less variance across domains and can reliably indicate which activities are not happening.
We propose an audio-adaptive encoder and associated learning methods that discriminatively adjust the visual feature representation.
We also introduce the new task of actor shift, with a corresponding audio-visual dataset, to challenge our method with situations where the activity appearance changes dramatically.
arXiv Detail & Related papers (2022-03-27T08:15:20Z)
- Space-Time Memory Network for Sounding Object Localization in Videos [40.45443192327351]
We propose a space-time memory network for sounding object localization in videos.
It can simultaneously learn spatio-temporal attention over both uni-modal and cross-modal representations.
arXiv Detail & Related papers (2021-11-10T04:40:12Z)
- Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face videos.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
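The dynamic stream weights described in the entry above can be read as a learned, per-region convex combination of audio and video posteriors. A minimal sketch under assumed region counts and reliability features follows; it is illustrative only, not the paper's model.

```python
import torch
import torch.nn as nn

class DynamicStreamWeightFusion(nn.Module):
    """Illustrative dynamic stream weighting: learned per-region weights decide
    how much the audio vs. the video posterior contributes to the fused estimate."""

    def __init__(self, feat_dim: int = 128, num_regions: int = 16):
        super().__init__()
        # Predict one audio weight in [0, 1] per spatial region from a reliability feature.
        self.weight_net = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, num_regions), nn.Sigmoid())

    def forward(self, audio_post, video_post, reliability_feats):
        # audio_post, video_post: (B, num_regions) per-region speaker posteriors.
        w = self.weight_net(reliability_feats)           # (B, num_regions) audio weights
        return w * audio_post + (1.0 - w) * video_post   # convex combination per region

# Example fusion of random posteriors for a batch of 4 frames.
fusion = DynamicStreamWeightFusion()
fused = fusion(torch.rand(4, 16), torch.rand(4, 16), torch.randn(4, 128))
```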