Audio-Adaptive Activity Recognition Across Video Domains
- URL: http://arxiv.org/abs/2203.14240v2
- Date: Tue, 29 Mar 2022 07:03:19 GMT
- Title: Audio-Adaptive Activity Recognition Across Video Domains
- Authors: Yunhua Zhang, Hazel Doughty, Ling Shao, Cees G. M. Snoek
- Abstract summary: We leverage activity sounds for domain adaptation as they have less variance across domains and can reliably indicate which activities are not happening.
We propose an audio-adaptive encoder and associated learning methods that discriminatively adjust the visual feature representation.
We also introduce the new task of actor shift, with a corresponding audio-visual dataset, to challenge our method with situations where the activity appearance changes dramatically.
- Score: 112.46638682143065
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper strives for activity recognition under domain shift, for example caused by a change of scenery or camera viewpoint. The leading approaches reduce the shift in activity appearance by adversarial training and self-supervised learning. Different from these vision-focused works, we leverage activity sounds for domain adaptation, as they have less variance across domains and can reliably indicate which activities are not happening. We propose an audio-adaptive encoder and associated learning methods that discriminatively adjust the visual feature representation as well as address shifts in the semantic distribution. To further eliminate domain-specific features and include domain-invariant activity sounds for recognition, an audio-infused recognizer is proposed, which effectively models the cross-modal interaction across domains. We also introduce the new task of actor shift, with a corresponding audio-visual dataset, to challenge our method with situations where the activity appearance changes dramatically. Experiments on this dataset, EPIC-Kitchens, and CharadesEgo show the effectiveness of our approach.
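The abstract names two components: an audio-adaptive encoder that adjusts visual features using sound, and an audio-infused recognizer for cross-modal interaction. As a rough illustration of the first idea only, the sketch below conditions visual features on an audio embedding FiLM-style; the module names, dimensions, and the FiLM scheme itself are assumptions for illustration, not the authors' architecture.

```python
# Minimal sketch, assuming FiLM-style conditioning: the audio embedding
# predicts a per-channel scale and shift applied to the visual features.
import torch
import torch.nn as nn

class AudioAdaptiveEncoder(nn.Module):
    def __init__(self, vis_dim=512, aud_dim=128, num_classes=10):
        super().__init__()
        self.film = nn.Linear(aud_dim, 2 * vis_dim)  # -> (gamma, beta)
        self.classifier = nn.Linear(vis_dim, num_classes)

    def forward(self, vis_feat, aud_feat):
        # vis_feat: (B, vis_dim) clip-level visual features
        # aud_feat: (B, aud_dim) clip-level audio features
        gamma, beta = self.film(aud_feat).chunk(2, dim=-1)
        adapted = (1 + gamma) * vis_feat + beta  # audio-adjusted visual features
        return self.classifier(adapted)

model = AudioAdaptiveEncoder()
logits = model(torch.randn(4, 512), torch.randn(4, 128))  # -> (4, 10)
```

Because sound varies less across domains than appearance, letting the audio branch rescale visual channels gives the classifier a signal that survives the shift; the paper's actual mechanism is more involved than this sketch.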
Related papers
- Integrating Audio Narrations to Strengthen Domain Generalization in Multimodal First-Person Action Recognition [28.49695567630899]
First-person activity recognition is rapidly growing due to the widespread use of wearable cameras.
We propose a framework that improves domain generalization by integrating motion, audio, and appearance features.
Our approach achieves state-of-the-art performance on the ARGO1M dataset, effectively generalizing across unseen scenarios and locations.
arXiv Detail & Related papers (2024-09-15T04:43:00Z)
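A minimal sketch of the kind of three-stream fusion this summary describes, combining motion, audio, and appearance embeddings before classification. The late-fusion scheme, feature dimensions, and class count are illustrative assumptions, not the paper's design.

```python
# Hypothetical late-fusion head over motion, audio, and appearance features.
import torch
import torch.nn as nn

class MultimodalFusionHead(nn.Module):
    def __init__(self, dims=(256, 128, 512), num_classes=60):
        super().__init__()
        # One projection per modality into a shared 256-d space (assumed sizes).
        self.proj = nn.ModuleList([nn.Linear(d, 256) for d in dims])
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, motion, audio, appearance):
        # Sum the projected modalities, then classify the fused feature.
        fused = sum(p(x) for p, x in zip(self.proj, (motion, audio, appearance)))
        return self.classifier(torch.relu(fused))

head = MultimodalFusionHead()
logits = head(torch.randn(4, 256), torch.randn(4, 128), torch.randn(4, 512))
```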
- CACE-Net: Co-guidance Attention and Contrastive Enhancement for Effective Audio-Visual Event Localization [11.525177542345215]
We introduce CACE-Net, which differs from most existing methods that solely use audio signals to guide visual information.
We propose an audio-visual co-guidance attention mechanism that allows for adaptive bi-directional cross-modal attentional guidance.
Experiments on the AVE dataset demonstrate that CACE-Net sets a new benchmark in the audio-visual event localization task.
arXiv Detail & Related papers (2024-08-04T07:48:12Z)
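The co-guidance idea above can be sketched with two standard multi-head attention blocks, one per direction, plus a learned gate that balances them. This is a generic bi-directional cross-modal attention layer under assumed sizes, not CACE-Net's exact module.

```python
# Generic bi-directional cross-modal attention (illustrative assumption).
import torch
import torch.nn as nn

class BiCrossModalAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.tensor(0.0))  # learned balance between directions

    def forward(self, audio, video):
        # audio: (B, Ta, dim) segment features; video: (B, Tv, dim)
        v_guided, _ = self.a2v(video, audio, audio)  # video queries attend to audio
        a_guided, _ = self.v2a(audio, video, video)  # audio queries attend to video
        g = torch.sigmoid(self.gate)
        # Residual updates, weighted so neither guidance direction dominates.
        return audio + g * a_guided, video + (1 - g) * v_guided

attn = BiCrossModalAttention()
a_out, v_out = attn(torch.randn(2, 10, 256), torch.randn(2, 16, 256))
```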
- Label-anticipated Event Disentanglement for Audio-Visual Video Parsing [61.08434062821899]
We introduce a new decoding paradigm, label semantic-based projection (LEAP).
LEAP works by iteratively projecting encoded latent features of audio/visual segments onto semantically independent label embeddings.
To facilitate the LEAP paradigm, we propose a semantic-aware optimization strategy, which includes a novel audio-visual semantic similarity loss function.
arXiv Detail & Related papers (2024-07-11T01:57:08Z)
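The projection step LEAP's summary describes can be sketched in a few lines: segment features are softly projected onto label embeddings via normalized similarities. All shapes and variable names here are hypothetical.

```python
# Sketch: project audio/visual segment features onto label embeddings.
import torch
import torch.nn.functional as F

seg_feats = torch.randn(8, 256)   # encoded latent features of 8 segments (assumed)
label_emb = torch.randn(25, 256)  # 25 semantically independent label embeddings (assumed)

# Cosine similarity of each segment to each label embedding.
sim = F.normalize(seg_feats, dim=-1) @ F.normalize(label_emb, dim=-1).T  # (8, 25)
weights = sim.softmax(dim=-1)
projected = weights @ label_emb   # (8, 256) label-aligned segment representations
```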
- CM-PIE: Cross-modal perception for interactive-enhanced audio-visual video parsing [23.85763377992709]
We propose a novel interactive-enhanced cross-modal perception method (CM-PIE), which can learn fine-grained features by applying a segment-based attention module.
We show that our model offers improved parsing performance on the Look, Listen, and Parse dataset.
arXiv Detail & Related papers (2023-10-11T14:15:25Z)
- Cross-domain Voice Activity Detection with Self-Supervised Representations [9.02236667251654]
Voice Activity Detection (VAD) aims at detecting speech segments in an audio signal.
Current state-of-the-art methods focus on training a neural network exploiting features directly contained in the acoustics.
We show that representations based on Self-Supervised Learning (SSL) can adapt well to different domains.
arXiv Detail & Related papers (2022-09-22T14:53:44Z)
- Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos [78.34818195786846]
We introduce the task of spatially localizing narrated interactions in videos.
Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations.
We propose a multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training.
arXiv Detail & Related papers (2021-10-20T14:45:13Z)
- Learning Cross-modal Contrastive Features for Video Domain Adaptation [138.75196499580804]
We propose a unified framework for video domain adaptation, which simultaneously regularizes cross-modal and cross-domain feature representations.
Specifically, we treat each modality in a domain as a view and leverage the contrastive learning technique with properly designed sampling strategies.
arXiv Detail & Related papers (2021-08-26T18:14:18Z)
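Treating each modality as a view of the same clip reduces, in its simplest form, to an InfoNCE loss in which paired modalities are positives and other clips in the batch are negatives. The sketch below shows that generic objective; the paper's specially designed sampling strategies are not reproduced here.

```python
# Generic cross-modal InfoNCE over in-batch negatives (illustrative).
import torch
import torch.nn.functional as F

def cross_modal_info_nce(feats_a, feats_b, temperature=0.07):
    # feats_a, feats_b: (B, D) features of the same clips from two modalities,
    # e.g. RGB and flow/audio streams.
    a = F.normalize(feats_a, dim=-1)
    b = F.normalize(feats_b, dim=-1)
    logits = a @ b.T / temperature     # (B, B) similarity matrix
    targets = torch.arange(a.size(0))  # diagonal pairs are the positives
    return F.cross_entropy(logits, targets)

loss = cross_modal_info_nce(torch.randn(16, 128), torch.randn(16, 128))
```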
- AFAN: Augmented Feature Alignment Network for Cross-Domain Object Detection [90.18752912204778]
Unsupervised domain adaptation for object detection is a challenging problem with many real-world applications.
We propose a novel augmented feature alignment network (AFAN) which integrates intermediate domain image generation and domain-adversarial training.
Our approach significantly outperforms the state-of-the-art methods on standard benchmarks for both similar and dissimilar domain adaptations.
arXiv Detail & Related papers (2021-06-10T05:01:20Z)
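Domain-adversarial training, one of AFAN's two ingredients, is commonly implemented with a gradient reversal layer: the domain head learns to tell domains apart while the reversed gradient pushes the backbone toward domain-invariant features. The sketch shows that generic mechanism; the head, dimensions, and scale are assumptions, and AFAN's intermediate-domain image generation is omitted.

```python
# Generic gradient reversal layer for domain-adversarial training.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; negated, scaled gradient in the backward pass.
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

domain_head = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 2))
feats = torch.randn(8, 256, requires_grad=True)  # stand-in for backbone features
domain_logits = domain_head(GradReverse.apply(feats, 1.0))
# Minimizing this loss trains the head to separate domains while the reversed
# gradient encourages the feature extractor to confuse them.
loss = nn.functional.cross_entropy(domain_logits, torch.randint(0, 2, (8,)))
loss.backward()
```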
- Domain and View-point Agnostic Hand Action Recognition [6.432798111887824]
We introduce a novel skeleton-based hand motion representation model that tackles this problem.
We demonstrate the performance of our proposed motion representation model both for a single specific domain (intra-domain action classification) and for different unseen domains (cross-domain action classification).
Our approach achieves comparable results to the state-of-the-art methods that are trained intra-domain.
arXiv Detail & Related papers (2021-03-03T10:32:36Z)
- Off-Dynamics Reinforcement Learning: Training for Transfer with Domain Classifiers [138.68213707587822]
We propose a simple, practical, and intuitive approach for domain adaptation in reinforcement learning.
We show that we can achieve this goal by compensating for the difference in dynamics by modifying the reward function.
Our approach is applicable to domains with continuous states and actions and does not require learning an explicit model of the dynamics.
arXiv Detail & Related papers (2020-06-24T17:47:37Z)
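The reward modification above is estimated with a pair of binary domain classifiers, one over (s, a, s') and one over (s, a); the difference of their log-odds approximates the dynamics gap between domains. The sketch assumes the classifier logits are already computed and that index 1 denotes the target domain.

```python
# Sketch of a classifier-based reward correction for off-dynamics RL.
import torch

def dynamics_reward_correction(sas_logits, sa_logits):
    # sas_logits: (B, 2) domain logits of a classifier over (s, a, s')
    # sa_logits:  (B, 2) domain logits of a classifier over (s, a)
    # delta_r = log p(target|s,a,s') - log p(source|s,a,s')
    #         - log p(target|s,a)   + log p(source|s,a)
    log_p_sas = sas_logits.log_softmax(dim=-1)
    log_p_sa = sa_logits.log_softmax(dim=-1)
    return (log_p_sas[:, 1] - log_p_sas[:, 0]) - (log_p_sa[:, 1] - log_p_sa[:, 0])

delta_r = dynamics_reward_correction(torch.randn(32, 2), torch.randn(32, 2))
# Training in the source domain on r + delta_r compensates for the dynamics gap.
```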
This list is automatically generated from the titles and abstracts of the papers on this site.