Cross-Domain First Person Audio-Visual Action Recognition through
Relative Norm Alignment
- URL: http://arxiv.org/abs/2106.01689v1
- Date: Thu, 3 Jun 2021 08:46:43 GMT
- Title: Cross-Domain First Person Audio-Visual Action Recognition through
Relative Norm Alignment
- Authors: Mirco Planamente, Chiara Plizzari, Emanuele Alberti, Barbara Caputo
- Abstract summary: First person action recognition is an increasingly researched topic because of the growing popularity of wearable cameras.
This is bringing to light cross-domain issues that are yet to be addressed in this context.
We propose to leverage the intrinsically complementary nature of audio-visual signals to learn a representation that performs well on data seen during training while also generalizing across domains.
- Score: 15.545769463854915
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: First person action recognition is an increasingly researched topic because
of the growing popularity of wearable cameras. This is bringing to light
cross-domain issues that are yet to be addressed in this context. Indeed, the
information extracted from learned representations suffers from an intrinsic
environmental bias. This strongly affects the ability to generalize to unseen
scenarios, limiting the application of current methods in real settings where
trimmed labeled data are not available during training. In this work, we
propose to leverage the intrinsically complementary nature of audio-visual
signals to learn a representation that works well on data seen during training,
while being able to generalize across different domains. To this end, we
introduce an audio-visual loss that aligns the contributions from the two
modalities by acting on the magnitude of their feature norm representations.
This new loss, plugged into a minimal multi-modal action recognition
architecture, leads to strong results in cross-domain first person action
recognition, as demonstrated by extensive experiments on the popular
EPIC-Kitchens dataset.
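To make the mechanism concrete, below is a minimal PyTorch sketch of a relative-norm-alignment style loss that balances the two modalities through the magnitude of their feature norms. The function name, the batch-mean formulation, and the auxiliary weighting shown in the comment are illustrative assumptions, not the authors' released code.

```python
import torch

def relative_norm_alignment_loss(audio_feat: torch.Tensor,
                                 visual_feat: torch.Tensor,
                                 eps: float = 1e-8) -> torch.Tensor:
    """Penalize imbalance between the mean L2 feature norms of two modalities.

    audio_feat, visual_feat: (batch, dim) feature tensors from the two streams.
    Returns a scalar loss that is zero when the mean norms match.
    """
    audio_norm = audio_feat.norm(p=2, dim=1).mean()
    visual_norm = visual_feat.norm(p=2, dim=1).mean()
    ratio = audio_norm / (visual_norm + eps)
    return (ratio - 1.0) ** 2

# Hypothetical usage as an auxiliary term next to the usual classification loss:
# loss = cross_entropy(logits, labels) + lambda_rna * relative_norm_alignment_loss(fa, fv)
```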
Related papers
- Benchmarking Cross-Domain Audio-Visual Deception Detection [45.342156006617394]
We present the first cross-domain audio-visual deception detection benchmark.
We compare single-to-single and multi-to-single domain generalization performance.
We propose an algorithm to enhance the generalization performance.
arXiv Detail & Related papers (2024-05-11T12:06:31Z)
- Disentangled Interaction Representation for One-Stage Human-Object
Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z)
- DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network, dubbed DOAD, to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z)
- Deepfake Detection via Joint Unsupervised Reconstruction and Supervised
Classification [25.84902508816679]
We introduce a novel approach for deepfake detection, which considers the reconstruction and classification tasks simultaneously.
This method shares the information learned by one task with the other, letting each focus on an aspect of the data that existing works rarely consider.
Our method achieves state-of-the-art performance on three commonly-used datasets.
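As a rough illustration of such a joint setup, the PyTorch sketch below pairs a shared encoder with a reconstruction decoder and a real/fake classification head; the architecture, layer sizes, and loss weighting are assumptions for illustration, not the paper's model.

```python
import torch
import torch.nn as nn

class JointReconClassifier(nn.Module):
    """Shared encoder feeding both a reconstruction decoder and a real/fake head."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(  # downsamples by 4x in each spatial dim
            nn.Conv2d(3, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(  # upsamples back to the input resolution
            nn.ConvTranspose2d(dim, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(dim, 3, 4, stride=2, padding=1),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(dim, 2),
        )

    def forward(self, x: torch.Tensor):
        z = self.encoder(x)
        return self.decoder(z), self.classifier(z)

def joint_loss(model: JointReconClassifier, images, labels, alpha: float = 0.5):
    # Unsupervised reconstruction plus supervised classification, as one objective.
    # Input spatial size is assumed divisible by 4 so shapes round-trip.
    recon, logits = model(images)
    return (nn.functional.mse_loss(recon, images)
            + alpha * nn.functional.cross_entropy(logits, labels))
```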
arXiv Detail & Related papers (2022-11-24T05:44:26Z)
- Cross-domain Voice Activity Detection with Self-Supervised
Representations [9.02236667251654]
Voice Activity Detection (VAD) aims at detecting speech segments in an audio signal.
Current state-of-the-art methods focus on training a neural network exploiting features directly contained in the acoustics.
We show that representations based on Self-Supervised Learning (SSL) can adapt well to different domains.
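A minimal sketch of this recipe, assuming a generic frozen SSL encoder that maps a waveform to frame-level features; the `ssl_encoder` interface and `feat_dim` are hypothetical placeholders rather than a specific library API.

```python
import torch
import torch.nn as nn

class SSLVad(nn.Module):
    """Frame-level speech/non-speech classifier on top of frozen SSL features.

    `ssl_encoder` is assumed to map a waveform of shape (batch, samples) to
    frame features of shape (batch, frames, feat_dim), wav2vec 2.0-style.
    """

    def __init__(self, ssl_encoder: nn.Module, feat_dim: int):
        super().__init__()
        self.ssl_encoder = ssl_encoder
        for p in self.ssl_encoder.parameters():
            p.requires_grad = False           # keep the SSL backbone frozen
        self.head = nn.Linear(feat_dim, 1)    # per-frame speech logit

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.ssl_encoder(waveform)  # (batch, frames, feat_dim)
        return self.head(feats).squeeze(-1)     # (batch, frames) logits
```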
arXiv Detail & Related papers (2022-09-22T14:53:44Z)
- Adaptive Local-Component-aware Graph Convolutional Network for One-shot
Skeleton-based Action Recognition [54.23513799338309]
We present an Adaptive Local-Component-aware Graph Convolutional Network for skeleton-based action recognition.
Our method provides a stronger representation than the global embedding and helps our model reach state-of-the-art performance.
arXiv Detail & Related papers (2022-09-21T02:33:07Z)
- Audio-Adaptive Activity Recognition Across Video Domains [112.46638682143065]
We leverage activity sounds for domain adaptation as they have less variance across domains and can reliably indicate which activities are not happening.
We propose an audio-adaptive encoder and associated learning methods that discriminatively adjust the visual feature representation.
We also introduce the new task of actor shift, with a corresponding audio-visual dataset, to challenge our method with situations where the activity appearance changes dramatically.
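One plausible shape for such audio-driven adjustment is FiLM-style modulation, where audio features produce a scale and shift applied to the visual representation. The sketch below is a generic illustration of that idea, not the paper's actual encoder.

```python
import torch
import torch.nn as nn

class AudioConditionedVisual(nn.Module):
    """Adjust visual features with a scale and shift predicted from audio."""

    def __init__(self, audio_dim: int, visual_dim: int):
        super().__init__()
        self.to_scale = nn.Linear(audio_dim, visual_dim)
        self.to_shift = nn.Linear(audio_dim, visual_dim)

    def forward(self, visual_feat: torch.Tensor, audio_feat: torch.Tensor):
        gamma = self.to_scale(audio_feat)   # (batch, visual_dim)
        beta = self.to_shift(audio_feat)    # (batch, visual_dim)
        return (1.0 + gamma) * visual_feat + beta
```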
arXiv Detail & Related papers (2022-03-27T08:15:20Z)
- Domain Generalization through Audio-Visual Relative Norm Alignment in
First Person Action Recognition [15.545769463854915]
First person action recognition is becoming an increasingly researched area thanks to the rising popularity of wearable cameras.
This is bringing to light cross-domain issues that are yet to be addressed in this context.
We introduce the first domain generalization approach for egocentric activity recognition, by proposing a new audio-visual loss.
arXiv Detail & Related papers (2021-10-19T16:52:39Z)
- MD-CSDNetwork: Multi-Domain Cross Stitched Network for Deepfake
Detection [80.83725644958633]
Current deepfake generation methods leave discriminative artifacts in the frequency spectrum of fake images and videos.
We present a novel approach, termed MD-CSDNetwork, that combines features from the spatial and frequency domains to mine a shared discriminative representation.
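Cross-stitch units, which the network's name points to, are a known mechanism for mixing two feature streams with a small learned matrix. Below is a minimal sketch for same-shape spatial and frequency features, illustrating the general mechanism rather than the paper's exact network.

```python
import torch
import torch.nn as nn

class CrossStitch(nn.Module):
    """Learned linear mixing of two same-shape feature streams
    (here: a spatial branch and a frequency branch)."""

    def __init__(self):
        super().__init__()
        # 2x2 mixing matrix, initialized near the identity so each
        # stream starts out mostly keeping its own features.
        self.alpha = nn.Parameter(torch.tensor([[0.9, 0.1],
                                                [0.1, 0.9]]))

    def forward(self, spatial_feat: torch.Tensor, freq_feat: torch.Tensor):
        mixed_spatial = self.alpha[0, 0] * spatial_feat + self.alpha[0, 1] * freq_feat
        mixed_freq = self.alpha[1, 0] * spatial_feat + self.alpha[1, 1] * freq_feat
        return mixed_spatial, mixed_freq
```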
arXiv Detail & Related papers (2021-09-15T14:11:53Z)
- An audiovisual and contextual approach for categorical and continuous
emotion recognition in-the-wild [27.943550651941166]
We tackle the task of video-based audio-visual emotion recognition in the context of the 2nd Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW).
Standard methodologies that rely solely on the extraction of facial features often fall short of accurate emotion prediction in cases where the aforementioned source of affective information is inaccessible due to head/body orientation, low resolution and poor illumination.
We aspire to alleviate this problem by leveraging bodily as well as contextual features, as part of a broader emotion recognition framework.
arXiv Detail & Related papers (2021-07-07T20:13:17Z)
- Phase Consistent Ecological Domain Adaptation [76.75730500201536]
We focus on the task of semantic segmentation, where annotated synthetic data are plentiful but annotating real data is laborious.
The first criterion, inspired by visual psychophysics, is that the map between the two image domains be phase-preserving.
The second criterion aims to leverage ecological statistics, or regularities in the scene which are manifest in any image of it, regardless of the characteristics of the illuminant or the imaging sensor.
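A phase-preserving criterion of this kind can be sketched as a Fourier-phase consistency penalty between an image and its domain-translated counterpart; the formulation below is an assumption for illustration, not the paper's exact loss.

```python
import torch

def phase_consistency_loss(src: torch.Tensor, translated: torch.Tensor,
                           eps: float = 1e-8) -> torch.Tensor:
    """Penalize changes in Fourier phase between an image and its translation.

    src, translated: (batch, channels, H, W) tensors of the same shape.
    """
    src_fft = torch.fft.fft2(src)
    out_fft = torch.fft.fft2(translated)
    # Normalize to unit magnitude so only the phase is compared.
    src_phase = src_fft / (src_fft.abs() + eps)
    out_phase = out_fft / (out_fft.abs() + eps)
    return (src_phase - out_phase).abs().pow(2).mean()
```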
arXiv Detail & Related papers (2020-04-10T06:58:03Z)