OWL (Observe, Watch, Listen): Localizing Actions in Egocentric Video via
Audiovisual Temporal Context
- URL: http://arxiv.org/abs/2202.04947v2
- Date: Mon, 14 Feb 2022 15:30:49 GMT
- Title: OWL (Observe, Watch, Listen): Localizing Actions in Egocentric Video via
Audiovisual Temporal Context
- Authors: Merey Ramazanova, Victor Escorcia, Fabian Caba Heilbron, Chen Zhao,
Bernard Ghanem
- Abstract summary: We take a deep look into the effectiveness of audio in detecting actions in egocentric videos.
We propose a transformer-based model to incorporate temporal audio-visual context.
Our approach achieves state-of-the-art performance on EPIC-KITCHENS-100.
- Score: 58.932717614439916
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal action localization (TAL) is an important task extensively explored
and improved for third-person videos in recent years. Recent efforts have been
made to perform fine-grained temporal localization on first-person videos.
However, current TAL methods only use visual signals, neglecting the audio
modality that exists in most videos and that shows meaningful action
information in egocentric videos. In this work, we take a deep look into the
effectiveness of audio in detecting actions in egocentric videos and introduce
a simple-yet-effective approach via Observing, Watching, and Listening (OWL) to
leverage audio-visual information and context for egocentric TAL. To do so,
we: 1) compare and study different strategies for where and how to fuse
the two modalities; 2) propose a transformer-based model to incorporate
temporal audio-visual context. Our experiments show that our approach achieves
state-of-the-art performance on EPIC-KITCHENS-100.
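The abstract names two ingredients: fusing the audio and visual streams (and studying where/how to fuse them), and a transformer that aggregates temporal audio-visual context. Below is a minimal, hypothetical PyTorch sketch of that general recipe; the class name, feature dimensions, class count, and the choice of additive late fusion are illustrative assumptions, not the paper's published OWL architecture.

```python
# Hypothetical sketch (not the authors' released code): fuse per-snippet audio
# and visual features, then let a transformer encoder aggregate temporal
# audio-visual context before predicting per-snippet action scores.
import torch
import torch.nn as nn


class AudioVisualContextModel(nn.Module):
    def __init__(self, vis_dim=2048, aud_dim=512, d_model=256, num_classes=97,
                 num_layers=2, num_heads=4):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.aud_proj = nn.Linear(aud_dim, d_model)
        # "Where/how to fuse": here, simple late fusion by summing projections.
        # Early fusion would instead concatenate raw features before projecting.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, batch_first=True)
        self.temporal_context = nn.TransformerEncoder(encoder_layer, num_layers)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, vis_feats, aud_feats):
        # vis_feats: (batch, time, vis_dim); aud_feats: (batch, time, aud_dim)
        fused = self.vis_proj(vis_feats) + self.aud_proj(aud_feats)
        # Self-attention over the snippet sequence injects temporal context.
        contextualized = self.temporal_context(fused)
        # Per-snippet class scores; a TAL pipeline would group these into segments.
        return self.classifier(contextualized)


if __name__ == "__main__":
    model = AudioVisualContextModel()
    vis = torch.randn(2, 32, 2048)   # 32 snippets of visual features
    aud = torch.randn(2, 32, 512)    # matching audio features
    print(model(vis, aud).shape)     # torch.Size([2, 32, 97])
```

Swapping the additive fusion for concatenation, or moving the fusion before/after the temporal encoder, gives the kind of alternative fusion points whose comparison the abstract's first contribution refers to.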
Related papers
- Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos [87.32349247938136]
Existing approaches implicitly assume total correspondence between the video and audio during training.
We propose a novel ambient-aware audio generation model, AV-LDM.
Our approach is the first to focus video-to-audio generation faithfully on the observed visual content.
arXiv Detail & Related papers (2024-06-13T16:10:19Z)
- SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos [77.55518265996312]
We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos.
Our multimodal contrastive-consensus coding (MC3) embedding reinforces the associations between audio, language, and vision when all modality pairs agree.
arXiv Detail & Related papers (2024-04-08T05:19:28Z)
- Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos [69.79632907349489]
We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos.
Our method uses a masked auto-encoding framework to synthesize masked (multi-channel) audio through the synergy of audio and vision.
arXiv Detail & Related papers (2023-07-10T17:58:17Z)
- Egocentric Audio-Visual Object Localization [51.434212424829525]
We propose a geometry-aware temporal aggregation module to handle the egomotion explicitly.
The effect of egomotion is mitigated by estimating the temporal geometry transformation and exploiting it to update visual representations.
It improves cross-modal localization robustness by disentangling visually-indicated audio representation.
arXiv Detail & Related papers (2023-03-23T17:43:11Z)
- AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches exploit only the visual information while neglecting the audio information.
We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
arXiv Detail & Related papers (2021-05-17T08:36:10Z)
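Several of the related works above learn audio-visual associations self-supervised, by encouraging temporally aligned audio and video to agree. The snippet below is a generic symmetric InfoNCE sketch of that pairing idea, included only as an illustrative assumption; it is not the MC3 objective or any of the above papers' exact losses.

```python
# Generic, hypothetical audio-visual contrastive objective: embeddings of
# temporally aligned audio/video clips are pulled together, all other pairings
# in the batch are pushed apart (symmetric InfoNCE).
import torch
import torch.nn.functional as F


def audio_visual_infonce(vid_emb, aud_emb, temperature=0.07):
    # vid_emb, aud_emb: (batch, dim) embeddings of temporally aligned clips.
    vid = F.normalize(vid_emb, dim=-1)
    aud = F.normalize(aud_emb, dim=-1)
    logits = vid @ aud.t() / temperature       # pairwise similarities
    targets = torch.arange(vid.size(0))        # i-th video matches i-th audio
    # Symmetric cross-entropy: video-to-audio and audio-to-video retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    v, a = torch.randn(8, 256), torch.randn(8, 256)
    print(audio_visual_infonce(v, a).item())
```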