Learning State-Aware Visual Representations from Audible Interactions
- URL: http://arxiv.org/abs/2209.13583v1
- Date: Tue, 27 Sep 2022 17:57:13 GMT
- Title: Learning State-Aware Visual Representations from Audible Interactions
- Authors: Himangi Mittal, Pedro Morgado, Unnat Jain, Abhinav Gupta
- Abstract summary: We propose a self-supervised algorithm to learn representations from egocentric video data.
We use audio signals to identify moments of likely interactions which are conducive to better learning.
We validate these contributions extensively on two large-scale egocentric datasets.
- Score: 39.08554113807464
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a self-supervised algorithm to learn representations from
egocentric video data. Recently, significant efforts have been made to capture
humans interacting with their own environments as they go about their daily
activities. In result, several large egocentric datasets of interaction-rich
multi-modal data have emerged. However, learning representations from videos
can be challenging. First, given the uncurated nature of long-form continuous
videos, learning effective representations requires focusing on moments in time
when interactions take place. Second, visual representations of daily
activities should be sensitive to changes in the state of the environment.
However, current successful multi-modal learning frameworks encourage
representation invariance over time. To address these challenges, we leverage
audio signals to identify moments of likely interactions which are conducive to
better learning. We also propose a novel self-supervised objective that learns
from audible state changes caused by interactions. We validate these
contributions extensively on two large-scale egocentric datasets,
EPIC-Kitchens-100 and the recently released Ego4D, and show improvements on
several downstream tasks, including action recognition, long-term action
anticipation, and object state change classification.
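As a concrete illustration of the two ideas above, the sketch below (a minimal PyTorch sketch, not the authors' released code) picks candidate interaction moments from short-term audio energy and then contrasts the visual state change between clips sampled before and after such a moment against the co-occurring audio embedding with an InfoNCE-style loss. The detector, the encoders that would produce v_before, v_after, and a_emb, and every name here are assumptions for illustration only.

```python
# Minimal sketch (not the paper's code): audio-driven selection of likely
# interaction moments plus a simple contrastive objective over visual state
# changes. All function and variable names are hypothetical.
import torch
import torch.nn.functional as F


def detect_interaction_moments(audio_wave: torch.Tensor, sr: int,
                               win_s: float = 0.5, top_k: int = 4) -> torch.Tensor:
    """Return window indices of the top-k short-term energy peaks.

    audio_wave: (num_samples,) mono waveform. Energy peaks are a crude proxy
    for 'moments of likely interaction'; a learned detector could replace this.
    """
    win = int(sr * win_s)
    n = audio_wave.shape[0] // win
    energy = (audio_wave[: n * win].reshape(n, win) ** 2).mean(dim=1)
    return torch.topk(energy, k=min(top_k, n)).indices


def state_change_nce(v_before: torch.Tensor, v_after: torch.Tensor,
                     a_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Contrast the visual state change (after - before) against audio.

    v_before, v_after, a_emb: (B, D) features from assumed visual and audio
    encoders. Matching (change, audio) pairs in the batch are positives;
    all other pairs serve as negatives.
    """
    change = F.normalize(v_after - v_before, dim=-1)   # (B, D)
    audio = F.normalize(a_emb, dim=-1)                 # (B, D)
    logits = change @ audio.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    wave = torch.randn(16000 * 10)  # 10 s of placeholder 16 kHz audio
    print(detect_interaction_moments(wave, sr=16000))
    B, D = 8, 256
    loss = state_change_nce(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
    print(loss.item())
```

In a full system, a state-change term of this kind would typically be combined with a standard audio-visual correspondence loss, so that representations capture both which objects are interacted with and how their state changes over the interaction.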
Related papers
- SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos [77.55518265996312]
We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos.
Our multimodal contrastive-consensus coding (MC3) embedding reinforces the associations between audio, language, and vision when all modality pairs agree.
arXiv Detail & Related papers (2024-04-08T05:19:28Z)
- Towards Continual Egocentric Activity Recognition: A Multi-modal Egocentric Activity Dataset for Continual Learning [21.68009790164824]
We present a multi-modal egocentric activity dataset for continual learning named UESTC-MMEA-CL.
It contains synchronized data of videos, accelerometers, and gyroscopes, for 32 types of daily activities, performed by 10 participants.
Results of egocentric activity recognition are reported using the three modalities (RGB, acceleration, and gyroscope) separately and jointly.
arXiv Detail & Related papers (2023-01-26T04:32:00Z)
- Multi-Task Learning of Object State Changes from Uncurated Videos [55.60442251060871]
We learn to temporally localize object state changes by observing people interacting with objects in long uncurated web videos.
We show that our multi-task model achieves a relative improvement of 40% over the prior single-task methods.
We also test our method on long egocentric videos of the EPIC-KITCHENS and the Ego4D datasets in a zero-shot setup.
arXiv Detail & Related papers (2022-11-24T09:42:46Z)
- The Surprising Effectiveness of Representation Learning for Visual Imitation [12.60653315718265]
We propose to decouple representation learning from behavior learning for visual imitation.
First, we learn a visual representation encoder from offline data using standard supervised and self-supervised learning methods.
We experimentally show that this simple decoupling improves the performance of visual imitation models on both offline demonstration datasets and real-robot door opening compared to prior work in visual imitation.
arXiv Detail & Related papers (2021-12-02T18:58:09Z)
- Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively to produce a representation that emphasizes the novel information in the frame at the current time-stamp.
SRL sharply outperforms existing state-of-the-art methods in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z)
- Learning Asynchronous and Sparse Human-Object Interaction in Videos [56.73059840294019]
Asynchronous-Sparse Interaction Graph Networks (ASSIGN) is able to automatically detect the structure of interaction events associated with entities in a video scene.
ASSIGN is tested on human-object interaction recognition and shows superior performance in segmenting and labeling of human sub-activities and object affordances from raw videos.
arXiv Detail & Related papers (2021-03-03T23:43:55Z)
- Coarse Temporal Attention Network (CTA-Net) for Driver's Activity Recognition [14.07119502083967]
Driver activities are difficult to distinguish since they are executed by the same subject with similar body-part movements, resulting in only subtle changes.
Our model is named Coarse Temporal Attention Network (CTA-Net), in which coarse temporal branches are introduced in a trainable glimpse.
The model then uses an innovative attention mechanism to generate high-level action specific contextual information for activity recognition.
arXiv Detail & Related papers (2021-01-17T10:15:37Z)
- Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method achieves state-of-the-art performance on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.