Learning Fine-grained View-Invariant Representations from Unpaired
Ego-Exo Videos via Temporal Alignment
- URL: http://arxiv.org/abs/2306.05526v2
- Date: Sat, 25 Nov 2023 21:46:50 GMT
- Title: Learning Fine-grained View-Invariant Representations from Unpaired
Ego-Exo Videos via Temporal Alignment
- Authors: Zihui Xue, Kristen Grauman
- Abstract summary: We propose to learn fine-grained action features that are invariant to the viewpoints by aligning egocentric and exocentric videos in time.
To this end, we propose AE2, a self-supervised embedding approach with two key designs.
For evaluation, we establish a benchmark for fine-grained video understanding in the ego-exo context.
- Score: 71.16699226211504
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The egocentric and exocentric viewpoints of a human activity look
dramatically different, yet invariant representations to link them are
essential for many potential applications in robotics and augmented reality.
Prior work is limited to learning view-invariant features from paired
synchronized viewpoints. We relax that strong data assumption and propose to
learn fine-grained action features that are invariant to the viewpoints by
aligning egocentric and exocentric videos in time, even when not captured
simultaneously or in the same environment. To this end, we propose AE2, a
self-supervised embedding approach with two key designs: (1) an object-centric
encoder that explicitly focuses on regions corresponding to hands and active
objects; and (2) a contrastive-based alignment objective that leverages
temporally reversed frames as negative samples. For evaluation, we establish a
benchmark for fine-grained video understanding in the ego-exo context,
comprising four datasets -- including an ego tennis forehand dataset we
collected, along with dense per-frame labels we annotated for each dataset. On
the four datasets, our AE2 method strongly outperforms prior work in a variety
of fine-grained downstream tasks, both in regular and cross-view settings.
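To make the alignment objective more concrete, below is a minimal PyTorch-style sketch of a contrastive loss that treats temporally reversed exocentric frames as negatives, as described in the abstract. This is an illustrative reconstruction, not the exact AE2 objective: the function name, the assumption of frame-level embeddings z_ego and z_exo from already temporally aligned clips, and the temperature value are all placeholders introduced here.
```python
import torch
import torch.nn.functional as F


def reversed_frame_contrastive_loss(z_ego: torch.Tensor,
                                    z_exo: torch.Tensor,
                                    temperature: float = 0.1) -> torch.Tensor:
    """Contrastive alignment with time-reversed negatives (illustrative only).

    z_ego: (T, D) per-frame embeddings of an egocentric clip.
    z_exo: (T, D) per-frame embeddings of an exocentric clip assumed to be
           temporally aligned frame-by-frame with the ego clip.
    """
    z_ego = F.normalize(z_ego, dim=-1)
    z_exo = F.normalize(z_exo, dim=-1)

    # Positives: the exo frame at the same time step as each ego frame.
    pos = (z_ego * z_exo).sum(dim=-1, keepdim=True)        # (T, 1)

    # Negatives: all frames of the temporally reversed exo clip.
    # (For odd T the middle reversed frame coincides with its forward
    # counterpart; a fuller implementation would mask such cases.)
    z_exo_rev = torch.flip(z_exo, dims=[0])
    neg = z_ego @ z_exo_rev.T                               # (T, T)

    logits = torch.cat([pos, neg], dim=1) / temperature     # (T, T + 1)
    labels = torch.zeros(z_ego.size(0), dtype=torch.long)   # positive at index 0
    return F.cross_entropy(logits, labels)


# Example with random features standing in for encoder outputs.
loss = reversed_frame_contrastive_loss(torch.randn(32, 128), torch.randn(32, 128))
```
In AE2 the frame embeddings would come from the object-centric encoder that focuses on hand and active-object regions; the sketch simply treats that encoder as given.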
Related papers
- Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning [80.37314291927889]
We present EMBED, a method designed to transform exocentric video-language data for egocentric video representation learning.
Egocentric videos predominantly feature close-up hand-object interactions, whereas exocentric videos offer a broader perspective on human activities.
By applying both vision and language style transfer, our framework creates a new egocentric dataset.
arXiv Detail & Related papers (2024-08-07T06:10:45Z)
- Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos [66.46812056962567]
Exocentric-to-egocentric cross-view translation aims to generate a first-person (egocentric) view of an actor based on a video recording that captures the actor from a third-person (exocentric) perspective.
We propose a generative framework called Exo2Ego that decouples the translation process into two stages: high-level structure transformation and pixel-level hallucination.
arXiv Detail & Related papers (2024-03-11T01:00:00Z)
- Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos [27.209391862016574]
We propose a novel benchmark for cross-view knowledge transfer of dense video captioning.
We adapt models from web instructional videos with exocentric views to an egocentric view.
arXiv Detail & Related papers (2023-11-28T02:51:13Z)
- SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric Action Recognition [35.4163266882568]
We introduce Self-Supervised Learning Over Sets (SOS) to pre-train a generic Objects In Contact (OIC) representation model.
Our OIC significantly boosts the performance of multiple state-of-the-art video classification models.
arXiv Detail & Related papers (2022-04-10T23:27:19Z)
- Enhancing Egocentric 3D Pose Estimation with Third Person Views [37.9683439632693]
We propose a novel approach to enhance the 3D body pose estimation of a person computed from videos captured from a single wearable camera.
We introduce First2Third-Pose, a new paired synchronized dataset of nearly 2,000 videos depicting human activities captured from both first- and third-view perspectives.
Experimental results demonstrate that the joint multi-view embedded space learned with our dataset is useful to extract discriminatory features from arbitrary single-view egocentric videos.
arXiv Detail & Related papers (2022-01-06T11:42:01Z)
- Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively, producing a representation that emphasizes the novel information in the frame at the current time-stamp.
SRL sharply outperforms the existing state of the art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z)
- HighlightMe: Detecting Highlights from Human-Centric Videos [52.84233165201391]
We present a domain- and user-preference-agnostic approach to detect highlightable excerpts from human-centric videos.
We use an autoencoder network equipped with spatial-temporal graph convolutions to detect human activities and interactions.
We observe a 4-12% improvement in the mean average precision of matching the human-annotated highlights over state-of-the-art methods.
arXiv Detail & Related papers (2021-10-05T01:18:15Z)
- Benchmarking Unsupervised Object Representations for Video Sequences [111.81492107649889]
We compare the perceptual abilities of four object-centric approaches: ViMON, OP3, TBA and SCALOR.
Our results suggest that the architectures with unconstrained latent representations learn more powerful representations in terms of object detection, segmentation and tracking.
Our benchmark may provide fruitful guidance towards learning more robust object-centric video representations.
arXiv Detail & Related papers (2020-06-12T09:37:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.