My View is the Best View: Procedure Learning from Egocentric Videos
- URL: http://arxiv.org/abs/2207.10883v1
- Date: Fri, 22 Jul 2022 05:28:11 GMT
- Title: My View is the Best View: Procedure Learning from Egocentric Videos
- Authors: Siddhant Bansal, Chetan Arora, C.V. Jawahar
- Abstract summary: Existing approaches commonly use third-person videos for learning the procedure.
We observe that videos obtained from first-person (egocentric) wearable cameras provide an unobstructed and clear view of the action.
We present a novel self-supervised Correspond and Cut framework for procedure learning.
- Score: 31.385646424154732
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Procedure learning involves identifying the key-steps and determining their
logical order to perform a task. Existing approaches commonly use third-person
videos for learning the procedure, making the manipulated object small in
appearance and often occluded by the actor, leading to significant errors. In
contrast, we observe that videos obtained from first-person (egocentric)
wearable cameras provide an unobstructed and clear view of the action. However,
procedure learning from egocentric videos is challenging because (a) the camera
view undergoes extreme changes due to the wearer's head motion, and (b) the
presence of unrelated frames due to the unconstrained nature of the videos. As a
result, the assumptions made by current state-of-the-art methods, namely that the
actions occur at approximately the same time and are of the same duration, do not hold.
Instead, we propose to use the signal provided by the temporal correspondences
between key-steps across videos. To this end, we present a novel
self-supervised Correspond and Cut (CnC) framework for procedure learning. CnC
identifies and utilizes the temporal correspondences between the key-steps
across multiple videos to learn the procedure. Our experiments show that CnC
outperforms the state-of-the-art on the benchmark ProceL and CrossTask datasets
by 5.2% and 6.3%, respectively. Furthermore, for procedure learning using
egocentric videos, we propose the EgoProceL dataset consisting of 62 hours of
videos captured by 130 subjects performing 16 tasks. The source code and the
dataset are available on the project page https://sid2697.github.io/egoprocel/.
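The abstract describes the core signal, temporal correspondences between key-steps across videos, only at a high level. The sketch below is a hypothetical illustration of that idea, not the authors' CnC modules: it assumes a soft nearest-neighbour, cycle-consistency-style correspondence between frame embeddings of two videos of the same task, and an off-the-shelf k-means partitioning of frames into K candidate key-steps. The frame encoder is omitted, and the function names, temperature, and number of key-steps K are all assumptions for illustration.

```python
# Hypothetical sketch of a correspondence-based key-step signal.
# Not the paper's CnC implementation; encoder, loss, and clustering
# choices are illustrative stand-ins.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans


def soft_correspondence(z1, z2, temperature=0.1):
    """Soft nearest-neighbour assignment of each frame in video 1 to video 2.

    z1: (N1, D) frame embeddings of video 1
    z2: (N2, D) frame embeddings of video 2
    Returns an (N1, N2) row-stochastic matrix of correspondence weights.
    """
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    sim = z1 @ z2.t() / temperature
    return sim.softmax(dim=-1)


def cycle_consistency_loss(z1, z2, temperature=0.1):
    """Penalise frames whose soft 1 -> 2 -> 1 cycle does not return to itself."""
    a12 = soft_correspondence(z1, z2, temperature)  # (N1, N2)
    a21 = soft_correspondence(z2, z1, temperature)  # (N2, N1)
    cycle = a12 @ a21                               # (N1, N1), rows sum to 1
    target = torch.arange(z1.shape[0])
    return F.nll_loss(torch.log(cycle + 1e-8), target)


def assign_key_steps(z, num_steps=7):
    """Group frame embeddings into candidate key-steps with k-means
    (a stand-in for the partitioning stage; K is task-dependent)."""
    return KMeans(n_clusters=num_steps, n_init=10).fit_predict(z.detach().cpu().numpy())


if __name__ == "__main__":
    # Toy usage: random tensors stand in for frame embeddings from an encoder.
    v1 = torch.randn(120, 128)  # 120 frames, 128-d embeddings
    v2 = torch.randn(150, 128)
    print("cycle loss:", cycle_consistency_loss(v1, v2).item())
    print("key-step labels (first 10 frames):", assign_key_steps(v1)[:10])
```

In this framing, the cycle-consistency term supplies the cross-video correspondence signal mentioned in the abstract, while the clustering step stands in for grouping corresponding frames into key-steps.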
Related papers
- Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment [53.12952107996463]
This work proposes a novel training framework for learning to localize temporal boundaries of procedure steps in training videos.
Motivated by the strong capabilities of Large Language Models (LLMs) in procedure understanding and text summarization, we first apply an LLM to filter out task-irrelevant information and summarize task-related procedure steps from narrations.
To further generate reliable pseudo-matching between the LLM-steps and the video for training, we propose the Multi-Pathway Text-Video Alignment (MPTVA) strategy.
arXiv Detail & Related papers (2024-09-22T18:40:55Z)
- VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation [87.13210748484217]
VideoCutLER is a simple method for unsupervised multi-instance video segmentation without using motion-based learning signals like optical flow or training on natural videos.
We show the first competitive unsupervised learning results on the challenging YouTubeVIS 2019 benchmark, achieving 50.7% APvideo50.
VideoCutLER can also serve as a strong pretrained model for supervised video instance segmentation tasks, exceeding DINO by 15.9% on YouTubeVIS 2019 in terms of APvideo.
arXiv Detail & Related papers (2023-08-28T17:10:12Z)
- Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations [79.87044240860466]
We propose a novel approach that incorporates temporal consistency in dense self-supervised learning.
Our approach, which we call time-tuning, starts from image-pretrained models and fine-tunes them with a novel self-supervised temporal-alignment clustering loss on unlabeled videos.
Time-tuning improves the state-of-the-art by 8-10% for unsupervised semantic segmentation on videos and matches it for images.
arXiv Detail & Related papers (2023-08-22T21:28:58Z)
- Video alignment using unsupervised learning of local and global features [0.0]
We introduce an unsupervised method for alignment that uses global and local features of the frames.
In particular, we introduce effective features for each video frame by means of three machine vision tools: person detection, pose estimation, and a VGG network.
The main advantage of our approach is that no training is required, which makes it applicable for any new type of action without any need to collect training samples for it.
arXiv Detail & Related papers (2023-04-13T22:20:54Z)
- Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations [22.723309913388196]
We learn video representation that encodes both action steps and their temporal ordering, based on a large-scale dataset of web instructional videos and their narrations.
Our method jointly learns a video representation to encode individual step concepts, and a deep probabilistic model to capture both temporal dependencies and immense individual variations in the step ordering.
arXiv Detail & Related papers (2023-03-31T07:02:26Z)
- Multi-Task Learning of Object State Changes from Uncurated Videos [55.60442251060871]
We learn to temporally localize object state changes by observing people interacting with objects in long uncurated web videos.
We show that our multi-task model achieves a relative improvement of 40% over the prior single-task methods.
We also test our method on long egocentric videos of the EPIC-KITCHENS and the Ego4D datasets in a zero-shot setup.
arXiv Detail & Related papers (2022-11-24T09:42:46Z)
- Learning To Recognize Procedural Activities with Distant Supervision [96.58436002052466]
We consider the problem of classifying fine-grained, multi-step activities from long videos spanning up to several minutes.
Our method uses a language model to match noisy, automatically-transcribed speech from the video to step descriptions in the knowledge base.
arXiv Detail & Related papers (2022-01-26T15:06:28Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Exploring Relations in Untrimmed Videos for Self-Supervised Learning [17.670226952829506]
Existing self-supervised learning methods mainly rely on trimmed videos for model training.
We propose a novel self-supervised method, referred to as Exploring Relations in Untrimmed Videos (ERUV).
ERUV learns richer representations and outperforms state-of-the-art self-supervised methods by significant margins.
arXiv Detail & Related papers (2020-08-06T15:29:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.