Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos
- URL: http://arxiv.org/abs/2203.11637v1
- Date: Tue, 22 Mar 2022 11:45:10 GMT
- Title: Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos
- Authors: Tomáš Souček, Jean-Baptiste Alayrac, Antoine Miech, Ivan Laptev, Josef Sivic
- Abstract summary: Human actions often induce changes of object states such as "cutting an apple" or "pouring coffee".
We develop a self-supervised model for jointly learning state-modifying actions together with the corresponding object states.
To cope with noisy uncurated training data, our model incorporates a noise adaptive weighting module supervised by a small number of annotated still images.
- Score: 55.60442251060871
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human actions often induce changes of object states such as "cutting an
apple", "cleaning shoes" or "pouring coffee". In this paper, we seek to
temporally localize object states (e.g. "empty" and "full" cup) together with
the corresponding state-modifying actions ("pouring coffee") in long uncurated
videos with minimal supervision. The contributions of this work are threefold.
First, we develop a self-supervised model for jointly learning state-modifying
actions together with the corresponding object states from an uncurated set of
videos from the Internet. The model is self-supervised by the causal ordering
signal, i.e. initial object state $\rightarrow$ manipulating action
$\rightarrow$ end state. Second, to cope with noisy uncurated training data,
our model incorporates a noise adaptive weighting module supervised by a small
number of annotated still images, which allows irrelevant videos to be
efficiently filtered out during training. Third, we collect a new dataset with more
than 2600 hours of video and 34 thousand changes of object states, and manually
annotate a part of this data to validate our approach. Our results demonstrate
substantial improvements over prior work in both action and object state
recognition in video.
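The self-supervision described above, i.e. the causal ordering signal combined with noise adaptive weighting, can be illustrated with a short sketch. The PyTorch fragment below is a hypothetical illustration under assumed names and tensor shapes, not the authors' released code: per-frame scores for the initial state, the action, and the end state are combined by selecting the highest-scoring temporally ordered triple of frames, and the resulting pseudo-label loss is scaled by a per-video noise weight.

```python
import torch
import torch.nn.functional as F


def ordered_pseudo_labels(state1_scores, action_scores, state2_scores):
    """Pick frame indices (i, j, k) with i < j < k that maximize
    state1_scores[i] + action_scores[j] + state2_scores[k], i.e. the most
    confident triple respecting initial state -> action -> end state."""
    T = action_scores.shape[0]
    # Best initial-state frame at or before each position.
    pre_val, pre_idx = torch.cummax(state1_scores, dim=0)
    # Best end-state frame at or after each position (cummax over the reversed
    # sequence, mapped back to original indices).
    rev_val, rev_idx = torch.cummax(state2_scores.flip(0), dim=0)
    suf_val = rev_val.flip(0)
    suf_idx = (T - 1) - rev_idx.flip(0)

    best_sum, best_triple = None, None
    for j in range(1, T - 1):  # candidate action frame
        total = pre_val[j - 1] + action_scores[j] + suf_val[j + 1]
        if best_sum is None or total > best_sum:
            best_sum = total
            best_triple = (int(pre_idx[j - 1]), j, int(suf_idx[j + 1]))
    return best_triple


def video_loss(state_logits, action_logits, triple, noise_weight):
    """Cross-entropy on the selected frames, scaled by a per-video weight
    standing in for the noise adaptive weighting module (videos judged
    irrelevant by a small image-supervised classifier get weights near zero)."""
    i, j, k = triple
    loss = (
        F.cross_entropy(state_logits[i : i + 1], torch.tensor([0]))     # initial state
        + F.cross_entropy(action_logits[j : j + 1], torch.tensor([1]))  # action occurs here
        + F.cross_entropy(state_logits[k : k + 1], torch.tensor([1]))   # end state
    )
    return noise_weight * loss
```

Here state1_scores, action_scores and state2_scores are assumed to be per-frame scores of shape [T], and state_logits / action_logits per-frame two-class logits of shape [T, 2]; in this sketch the noise weight would come from a small classifier trained on the annotated still images.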
Related papers
- Rethinking Image-to-Video Adaptation: An Object-centric Perspective [61.833533295978484]
We propose a novel and efficient image-to-video adaptation strategy from the object-centric perspective.
Inspired by human perception, we integrate a proxy task of object discovery into image-to-video transfer learning.
arXiv Detail & Related papers (2024-07-09T13:58:10Z)
- Temporally Consistent Object Editing in Videos using Extended Attention [9.605596668263173]
We propose a method to edit videos using a pre-trained inpainting image diffusion model.
We ensure that the edited information is consistent across all video frames.
arXiv Detail & Related papers (2024-06-01T02:31:16Z)
- Multi-Task Learning of Object State Changes from Uncurated Videos [55.60442251060871]
We learn to temporally localize object state changes by observing people interacting with objects in long uncurated web videos.
We show that our multi-task model achieves a relative improvement of 40% over prior single-task methods.
We also test our method on long egocentric videos from the EPIC-KITCHENS and Ego4D datasets in a zero-shot setup.
arXiv Detail & Related papers (2022-11-24T09:42:46Z)
- Is an Object-Centric Video Representation Beneficial for Transfer? [86.40870804449737]
We introduce a new object-centric video recognition model based on a transformer architecture.
We show that the object-centric model outperforms prior video representations.
arXiv Detail & Related papers (2022-07-20T17:59:44Z)
- Few-Shot Learning for Video Object Detection in a Transfer-Learning Scheme [70.45901040613015]
We study the new problem of few-shot learning for video object detection.
We employ a transfer-learning framework to effectively train the video object detector on a large number of base-class objects and a few video clips of novel-class objects.
arXiv Detail & Related papers (2021-03-26T20:37:55Z)
- Towards Improving Spatiotemporal Action Recognition in Videos [0.0]
Motivated by the latest state-of-the-art real-time object detector You Only Watch Once (YOWO), we aim to modify its structure to increase action detection precision.
We propose four novel approaches to improve YOWO and to address the class imbalance issue in videos.
arXiv Detail & Related papers (2020-12-15T05:21:50Z)
- Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion [86.56202610716504]
Action categories are highly correlated with the scenes where they happen, which makes models tend to degrade to solutions that encode only scene information.
We propose to decouple the scene and the motion (DSM) with two simple operations so that the model pays better attention to motion information.
arXiv Detail & Related papers (2020-09-12T09:54:11Z)