Multi-Task Learning of Object State Changes from Uncurated Videos
- URL: http://arxiv.org/abs/2211.13500v1
- Date: Thu, 24 Nov 2022 09:42:46 GMT
- Title: Multi-Task Learning of Object State Changes from Uncurated Videos
- Authors: Tomáš Souček, Jean-Baptiste Alayrac, Antoine Miech, Ivan Laptev, and Josef Sivic
- Abstract summary: We learn to temporally localize object state changes by observing people interacting with objects in long uncurated web videos.
We show that our multi-task model achieves a relative improvement of 40% over the prior single-task methods.
We also test our method on long egocentric videos of the EPIC-KITCHENS and the Ego4D datasets in a zero-shot setup.
- Score: 55.60442251060871
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We aim to learn to temporally localize object state changes and the
corresponding state-modifying actions by observing people interacting with
objects in long uncurated web videos. We introduce three principal
contributions. First, we explore alternative multi-task network architectures
and identify a model that enables efficient joint learning of multiple object
states and actions such as pouring water and pouring coffee. Second, we design
a multi-task self-supervised learning procedure that exploits different types
of constraints between objects and state-modifying actions enabling end-to-end
training of a model for temporal localization of object states and actions in
videos from only noisy video-level supervision. Third, we report results on the
large-scale ChangeIt and COIN datasets containing tens of thousands of long
(un)curated web videos depicting various interactions such as hole drilling,
cream whisking, or paper plane folding. We show that our multi-task model
achieves a relative improvement of 40% over the prior single-task methods and
significantly outperforms both image-based and video-based zero-shot models for
this problem. We also test our method on long egocentric videos of the
EPIC-KITCHENS and the Ego4D datasets in a zero-shot setup demonstrating the
robustness of our learned model.
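The abstract names the ordering constraints only at a high level. One concrete instance of such a constraint (an initial state, then a state-modifying action, then an end state) can be written as a per-video frame-selection step over noisy per-frame scores. The sketch below is an illustration of this idea, not the authors' code; the score arrays `s_init`, `s_act`, and `s_end` are assumed to come from some frame-level classifier head.

```python
# Illustrative sketch (not the authors' implementation) of a causal
# ordering constraint: given per-frame scores for the initial state,
# the state-modifying action, and the end state, pick frames
# t1 < ta < t2 that jointly maximize the three scores. The selected
# frames can then serve as pseudo-labels for training the classifier.
import numpy as np

def best_ordered_triplet(s_init, s_act, s_end):
    """Return (t1, ta, t2) with t1 < ta < t2 maximizing
    s_init[t1] + s_act[ta] + s_end[t2] over a video of T frames."""
    T = len(s_act)
    assert T >= 3, "need at least one frame per slot"
    # prefix maxima: best initial-state score at or before each frame
    best_init = np.maximum.accumulate(s_init)
    # suffix maxima: best end-state score at or after each frame
    best_end = np.maximum.accumulate(s_end[::-1])[::-1]
    # scan action positions ta = 1..T-2; the constraint t1 < ta < t2 is
    # enforced by looking strictly before / strictly after ta
    scores = best_init[:-2] + s_act[1:-1] + best_end[2:]
    ta = int(np.argmax(scores)) + 1
    t1 = int(np.argmax(s_init[:ta]))
    t2 = ta + 1 + int(np.argmax(s_end[ta + 1:]))
    return t1, ta, t2
```

In the multi-task setting described in the abstract, one would apply such a selection per interaction category over a shared backbone; the exact constraints and losses, however, are those defined in the paper itself.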
Related papers
- Rethinking Image-to-Video Adaptation: An Object-centric Perspective [61.833533295978484]
We propose a novel and efficient image-to-video adaptation strategy from the object-centric perspective.
Inspired by human perception, we integrate a proxy task of object discovery into image-to-video transfer learning.
arXiv Detail & Related papers (2024-07-09T13:58:10Z)
- MINOTAUR: Multi-task Video Grounding From Multimodal Queries [70.08973664126873]
We present a single, unified model for tackling query-based video understanding in long-form videos.
In particular, our model can address all three tasks of the Ego4D Episodic Memory benchmark.
arXiv Detail & Related papers (2023-02-16T04:00:03Z)
- Learning State-Aware Visual Representations from Audible Interactions [39.08554113807464]
We propose a self-supervised algorithm to learn representations from egocentric video data.
We use audio signals to identify moments of likely interaction, which are conducive to better learning; a minimal sketch of this idea follows this entry.
We validate these contributions extensively on two large-scale egocentric datasets.
arXiv Detail & Related papers (2022-09-27T17:57:13Z)
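The entry above only names the idea of using audio to find likely interaction moments. A minimal, hypothetical version of that idea is to treat peaks in short-time audio energy as candidate interaction timestamps; the window length and threshold below are illustrative choices, not values from the paper.

```python
# Hypothetical sketch: use short-time audio energy peaks as candidate
# moments of physical interaction (e.g., object-contact sounds).
import numpy as np

def candidate_interaction_times(waveform, sr, win_s=0.5, k=2.0):
    """Return timestamps (seconds) of windows whose energy exceeds
    the mean by k standard deviations."""
    win = int(win_s * sr)
    n = len(waveform) // win
    frames = waveform[: n * win].reshape(n, win)
    energy = (frames ** 2).mean(axis=1)        # per-window energy
    thresh = energy.mean() + k * energy.std()  # simple peak threshold
    return [(i + 0.5) * win_s for i in np.flatnonzero(energy > thresh)]
```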
- SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video Anomaly Detection [108.57862846523858]
We revisit the self-supervised multi-task learning framework, proposing several updates to the original method.
We modernize the 3D convolutional backbone by introducing multi-head self-attention modules; a minimal sketch of this pairing follows this entry.
To further improve the model, we also study additional self-supervised learning tasks, such as predicting segmentation maps.
arXiv Detail & Related papers (2022-07-16T19:25:41Z)
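The SSMTL++ entry mentions adding multi-head self-attention to a 3D convolutional backbone. The sketch below shows one generic way to attach attention to 3D CNN features by flattening the spatiotemporal grid into tokens; the channel and head counts are assumptions, not the SSMTL++ configuration.

```python
# Illustrative sketch: multi-head self-attention over the flattened
# spatiotemporal feature map of a small 3D convolutional stem.
import torch
import torch.nn as nn

class Conv3DWithMHSA(nn.Module):
    def __init__(self, in_ch=3, dim=64, heads=4):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv3d(in_ch, dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):          # x: (B, C, T, H, W)
        f = self.stem(x)           # (B, dim, T', H', W')
        B, D, T, H, W = f.shape
        tokens = f.flatten(2).transpose(1, 2)    # (B, T'*H'*W', dim)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attn_out)    # residual + norm
        return tokens.transpose(1, 2).reshape(B, D, T, H, W)
```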
- SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric Action Recognition [35.4163266882568]
We introduce Self-Supervised Learning Over Sets (SOS) to pre-train a generic Objects In Contact (OIC) representation model.
Our OIC significantly boosts the performance of multiple state-of-the-art video classification models.
arXiv Detail & Related papers (2022-04-10T23:27:19Z)
- Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos [55.60442251060871]
Human actions often induce changes of object states, such as "cutting an apple" or "pouring coffee".
We develop a self-supervised model for jointly learning state-modifying actions together with the corresponding object states.
To cope with noisy uncurated training data, our model incorporates a noise adaptive weighting module supervised by a small number of annotated still images; a minimal sketch of this idea follows this entry.
arXiv Detail & Related papers (2022-03-22T11:45:10Z)
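The noise adaptive weighting module is described only at the level of the sentence above. One way such weighting can work, sketched under assumptions (the classifier interface and the normalization below are hypothetical), is to let a small image classifier trained on the annotated still images score the mined state frames and down-weight videos whose frames look implausible.

```python
# Hypothetical sketch of noise-adaptive loss weighting: a small image
# classifier, fit on a few annotated still images, scores the frames a
# video mined as its end state; low-confidence (likely noisy) videos
# then contribute less to the batch loss.
import torch

def noise_adaptive_loss(per_video_losses, mined_state_frames, image_clf):
    """per_video_losses: (B,) localization losses, one per video.
    mined_state_frames: (B, C, H, W) frames pseudo-labeled as end state.
    image_clf: assumed to map frames to P(end state) in [0, 1]."""
    with torch.no_grad():
        w = image_clf(mined_state_frames).squeeze(-1)  # (B,) confidences
    w = w / (w.sum() + 1e-8)       # normalize so weights sum to 1
    return (w * per_video_losses).sum()
```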
- Anomaly Detection in Video via Self-Supervised and Multi-Task Learning [113.81927544121625]
Anomaly detection in video is a challenging computer vision problem.
In this paper, we approach anomalous event detection in video through self-supervised and multi-task learning at the object level.
arXiv Detail & Related papers (2020-11-15T10:21:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.