Self-Supervision by Prediction for Object Discovery in Videos
- URL: http://arxiv.org/abs/2103.05669v1
- Date: Tue, 9 Mar 2021 19:14:33 GMT
- Title: Self-Supervision by Prediction for Object Discovery in Videos
- Authors: Beril Besbinar, Pascal Frossard
- Abstract summary: In this paper, we use the prediction task as self-supervision and build a novel object-centric model for image sequence representation.
Our framework can be trained without the help of any manual annotation or pretrained network.
Initial experiments confirm that the proposed pipeline is a promising step towards object-centric video prediction.
- Score: 62.87145010885044
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite their irresistible success, deep learning algorithms still heavily
rely on annotated data. On the other hand, unsupervised settings pose many
challenges, especially about determining the right inductive bias in diverse
scenarios. One scalable solution is to make the model generate the supervision
for itself by leveraging some part of the input data, which is known as
self-supervised learning. In this paper, we use the prediction task as
self-supervision and build a novel object-centric model for image sequence
representation. In addition to disentangling the notion of objects and the
motion dynamics, our compositional structure explicitly handles occlusion and
inpaints inferred objects and background for the composition of the predicted
frame. With the aid of auxiliary loss functions that promote spatially and
temporally consistent object representations, our self-supervised framework can
be trained without the help of any manual annotation or pretrained network.
Initial experiments confirm that the proposed pipeline is a promising step
towards object-centric video prediction.
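The abstract describes composing the predicted frame from inferred objects and a background while handling occlusion. As a rough illustration only, and not the authors' implementation, the following NumPy sketch shows one common way such a composition can be done: soft masks are alpha-blended back to front, so nearer layers hide farther ones. The function name `compose_frame`, the layer ordering convention, and the toy inputs are all assumptions made for this example.

```python
import numpy as np

def compose_frame(background, object_rgbs, object_masks):
    """Composite object layers over a background, back to front.

    Hypothetical helper for illustration; not taken from the paper.
    background:   (H, W, 3) float array in [0, 1]
    object_rgbs:  list of (H, W, 3) float arrays, ordered farthest to nearest
    object_masks: list of (H, W) float arrays in [0, 1], same order
    """
    frame = background.astype(np.float64).copy()
    for rgb, mask in zip(object_rgbs, object_masks):
        alpha = mask[..., None]  # broadcast the mask over the colour channels
        # A nearer layer overwrites whatever is already in the frame,
        # which is how occlusion is resolved in this sketch.
        frame = alpha * rgb + (1.0 - alpha) * frame
    return frame

# Toy example: a red square partly hidden by a nearer blue square.
H, W = 4, 4
background = np.zeros((H, W, 3))
red = np.zeros((H, W, 3))
red[..., 0] = 1.0
blue = np.zeros((H, W, 3))
blue[..., 2] = 1.0
far_mask = np.zeros((H, W))
far_mask[0:3, 0:3] = 1.0
near_mask = np.zeros((H, W))
near_mask[1:4, 1:4] = 1.0

frame = compose_frame(background, [red, blue], [far_mask, near_mask])
```

In the overlap region the blue layer wins because it is composited last; a learned model would instead have to infer both the masks and the ordering.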
Related papers
- Object-centric Video Representation for Long-term Action Anticipation [33.115854386196126]
The key motivation is that objects provide important cues to recognize and predict human-object interactions.
We propose to build object-centric video representations by leveraging visual-language pretrained models.
To recognize and predict human-object interactions, we use a Transformer-based neural architecture.
arXiv Detail & Related papers (2023-10-31T22:54:31Z)
- Point Contrastive Prediction with Semantic Clustering for Self-Supervised Learning on Point Cloud Videos [71.20376514273367]
We propose a unified point cloud video self-supervised learning framework for object-centric and scene-centric data.
Our method outperforms supervised counterparts on a wide range of downstream tasks.
arXiv Detail & Related papers (2023-08-18T02:17:47Z)
- Does Visual Pretraining Help End-to-End Reasoning? [81.4707017038019]
We investigate whether end-to-end learning of visual reasoning can be achieved with general-purpose neural networks.
We propose a simple and general self-supervised framework which "compresses" each video frame into a small set of tokens.
We observe that pretraining is essential to achieve compositional generalization for end-to-end visual reasoning.
arXiv Detail & Related papers (2023-07-17T14:08:38Z)
- Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
- Learning Invariant World State Representations with Predictive Coding [1.8963850600275547]
We develop a new predictive coding-based architecture and a hybrid fully-supervised/self-supervised learning method.
We evaluate the robustness of our model on a new synthetic dataset.
arXiv Detail & Related papers (2022-07-06T21:08:30Z)
- KINet: Unsupervised Forward Models for Robotic Pushing Manipulation [8.572983995175909]
We introduce KINet -- an unsupervised framework to reason about object interactions based on a keypoint representation.
Our model learns to associate objects with keypoint coordinates and discovers a graph representation of the system.
By learning to perform physical reasoning in the keypoint space, our model automatically generalizes to scenarios with a different number of objects.
arXiv Detail & Related papers (2022-02-18T03:32:08Z)
- Learning Actor-centered Representations for Action Localization in Streaming Videos using Predictive Learning [18.757368441841123]
Event perception tasks such as recognizing and localizing actions in streaming videos are essential for tackling visual understanding tasks.
We tackle the problem of learning actor-centered representations through the notion of continual hierarchical predictive learning.
Inspired by cognitive theories of event perception, we propose a novel, self-supervised framework.
arXiv Detail & Related papers (2021-04-29T06:06:58Z)
- Progressive Self-Guided Loss for Salient Object Detection [102.35488902433896]
We present a progressive self-guided loss function to facilitate deep learning-based salient object detection in images.
Our framework takes advantage of adaptively aggregated multi-scale features to locate and detect salient objects effectively.
arXiv Detail & Related papers (2021-01-07T07:33:38Z)
- Self-supervised Segmentation via Background Inpainting [96.10971980098196]
We introduce a self-supervised detection and segmentation approach that can work with single images captured by a potentially moving camera.
We exploit a self-supervised loss function to train a proposal-based segmentation network.
We apply our method to human detection and segmentation in images that visually depart from those of standard benchmarks and outperform existing self-supervised methods.
arXiv Detail & Related papers (2020-11-11T08:34:40Z)
- Motion Segmentation using Frequency Domain Transformer Networks [29.998917158604694]
We propose a novel end-to-end learnable architecture that predicts the next frame by modeling foreground and background separately.
Our approach can outperform some widely used video prediction methods like Video Ladder Network and Predictive Gated Pyramids on synthetic data.
arXiv Detail & Related papers (2020-04-18T15:05:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.