Self-Supervised Representation Learning from Flow Equivariance
- URL: http://arxiv.org/abs/2101.06553v1
- Date: Sat, 16 Jan 2021 23:44:09 GMT
- Title: Self-Supervised Representation Learning from Flow Equivariance
- Authors: Yuwen Xiong, Mengye Ren, Wenyuan Zeng, Raquel Urtasun
- Abstract summary: We present a new self-supervised representation learning framework that can be directly deployed on a video stream of complex scenes.
Our representations, learned from high-resolution raw video, can be readily used for downstream tasks on static images.
- Score: 97.13056332559526
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised representation learning is able to learn semantically
meaningful features; however, much of its recent success relies on multiple
crops of an image with very few objects. Instead of learning view-invariant
representations from simple images, humans learn representations in a complex
world with changing scenes by observing object movement, deformation, pose
variation, and ego motion. Motivated by this ability, we present a new
self-supervised representation learning framework that can be directly deployed
on a video stream of complex scenes with many moving objects. Our framework
features a simple flow equivariance objective that encourages the network to
predict the features of another frame by applying a flow transformation to the
features of the current frame. Our representations, learned from
high-resolution raw video, can be readily used for downstream tasks on static
images. Readout experiments on challenging semantic segmentation, instance
segmentation, and object detection benchmarks show that we are able to
outperform representations obtained from previous state-of-the-art methods
including SimCLR and BYOL.
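The flow equivariance objective lends itself to a short sketch: warp the features of the current frame with the optical flow between the two frames, then ask the warped features to match the features of the next frame. Below is a minimal PyTorch illustration under stated assumptions: `encoder` is a generic fully convolutional network, the flow is precomputed and resized to the feature map's resolution, and negative cosine similarity stands in for the paper's actual loss.

```python
import torch
import torch.nn.functional as F

def warp_features(feat, flow):
    """Backward-warp a feature map with a dense flow field.

    feat: (N, C, H, W) features of frame t
    flow: (N, 2, H, W) flow from frame t+1 back to frame t, in feature-map
          pixels (assumed already resized and rescaled to this resolution)
    """
    n, _, h, w = feat.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat.device, dtype=feat.dtype),
        torch.arange(w, device=feat.device, dtype=feat.dtype),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]  # displaced x coordinates
    grid_y = ys.unsqueeze(0) + flow[:, 1]  # displaced y coordinates
    # Normalize to [-1, 1] as grid_sample expects.
    grid = torch.stack(
        (2.0 * grid_x / (w - 1) - 1.0, 2.0 * grid_y / (h - 1) - 1.0), dim=-1
    )
    return F.grid_sample(feat, grid, align_corners=True)

def flow_equivariance_loss(encoder, frame_t, frame_t1, flow):
    """Encourage encoder(frame_t1) to match warp(encoder(frame_t), flow)."""
    feat_t = encoder(frame_t)       # (N, C, H, W)
    feat_t1 = encoder(frame_t1)     # (N, C, H, W)
    predicted_t1 = warp_features(feat_t, flow)
    # Negative cosine similarity per spatial location; one plausible choice,
    # not necessarily the loss used in the paper.
    return -F.cosine_similarity(predicted_t1, feat_t1, dim=1).mean()
```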
Related papers
- Transforming Static Images Using Generative Models for Video Salient Object Detection [15.701293552584863]
We show that image-to-video diffusion models can generate realistic transformations of static images while understanding the contextual relationships between image components.
This ability allows the model to generate plausible optical flows, preserving semantic integrity while reflecting the independent motion of scene elements.
Our approach achieves state-of-the-art performance across all public benchmark datasets, outperforming existing approaches.
arXiv Detail & Related papers (2024-11-21T09:41:33Z)
- Diffusion Self-Guidance for Controllable Image Generation [106.59989386924136]
Self-guidance provides greater control over generated images by guiding the internal representations of diffusion models.
We show how a simple set of properties can be composed to perform challenging image manipulations.
We also show that self-guidance can be used to edit real images.
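At a high level, self-guidance steers sampling by adding the gradient of a scalar property of the model's internal activations to the denoising direction. The sketch below is hypothetical, not the paper's API: `model` is assumed to return both the noise prediction and its internal features, and `prop_fn` is any differentiable scalar property of those features (for instance, a statistic of an attention map).

```python
import torch

def self_guided_noise(model, prop_fn, x_t, t, scale):
    """One guided denoising step (hypothetical interface).

    model:   assumed to return (noise_prediction, internal_features)
    prop_fn: differentiable scalar property of those features
    scale:   guidance strength
    """
    x_t = x_t.detach().requires_grad_(True)
    eps, feats = model(x_t, t)
    g = prop_fn(feats)                     # e.g. size or position of an object
    grad = torch.autograd.grad(g, x_t)[0]  # how the property changes with x_t
    return eps + scale * grad              # biased noise estimate steers sampling
```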
arXiv Detail & Related papers (2023-06-01T17:59:56Z)
- Learning Transferable Pedestrian Representation from Multimodal Information Supervision [174.5150760804929]
VAL-PAT is a novel framework that learns transferable representations to enhance various pedestrian analysis tasks with multimodal information.
We first perform pre-training on LUPerson-TA dataset, where each image contains text and attribute annotations.
We then transfer the learned representations to various downstream tasks, including person reID, person attribute recognition and text-based person search.
arXiv Detail & Related papers (2023-04-12T01:20:58Z)
- Understanding Self-Supervised Pretraining with Part-Aware Representation Learning [88.45460880824376]
We study whether self-supervised representation pretraining methods learn part-aware representations.
Results show that the fully-supervised model outperforms self-supervised models for object-level recognition.
arXiv Detail & Related papers (2023-01-27T18:58:42Z)
- Learning Explicit Object-Centric Representations with Vision Transformers [81.38804205212425]
We build on the self-supervision task of masked autoencoding and explore its effectiveness for learning object-centric representations with transformers.
We show that the model efficiently learns to decompose simple scenes as measured by segmentation metrics on several multi-object benchmarks.
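Masked autoencoding itself reduces to a simple preprocessing step: split each image into patches, hide a large random fraction, and train the model to reconstruct the hidden patches from the visible ones. The helper below is a generic PyTorch sketch of that masking step; the `random_patch_mask` name, patch size, and mask ratio are illustrative, not the paper's settings.

```python
import torch

def random_patch_mask(images, patch_size=16, mask_ratio=0.75):
    """Patchify images and drop a random subset of patches.

    images: (N, C, H, W) with H and W divisible by patch_size
    Returns the visible patches and a boolean mask over all patches
    (True = masked), which the decoder is trained to reconstruct.
    """
    n, c, h, w = images.shape
    p = patch_size
    patches = (
        images.unfold(2, p, p).unfold(3, p, p)  # (N, C, H/p, W/p, p, p)
        .permute(0, 2, 3, 1, 4, 5)
        .reshape(n, (h // p) * (w // p), c * p * p)
    )
    num_patches = patches.shape[1]
    num_keep = int(num_patches * (1 - mask_ratio))
    # Independent random permutation per image; keep the first num_keep.
    keep_idx = torch.rand(n, num_patches, device=images.device).argsort(dim=1)[:, :num_keep]
    visible = torch.gather(
        patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, patches.shape[-1])
    )
    mask = torch.ones(n, num_patches, dtype=torch.bool, device=images.device)
    mask.scatter_(1, keep_idx, False)  # False = visible, True = masked
    return visible, mask
```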
arXiv Detail & Related papers (2022-10-25T16:39:49Z)
- Robust and Controllable Object-Centric Learning through Energy-based Models [95.68748828339059]
Ours is a conceptually simple and general approach to learning object-centric representations through an energy-based model.
We show that our approach can be easily integrated into existing architectures and can effectively extract high-quality object-centric representations.
arXiv Detail & Related papers (2022-10-11T15:11:15Z)
- Matching Multiple Perspectives for Efficient Representation Learning [0.0]
We present an approach that combines self-supervised learning with a multi-perspective matching technique.
We show that the availability of multiple views of the same object combined with a variety of self-supervised pretraining algorithms can lead to improved object classification performance.
arXiv Detail & Related papers (2022-08-16T10:33:13Z)
- On the robustness of self-supervised representations for multi-view object classification [0.0]
We show that self-supervised representations based on the instance discrimination objective lead to object representations that are more robust to changes in viewpoint and perspective.
We find that these representations encode more pertinent information about objects, facilitating the recognition of objects from novel views.
arXiv Detail & Related papers (2022-07-27T17:24:55Z)
- Demystifying Contrastive Self-Supervised Learning: Invariances, Augmentations and Dataset Biases [34.02639091680309]
Recent gains in performance come from training instance classification models, treating each image and its augmented versions as samples of a single class.
We demonstrate that approaches like MoCo and PIRL learn occlusion-invariant representations.
We further demonstrate that these approaches obtain additional gains from access to a clean, object-centric training dataset like ImageNet.
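The instance-classification objective described here treats each image's augmented view as its only positive, with the rest of the batch as negatives. A minimal in-batch sketch (SimCLR-style rather than the exact MoCo or PIRL machinery):

```python
import torch
import torch.nn.functional as F

def instance_discrimination_loss(z1, z2, temperature=0.1):
    """In-batch instance discrimination.

    z1, z2: (N, D) embeddings of two augmentations of the same N images.
    Each image's other view is its positive; all other images are negatives.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature               # (N, N) similarities
    targets = torch.arange(z1.shape[0], device=z1.device)
    return F.cross_entropy(logits, targets)          # diagonal = positives
```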
arXiv Detail & Related papers (2020-07-28T00:11:31Z)
- Video Representation Learning by Recognizing Temporal Transformations [37.59322456034611]
We introduce a novel self-supervised learning approach that learns video representations responsive to changes in motion dynamics.
We promote accurate learning of motion without human annotation by training a neural network to discriminate a video sequence from its temporally transformed versions; a minimal sketch follows this entry.
Our experiments show that networks trained with the proposed method yield representations with improved transfer performance for action recognition.
arXiv Detail & Related papers (2020-07-21T11:43:01Z)
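As a rough sketch of that pretext task: apply one of a few temporal transformations to a clip, then train a classifier to recover which transformation was applied. The transformation set below is illustrative and does not necessarily match the paper's choices.

```python
import torch

def temporal_transform(clip, label):
    """Apply one of four temporal transformations to a clip.

    clip:  (T, C, H, W) sequence of frames
    label: 0 = identity, 1 = reversed, 2 = double speed, 3 = shuffled
    A network is then trained to recover `label` from the transformed clip.
    """
    if label == 0:
        return clip
    if label == 1:
        return clip.flip(0)   # play backwards
    if label == 2:
        return clip[::2]      # skip every other frame (yields a shorter clip)
    return clip[torch.randperm(clip.shape[0], device=clip.device)]
```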
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.