PooDLe: Pooled and dense self-supervised learning from naturalistic videos
- URL: http://arxiv.org/abs/2408.11208v1
- Date: Tue, 20 Aug 2024 21:40:48 GMT
- Title: PooDLe: Pooled and dense self-supervised learning from naturalistic videos
- Authors: Alex N. Wang, Christopher Hoang, Yuwen Xiong, Yann LeCun, Mengye Ren
- Abstract summary: We propose a novel approach that combines an invariance-based SSL objective on pooled representations with a dense SSL objective.
We validate our approach on the BDD100K driving video dataset and the Walking Tours first-person video dataset.
- Score: 32.656425302538835
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning has driven significant progress in learning from single-subject, iconic images. However, there are still unanswered questions about the use of minimally-curated, naturalistic video data, which contain dense scenes with many independent objects, imbalanced class distributions, and varying object sizes. In this paper, we propose a novel approach that combines an invariance-based SSL objective on pooled representations with a dense SSL objective that enforces equivariance to optical flow warping. Our findings indicate that a unified objective applied at multiple feature scales is essential for learning effective image representations from high-resolution, naturalistic videos. We validate our approach on the BDD100K driving video dataset and the Walking Tours first-person video dataset, demonstrating its ability to capture spatial understanding from a dense objective and semantic understanding via a pooled representation objective.
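The combination described in the abstract (an invariance objective on pooled representations plus a dense objective that enforces equivariance to optical flow warping) can be illustrated with a minimal sketch. This is not the paper's implementation: the loss forms, the nearest-neighbour warping, and the weighting factor `lam` are all simplifying assumptions for illustration.

```python
import numpy as np

def pooled_invariance_loss(f1, f2):
    """Cosine distance between globally pooled features of two views (C, H, W)."""
    p1, p2 = f1.mean(axis=(1, 2)), f2.mean(axis=(1, 2))
    p1 = p1 / (np.linalg.norm(p1) + 1e-8)
    p2 = p2 / (np.linalg.norm(p2) + 1e-8)
    return 1.0 - float(p1 @ p2)

def warp(feat, flow):
    """Nearest-neighbour backward warp of feat (C, H, W) by flow (2, H, W)."""
    C, H, W = feat.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_y = np.clip(ys + np.rint(flow[1]).astype(int), 0, H - 1)
    src_x = np.clip(xs + np.rint(flow[0]).astype(int), 0, W - 1)
    return feat[:, src_y, src_x]

def dense_equivariance_loss(f_t, f_t1, flow):
    """Features at time t, warped by optical flow, should match features at t+1."""
    return float(np.mean((warp(f_t, flow) - f_t1) ** 2))

def combined_loss(f_t, f_t1, flow, lam=1.0):
    """Unified objective: pooled invariance plus flow-equivariant dense term."""
    return pooled_invariance_loss(f_t, f_t1) + lam * dense_equivariance_loss(f_t, f_t1, flow)
```

With identical feature maps and zero flow, both terms vanish, which is the fixed point the objective pulls toward.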
Related papers
- Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z)
- Mitigating Object Dependencies: Improving Point Cloud Self-Supervised Learning through Object Exchange [50.45953583802282]
We introduce a novel self-supervised learning (SSL) strategy for point cloud scene understanding.
Our approach leverages both object patterns and contextual cues to produce robust features.
Our experiments demonstrate the superiority of our method over existing SSL techniques.
arXiv Detail & Related papers (2024-04-11T06:39:53Z)
- VILLS -- Video-Image Learning to Learn Semantics for Person Re-Identification [51.89551385538251]
We propose VILLS (Video-Image Learning to Learn Semantics), a self-supervised method that jointly learns spatial and temporal features from images and videos.
VILLS first designs a local semantic extraction module that adaptively extracts semantically consistent and robust spatial features.
Then, VILLS designs a unified feature learning and adaptation module to represent image and video modalities in a consistent feature space.
arXiv Detail & Related papers (2023-11-27T19:30:30Z)
- MC-JEPA: A Joint-Embedding Predictive Architecture for Self-Supervised Learning of Motion and Content Features [34.92750644059916]
We introduce MC-JEPA, a joint-embedding predictive architecture and self-supervised learning approach to jointly learn optical flow and content features within a shared encoder.
The proposed approach achieves performance on-par with existing unsupervised optical flow benchmarks.
arXiv Detail & Related papers (2023-07-24T11:27:14Z)
- Shepherding Slots to Objects: Towards Stable and Robust Object-Centric Learning [28.368429312400885]
Single-view images carry less information about how to disentangle a given scene than videos or multi-view images do.
We introduce a novel OCL framework for single-view images, SLot Attention via SHepherding (SLASH), which consists of two simple-yet-effective modules on top of Slot Attention.
Our proposed method enables consistent learning of object-centric representation and achieves strong performance across four datasets.
arXiv Detail & Related papers (2023-03-31T07:07:29Z)
- De-coupling and De-positioning Dense Self-supervised Learning [65.56679416475943]
Dense Self-Supervised Learning (SSL) methods address the limitations of using image-level feature representations when handling images with multiple objects.
We show that they suffer from coupling and positional bias, which arise from the receptive field increasing with layer depth and zero-padding.
We demonstrate the benefits of our method on COCO and on a new challenging benchmark, OpenImage-MINI, for object classification, semantic segmentation, and object detection.
arXiv Detail & Related papers (2023-03-29T18:07:25Z)
- Towards Self-Supervised Learning of Global and Object-Centric Representations [4.36572039512405]
We discuss key aspects of learning structured object-centric representations with self-supervision.
We validate our insights through several experiments on the CLEVR dataset.
arXiv Detail & Related papers (2022-03-11T15:18:47Z)
- Semi-TCL: Semi-Supervised Track Contrastive Representation Learning [40.31083437957288]
We design a new instance-to-track matching objective to learn appearance embedding.
It compares a candidate detection to the embedding of the tracks persisted in the tracker.
We implement this learning objective in a unified form following the spirit of contrastive loss.
arXiv Detail & Related papers (2021-07-06T05:23:30Z)
- Self-Supervised Representation Learning from Flow Equivariance [97.13056332559526]
We present a new self-supervised learning representation framework that can be directly deployed on a video stream of complex scenes.
Our representations, learned from high-resolution raw video, can be readily used for downstream tasks on static images.
arXiv Detail & Related papers (2021-01-16T23:44:09Z)
- Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.