Self-Supervised Learning from Images with a Joint-Embedding Predictive
Architecture
- URL: http://arxiv.org/abs/2301.08243v3
- Date: Thu, 13 Apr 2023 17:59:37 GMT
- Title: Self-Supervised Learning from Images with a Joint-Embedding Predictive
Architecture
- Authors: Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal
Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas
- Abstract summary: This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations.
We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images.
- Score: 43.83887661156133
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper demonstrates an approach for learning highly semantic image
representations without relying on hand-crafted data-augmentations. We
introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a
non-generative approach for self-supervised learning from images. The idea
behind I-JEPA is simple: from a single context block, predict the
representations of various target blocks in the same image. A core design
choice to guide I-JEPA towards producing semantic representations is the
masking strategy; specifically, it is crucial to (a) sample target blocks with
sufficiently large scale (semantic), and to (b) use a sufficiently informative
(spatially distributed) context block. Empirically, when combined with Vision
Transformers, we find I-JEPA to be highly scalable. For instance, we train a
ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong
downstream performance across a wide range of tasks, from linear classification
to object counting and depth prediction.
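The predictive objective lends itself to a short sketch. Below is a minimal, illustrative PyTorch rendering of the idea in the abstract, not the authors' implementation: tiny stand-in transformers replace the ViTs, all dimensions are toy values, and block sampling, the EMA update of the target encoder, and the removal of context/target overlap are simplified.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

GRID, DIM = 14, 192                      # 14x14 patch grid, toy embedding width
N = GRID * GRID

def tiny_vit(depth):
    # Stand-in for the paper's ViT encoders and predictor
    layer = nn.TransformerEncoderLayer(DIM, nhead=4, dim_feedforward=4 * DIM,
                                       batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

context_enc = tiny_vit(2)
target_enc = tiny_vit(2)                 # in the paper: an EMA copy of context_enc
target_enc.load_state_dict(context_enc.state_dict())
for p in target_enc.parameters():
    p.requires_grad_(False)
predictor = tiny_vit(1)
pos = nn.Parameter(torch.randn(N, DIM) * 0.02)   # learned positional embeddings
mask_token = nn.Parameter(torch.zeros(DIM))      # query token for missing patches

def ijepa_loss(patches, ctx_idx, tgt_blocks):
    """patches: (B, N, DIM) linearly embedded image patches.
    ctx_idx: indices of the (spatially distributed) context block.
    tgt_blocks: list of index tensors, one per sampled target block."""
    x = patches + pos
    with torch.no_grad():                # targets come from the frozen/EMA encoder
        tgt_repr = target_enc(x)
    ctx_repr = context_enc(x[:, ctx_idx])          # encode only the context block
    loss = 0.0
    for blk in tgt_blocks:
        # mask tokens carrying the positions of the queried target block
        q = (mask_token + pos[blk]).expand(patches.size(0), -1, -1)
        pred = predictor(torch.cat([ctx_repr, q], dim=1))[:, -blk.numel():]
        loss = loss + F.mse_loss(pred, tgt_repr[:, blk])   # average L2 objective
    return loss / len(tgt_blocks)

# Toy usage: one large context block, two smaller target blocks
patches = torch.randn(8, N, DIM)
ctx = torch.arange(0, 120)
targets = [torch.arange(130, 150), torch.arange(160, 180)]
print(ijepa_loss(patches, ctx, targets).item())
```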
Related papers
- PixelHacker: Image Inpainting with Structural and Semantic Consistency [28.984953143157107]
Inpainting is a fundamental research area at the intersection of image editing and image generation.
Recent state-of-the-art (SOTA) methods have explored novel attention mechanisms, lightweight architectures, and context-aware modeling.
We design a simple yet effective inpainting paradigm called latent categories guidance, and propose a diffusion-based model named PixelHacker.
arXiv Detail & Related papers (2025-04-29T05:28:36Z)
- DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks [51.439283251703635]
We create a generalist perception model that can tackle multiple tasks under limited computational resources and training data.
Our exhaustive evaluation metrics demonstrate that DICEPTION effectively tackles multiple perception tasks, achieving performance on par with state-of-the-art models.
We show that the strategy of assigning random colors to different instances is highly effective in both entity segmentation and semantic segmentation; a minimal sketch of this color-coding appears after this list.
arXiv Detail & Related papers (2025-02-24T13:51:06Z)
- Leveraging Representations from Intermediate Encoder-blocks for Synthetic Image Detection [13.840950434728533]
State-of-the-art Synthetic Image Detection (SID) research has led to strong evidence on the advantages of feature extraction from foundation models.
We leverage the image representations extracted by intermediate Transformer blocks of CLIP's image-encoder via a lightweight network; a sketch of this intermediate-block pooling also appears after this list.
Our method is compared against the state-of-the-art by evaluating it on 20 test datasets and exhibits an average +10.6% absolute performance improvement.
arXiv Detail & Related papers (2024-02-29T12:18:43Z)
- Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects.
In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL).
A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z)
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used to mitigate the data-hungry nature of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- Modeling Image Composition for Complex Scene Generation [77.10533862854706]
We present a method that achieves state-of-the-art results on layout-to-image generation tasks.
After compressing RGB images into patch tokens, we propose the Transformer with Focal Attention (TwFA) for exploring dependencies of object-to-object, object-to-patch and patch-to-patch.
arXiv Detail & Related papers (2022-06-02T08:34:25Z)
- AugNet: End-to-End Unsupervised Visual Representation Learning with Image Augmentation [3.6790362352712873]
We propose AugNet, a new deep learning training paradigm to learn image features from a collection of unlabeled pictures.
Our experiments demonstrate that the method is able to represent images in a low-dimensional space.
Unlike many deep-learning-based image retrieval algorithms, our approach does not require access to external annotated datasets.
arXiv Detail & Related papers (2021-06-11T09:02:30Z)
- Semantic Segmentation with Generative Models: Semi-Supervised Learning and Strong Out-of-Domain Generalization [112.68171734288237]
We propose a novel framework for discriminative pixel-level tasks using a generative model of both images and labels.
We learn a generative adversarial network that captures the joint image-label distribution and is trained efficiently using a large set of unlabeled images.
We demonstrate strong in-domain performance compared to several baselines, and are the first to showcase extreme out-of-domain generalization.
arXiv Detail & Related papers (2021-04-12T21:41:25Z)
- Group-Wise Semantic Mining for Weakly Supervised Semantic Segmentation [49.90178055521207]
This work addresses weakly supervised semantic segmentation (WSSS), with the goal of bridging the gap between image-level annotations and pixel-level segmentation.
We formulate WSSS as a novel group-wise learning task that explicitly models semantic dependencies in a group of images to estimate more reliable pseudo ground-truths.
In particular, we devise a graph neural network (GNN) for group-wise semantic mining, wherein input images are represented as graph nodes.
arXiv Detail & Related papers (2020-12-09T12:40:13Z)
- Object-Based Image Coding: A Learning-Driven Revisit [30.550019759674477]
A fundamental underlying issue is how to efficiently process arbitrary-shaped objects at a fine granularity.
We have proposed an object segmentation network for image layer decomposition, and parallel convolution-based neural image compression networks to process masked foreground objects and background scene separately.
All components are optimized in an end-to-end learning framework to intelligently weigh their contributions for visually pleasant reconstruction.
arXiv Detail & Related papers (2020-03-18T04:00:17Z)
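The DICEPTION entry above reports that assigning random colors to different instances works well for segmentation. Here is a minimal sketch of that color-coding step, assuming instances arrive as an integer ID map; the function name and I/O format are illustrative choices, not taken from the paper.

```python
import numpy as np

def color_code_instances(instance_map, seed=None):
    """Render an (H, W) integer instance-ID map (0 = background) as an
    (H, W, 3) RGB image, giving each instance a random color."""
    rng = np.random.default_rng(seed)
    rgb = np.zeros((*instance_map.shape, 3), dtype=np.uint8)
    for inst_id in np.unique(instance_map):
        if inst_id == 0:                 # keep the background black
            continue
        rgb[instance_map == inst_id] = rng.integers(0, 256, size=3)
    return rgb

# Toy usage: two rectangular "instances" on a 64x64 canvas
ids = np.zeros((64, 64), dtype=np.int64)
ids[8:24, 8:24], ids[32:56, 32:56] = 1, 2
print(color_code_instances(ids, seed=0).shape)   # (64, 64, 3)
```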
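The synthetic-image-detection entry above pools features from intermediate CLIP encoder blocks through a lightweight network. The sketch below renders that idea with Hugging Face's transformers CLIP implementation; the softmax layer-weighting and the small MLP head are simplified guesses for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

class IntermediateCLIPDetector(nn.Module):
    def __init__(self, name="openai/clip-vit-base-patch32"):
        super().__init__()
        self.clip = CLIPVisionModel.from_pretrained(name).eval()
        for p in self.clip.parameters():       # frozen foundation model
            p.requires_grad_(False)
        n_blocks = self.clip.config.num_hidden_layers
        dim = self.clip.config.hidden_size
        self.block_weights = nn.Parameter(torch.zeros(n_blocks))
        self.head = nn.Sequential(             # the "lightweight network"
            nn.LayerNorm(dim), nn.Linear(dim, 128), nn.GELU(),
            nn.Linear(128, 1))

    def forward(self, pixel_values):
        with torch.no_grad():
            out = self.clip(pixel_values=pixel_values,
                            output_hidden_states=True)
        # hidden_states[1:] holds one (B, tokens, dim) tensor per encoder
        # block; keep the CLS token from each block
        cls = torch.stack([h[:, 0] for h in out.hidden_states[1:]], dim=1)
        w = self.block_weights.softmax(dim=0)          # learned block pooling
        pooled = (w[None, :, None] * cls).sum(dim=1)   # (B, dim)
        return self.head(pooled).squeeze(-1)           # real/fake logit

detector = IntermediateCLIPDetector()
logits = detector(torch.randn(2, 3, 224, 224))         # dummy batch
print(logits.shape)                                    # torch.Size([2])
```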