Enhancing JEPAs with Spatial Conditioning: Robust and Efficient Representation Learning
- URL: http://arxiv.org/abs/2410.10773v1
- Date: Mon, 14 Oct 2024 17:46:24 GMT
- Title: Enhancing JEPAs with Spatial Conditioning: Robust and Efficient Representation Learning
- Authors: Etai Littwin, Vimal Thilak, Anand Gopalakrishnan,
- Abstract summary: Image-based Joint-Embedding Predictive Architecture (IJEPA) offers an attractive alternative to Masked Autoencoder (MAE).
IJEPA drives representations to capture useful semantic information by predicting in latent rather than input space.
Our "conditional" encoders show performance gains on several image classification benchmark datasets.
- Score: 7.083341587100975
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-based Joint-Embedding Predictive Architecture (IJEPA) offers an attractive alternative to Masked Autoencoder (MAE) for representation learning using the Masked Image Modeling framework. IJEPA drives representations to capture useful semantic information by predicting in latent rather than input space. However, IJEPA relies on carefully designed context and target windows to avoid representational collapse. The encoder modules in IJEPA cannot adaptively modulate the type of predicted and/or target features based on the feasibility of the masked prediction task, as they are not given sufficient information about both context and targets. Based on the intuition that, in natural images, information has a strong spatial bias, with spatially local regions being more predictive of one another than distant ones, we condition the target encoder and context encoder modules in IJEPA on the positions of the context and target windows, respectively. Our "conditional" encoders show performance gains on several image classification benchmark datasets, as well as improved robustness to context window size and better sample efficiency during pretraining.
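The abstract does not say how the window positions are fed to the encoders, so the following is only a minimal sketch of one possible scheme, not the authors' implementation. It assumes flattened patch indices, a standard 1-D sine/cosine position embedding, and conditioning by appending position-only tokens for the other window to an encoder's input. The helper names `sincos_position_embedding` and `build_conditioned_input` are hypothetical.

```python
import numpy as np


def sincos_position_embedding(positions, dim):
    """1-D sine/cosine embedding of integer patch indices (dim must be even)."""
    positions = np.asarray(positions, dtype=np.float64)[:, None]
    freqs = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))
    angles = positions * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)


def build_conditioned_input(patch_tokens, own_positions, other_positions):
    """Assumed conditioning scheme: add position embeddings to the encoder's
    own patch tokens, then append position-only tokens that describe the
    other window (target positions for the context encoder, and context
    positions for the target encoder)."""
    dim = patch_tokens.shape[1]
    tokens = patch_tokens + sincos_position_embedding(own_positions, dim)
    condition = sincos_position_embedding(other_positions, dim)
    return np.concatenate([tokens, condition], axis=0)


# Toy example: 4 context patches, 2 target patches, embedding width 8.
rng = np.random.default_rng(0)
dim = 8
context_idx, target_idx = [0, 1, 2, 3], [10, 11]
context_tokens = rng.normal(size=(len(context_idx), dim))
target_tokens = rng.normal(size=(len(target_idx), dim))

# Context encoder input is conditioned on where the targets are.
context_input = build_conditioned_input(context_tokens, context_idx, target_idx)
# Target encoder input is conditioned on where the context is.
target_input = build_conditioned_input(target_tokens, target_idx, context_idx)
```

In this sketch each encoder sees only the location of the other window, never its content, so the prediction task itself is unchanged while the encoder can account for how far apart the two windows are.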
Related papers
- AgMTR: Agent Mining Transformer for Few-shot Segmentation in Remote Sensing [12.91626624625134]
Few-shot Segmentation (FSS) aims to segment objects of interest in the query image using just a handful of labeled samples (i.e., support images).
Previous schemes leverage the similarity between support-query pixel pairs to construct pixel-level semantic correlation.
In remote sensing scenarios with extreme intra-class variations and cluttered backgrounds, such pixel-level correlations may produce severe mismatches.
We propose a novel Agent Mining Transformer (AgMTR), which adaptively mines a set of local-aware agents to construct agent-level semantic correlation.
arXiv Detail & Related papers (2024-09-26T01:12:01Z)
- OPUS: Occupancy Prediction Using a Sparse Set [64.60854562502523]
We present a framework to simultaneously predict occupied locations and classes using a set of learnable queries.
OPUS incorporates a suite of non-trivial strategies to enhance model performance.
Our lightest model achieves superior RayIoU on the Occ3D-nuScenes dataset at near 2x FPS, while our heaviest model surpasses previous best results by 6.1 RayIoU.
arXiv Detail & Related papers (2024-09-14T07:44:22Z)
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z)
- How JEPA Avoids Noisy Features: The Implicit Bias of Deep Linear Self Distillation Networks [14.338754598043968]
Two competing paradigms exist for self-supervised learning of data representations.
Joint Embedding Predictive Architecture (JEPA) is a class of architectures in which semantically similar inputs are encoded into representations that are predictive of each other.
arXiv Detail & Related papers (2024-07-03T19:43:12Z)
- DMT-JEPA: Discriminative Masked Targets for Joint-Embedding Predictive Architecture [18.578689440216774]
We introduce DMT-JEPA, a novel masked modeling objective rooted in JEPA.
Our key idea is simple: we consider a set of semantically similar neighboring patches as a target of a masked patch.
DMT-JEPA demonstrates strong discriminative power, offering benefits across a diverse spectrum of downstream tasks.
arXiv Detail & Related papers (2024-05-28T09:28:52Z)
- A-JEPA: Joint-Embedding Predictive Architecture Can Listen [35.308323314848735]
We introduce Audio-based Joint-Embedding Predictive Architecture (A-JEPA), a simple extension method for self-supervised learning from the audio spectrum.
A-JEPA encodes visible audio spectrogram patches with a curriculum masking strategy via context encoder, and predicts the representations of regions sampled at well-designed locations.
arXiv Detail & Related papers (2023-11-27T13:53:53Z)
- Interpretable Spectral Variational AutoEncoder (ISVAE) for time series clustering [48.0650332513417]
We introduce a novel model that incorporates an interpretable bottleneck, termed the Filter Bank (FB), at the outset of a Variational Autoencoder (VAE).
This arrangement compels the VAE to attend to the most informative segments of the input signal.
By deliberately constraining the VAE with this FB, we promote the development of an encoding that is discernible, separable, and of reduced dimensionality.
arXiv Detail & Related papers (2023-10-18T13:06:05Z)
- LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts [107.11267074981905]
We propose a semantically controllable layout-AWare diffusion model, termed LAW-Diffusion.
We show that LAW-Diffusion yields state-of-the-art generative performance, especially with coherent object relations.
arXiv Detail & Related papers (2023-08-13T08:06:18Z)
- Multitask AET with Orthogonal Tangent Regularity for Dark Object Detection [84.52197307286681]
We propose a novel multitask auto encoding transformation (MAET) model to enhance object detection in a dark environment.
In a self-supervised manner, the MAET learns the intrinsic visual structure by encoding and decoding the realistic illumination-degrading transformation.
We have achieved state-of-the-art performance using synthetic and real-world datasets.
arXiv Detail & Related papers (2022-05-06T16:27:14Z)
- Momentum Contrastive Autoencoder: Using Contrastive Learning for Latent Space Distribution Matching in WAE [51.09507030387935]
Wasserstein autoencoder (WAE) shows that matching two distributions is equivalent to minimizing a simple autoencoder (AE) loss under the constraint that the latent space of this AE matches a pre-specified prior distribution.
We propose to use the contrastive learning framework that has been shown to be effective for self-supervised representation learning, as a means to resolve this problem.
We show that using the contrastive learning framework to optimize the WAE loss achieves faster convergence and more stable optimization compared with existing popular algorithms for WAE.
arXiv Detail & Related papers (2021-10-19T22:55:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.