DMT-JEPA: Discriminative Masked Targets for Joint-Embedding Predictive Architecture
- URL: http://arxiv.org/abs/2405.17995v1
- Date: Tue, 28 May 2024 09:28:52 GMT
- Title: DMT-JEPA: Discriminative Masked Targets for Joint-Embedding Predictive Architecture
- Authors: Shentong Mo, Sukmin Yun,
- Abstract summary: We introduce DMT-JEPA, a novel masked modeling objective rooted in JEPA.
Our key idea is simple: we consider a set of semantically similar neighboring patches as a target of a masked patch.
DMT-JEPA demonstrates strong discriminative power, offering benefits across a diverse spectrum of downstream tasks.
- Score: 18.578689440216774
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The joint-embedding predictive architecture (JEPA) recently has shown impressive results in extracting visual representations from unlabeled imagery under a masking strategy. However, we reveal its disadvantages, notably its insufficient understanding of local semantics. This deficiency originates from masked modeling in the embedding space, resulting in a reduction of discriminative power and can even lead to the neglect of critical local semantics. To bridge this gap, we introduce DMT-JEPA, a novel masked modeling objective rooted in JEPA, specifically designed to generate discriminative latent targets from neighboring information. Our key idea is simple: we consider a set of semantically similar neighboring patches as a target of a masked patch. To be specific, the proposed DMT-JEPA (a) computes feature similarities between each masked patch and its corresponding neighboring patches to select patches having semantically meaningful relations, and (b) employs lightweight cross-attention heads to aggregate features of neighboring patches as the masked targets. Consequently, DMT-JEPA demonstrates strong discriminative power, offering benefits across a diverse spectrum of downstream tasks. Through extensive experiments, we demonstrate our effectiveness across various visual benchmarks, including ImageNet-1K image classification, ADE20K semantic segmentation, and COCO object detection tasks. Code is available at: \url{https://github.com/DMTJEPA/DMTJEPA}.
Related papers
- Enhancing JEPAs with Spatial Conditioning: Robust and Efficient Representation Learning [7.083341587100975]
Image-based Joint-Embedding Predictive Architecture (IJEPA) offers an attractive alternative to Masked Autoencoder (MAE)
IJEPA drives representations to capture useful semantic information by predicting in latent rather than input space.
Our "conditional" encoders show performance gains on several image classification benchmark datasets.
arXiv Detail & Related papers (2024-10-14T17:46:24Z) - How JEPA Avoids Noisy Features: The Implicit Bias of Deep Linear Self Distillation Networks [14.338754598043968]
Two competing paradigms exist for self-supervised learning of data representations.
Joint Embedding Predictive Architecture (JEPA) is a class of architectures in which semantically similar inputs are encoded into representations that are predictive of each other.
arXiv Detail & Related papers (2024-07-03T19:43:12Z) - Continual-MAE: Adaptive Distribution Masked Autoencoders for Continual Test-Time Adaptation [49.827306773992376]
Continual Test-Time Adaptation (CTTA) is proposed to migrate a source pre-trained model to continually changing target distributions.
Our proposed method attains state-of-the-art performance in both classification and segmentation CTTA tasks.
arXiv Detail & Related papers (2023-12-19T15:34:52Z) - Adaptive Face Recognition Using Adversarial Information Network [57.29464116557734]
Face recognition models often degenerate when training data are different from testing data.
We propose a novel adversarial information network (AIN) to address it.
arXiv Detail & Related papers (2023-05-23T02:14:11Z) - Masked Collaborative Contrast for Weakly Supervised Semantic
Segmentation [22.74105261883464]
Masked Collaborative Contrast (MCC) to highlight semantic regions in weakly supervised semantic segmentation.
MCC adroitly draws inspiration from masked image modeling and contrastive learning to devise a novel framework that induces keys to contract toward semantic regions.
arXiv Detail & Related papers (2023-05-15T09:46:28Z) - MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation [104.40114562948428]
In unsupervised domain adaptation (UDA), a model trained on source data (e.g. synthetic) is adapted to target data (e.g. real-world) without access to target annotation.
We propose a Masked Image Consistency (MIC) module to enhance UDA by learning spatial context relations of the target domain.
MIC significantly improves the state-of-the-art performance across the different recognition tasks for synthetic-to-real, day-to-nighttime, and clear-to-adverse-weather UDA.
arXiv Detail & Related papers (2022-12-02T17:29:32Z) - ADPS: Asymmetric Distillation Post-Segmentation for Image Anomaly
Detection [75.68023968735523]
Knowledge Distillation-based Anomaly Detection (KDAD) methods rely on the teacher-student paradigm to detect and segment anomalous regions.
We propose an innovative approach called Asymmetric Distillation Post-Segmentation (ADPS)
Our ADPS employs an asymmetric distillation paradigm that takes distinct forms of the same image as the input of the teacher-student networks.
We show that ADPS significantly improves Average Precision (AP) metric by 9% and 20% on the MVTec AD and KolektorSDD2 datasets.
arXiv Detail & Related papers (2022-10-19T12:04:47Z) - Beyond the Prototype: Divide-and-conquer Proxies for Few-shot
Segmentation [63.910211095033596]
Few-shot segmentation aims to segment unseen-class objects given only a handful of densely labeled samples.
We propose a simple yet versatile framework in the spirit of divide-and-conquer.
Our proposed approach, named divide-and-conquer proxies (DCP), allows for the development of appropriate and reliable information.
arXiv Detail & Related papers (2022-04-21T06:21:14Z) - Semi-Supervised Domain Adaptation with Prototypical Alignment and
Consistency Learning [86.6929930921905]
This paper studies how much it can help address domain shifts if we further have a few target samples labeled.
To explore the full potential of landmarks, we incorporate a prototypical alignment (PA) module which calculates a target prototype for each class from the landmarks.
Specifically, we severely perturb the labeled images, making PA non-trivial to achieve and thus promoting model generalizability.
arXiv Detail & Related papers (2021-04-19T08:46:08Z) - Adapt Everywhere: Unsupervised Adaptation of Point-Clouds and Entropy
Minimisation for Multi-modal Cardiac Image Segmentation [10.417009344120917]
We present a novel UDA method for multi-modal cardiac image segmentation.
The proposed method is based on adversarial learning and adapts network features between source and target domain in different spaces.
We validated our method on two cardiac datasets by adapting from the annotated source domain to the unannotated target domain.
arXiv Detail & Related papers (2021-03-15T08:59:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.