Joint-Embedding Predictive Architecture for Self-Supervised Learning of Mask Classification Architecture
- URL: http://arxiv.org/abs/2407.10733v1
- Date: Mon, 15 Jul 2024 14:01:03 GMT
- Title: Joint-Embedding Predictive Architecture for Self-Supervised Learning of Mask Classification Architecture
- Authors: Dong-Hee Kim, Sungduk Cho, Hyeonwoo Cho, Chanmin Park, Jinyoung Kim, Won Hwa Kim
- Abstract summary: We introduce Mask-JEPA, a self-supervised learning framework tailored for mask classification architectures (MCA).
Mask-JEPA combines a Joint Embedding Predictive Architecture with MCA to adeptly capture intricate semantics and precise object boundaries.
Our approach addresses two critical challenges in self-supervised learning: 1) extracting comprehensive representations for universal image segmentation from a pixel decoder, and 2) effectively training the transformer decoder.
- Score: 5.872289712903129
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we introduce Mask-JEPA, a self-supervised learning framework tailored for mask classification architectures (MCA), to overcome the traditional constraints associated with training segmentation models. Mask-JEPA combines a Joint Embedding Predictive Architecture with MCA to adeptly capture intricate semantics and precise object boundaries. Our approach addresses two critical challenges in self-supervised learning: 1) extracting comprehensive representations for universal image segmentation from a pixel decoder, and 2) effectively training the transformer decoder. The use of the transformer decoder as a predictor within the JEPA framework allows proficient training in universal image segmentation tasks. Through rigorous evaluations on datasets such as ADE20K, Cityscapes and COCO, Mask-JEPA demonstrates not only competitive results but also exceptional adaptability and robustness across various training scenarios. The architecture-agnostic nature of Mask-JEPA further underscores its versatility, allowing seamless adaptation to various mask classification architectures.
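The abstract outlines the mechanism at a high level: a context encoder (fed from the pixel decoder) sees only visible regions, a target encoder embeds the masked regions, and the transformer decoder of the mask classification architecture serves as the JEPA predictor. The following PyTorch sketch is a minimal illustration of that pattern, not the authors' implementation: the linear encoder stand-ins, the frozen (in practice EMA-updated) target encoder, the query pooling, and the smooth-L1 loss are all assumptions.

```python
# Minimal, illustrative sketch of the JEPA pattern described in the abstract;
# NOT the authors' code. Sizes, encoders, pooling, and the loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMaskJEPA(nn.Module):
    def __init__(self, dim: int = 256, num_queries: int = 16):
        super().__init__()
        # Stand-ins for pixel-decoder features: a trainable context encoder
        # and a frozen target encoder (JEPA-style, typically EMA-updated).
        self.context_encoder = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.target_encoder = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        for p in self.target_encoder.parameters():
            p.requires_grad = False  # no gradients; an EMA copy in practice
        # The transformer decoder doubles as the JEPA predictor: learnable
        # queries attend to visible context features and predict embeddings
        # of the masked regions.
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.predictor = nn.TransformerDecoder(layer, num_layers=2)
        self.queries = nn.Parameter(torch.randn(num_queries, dim))

    def forward(self, visible_tokens, masked_tokens):
        ctx = self.context_encoder(visible_tokens)    # (B, N_vis, D)
        with torch.no_grad():
            tgt = self.target_encoder(masked_tokens)  # (B, N_mask, D)
        q = self.queries.unsqueeze(0).expand(ctx.size(0), -1, -1)
        pred = self.predictor(q, ctx)                 # (B, Q, D)
        # Regress a pooled target embedding; the real objective presumably
        # uses richer per-region targets.
        return F.smooth_l1_loss(pred.mean(dim=1), tgt.mean(dim=1))

# Usage: random features stand in for pixel-decoder tokens.
model = ToyMaskJEPA()
loss = model(torch.randn(2, 64, 256), torch.randn(2, 16, 256))
loss.backward()
```

The point the abstract stresses is the reuse: because the predictor is the same transformer decoder used downstream, pretraining exercises exactly the module that mask classification architectures otherwise train from scratch.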
Related papers
- Polyline Path Masked Attention for Vision Transformer [48.25001539205017]
Vision Transformers (ViTs) have achieved remarkable success in computer vision.
Mamba2 has demonstrated its significant potential in natural language processing tasks.
We propose Polyline Path Masked Attention (PPMA) that integrates the self-attention mechanism of ViTs with an enhanced structured mask of Mamba2.
arXiv Detail & Related papers (2025-06-19T00:52:30Z)
- The Missing Point in Vision Transformers for Universal Image Segmentation [17.571552686063335]
We introduce ViT-P, a two-stage segmentation framework that decouples mask generation from classification.
ViT-P serves as a pre-training-free adapter, allowing the integration of various pre-trained vision transformers.
Experiments across COCO, ADE20K, and Cityscapes datasets validate the effectiveness of ViT-P.
arXiv Detail & Related papers (2025-05-26T10:29:13Z)
- Evolved Hierarchical Masking for Self-Supervised Learning [49.77271430882176]
Existing Masked Image Modeling methods apply fixed mask patterns to guide the self-supervised training.
This paper introduces an evolved hierarchical masking method to pursue general visual cues modeling in self-supervised learning.
arXiv Detail & Related papers (2025-04-12T09:40:14Z)
- Prior2Former -- Evidential Modeling of Mask Transformers for Assumption-Free Open-World Panoptic Segmentation [74.55677741919035]
We propose Prior2Former (P2F), the first approach for segmentation vision transformers rooted in evidential learning.
P2F extends the mask vision transformer architecture by incorporating a Beta prior for computing model uncertainty in pixel-wise binary mask assignments.
Unlike most segmentation models addressing unknown classes, P2F operates without access to OOD data samples or contrastive training on void (i.e., unlabeled) classes.
arXiv Detail & Related papers (2025-04-07T08:53:14Z)
- Towards Natural Image Matting in the Wild via Real-Scenario Prior [69.96414467916863]
We propose a new matting dataset based on the COCO dataset, namely COCO-Matting.
The built COCO-Matting comprises an extensive collection of 38,251 human instance-level alpha mattes in complex natural scenarios.
For network architecture, the proposed feature-aligned transformer learns to extract fine-grained edge and transparency features.
The proposed matte-aligned decoder aims to segment matting-specific objects and convert coarse masks into high-precision mattes.
arXiv Detail & Related papers (2024-10-09T06:43:19Z)
- Semantic Refocused Tuning for Open-Vocabulary Panoptic Segmentation [42.020470627552136]
Open-vocabulary panoptic segmentation is an emerging task aiming to accurately segment the image into semantically meaningful masks.
Mask classification is the main performance bottleneck for open-vocabulary panoptic segmentation.
We propose Semantic Refocused Tuning, a novel framework that greatly enhances open-vocabulary panoptic segmentation.
arXiv Detail & Related papers (2024-09-24T17:50:28Z)
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z)
- Pseudo Labelling for Enhanced Masked Autoencoders [27.029542823306866]
We propose an enhanced approach that boosts Masked Autoencoders (MAE) performance by integrating pseudo labelling for both class and data tokens.
This strategy uses cluster assignments as pseudo labels to promote instance-level discrimination within the network.
We show that incorporating pseudo-labelling as an auxiliary task yields notable improvements on ImageNet-1K and other downstream tasks.
arXiv Detail & Related papers (2024-06-25T10:41:45Z)
- Understanding Masked Autoencoders From a Local Contrastive Perspective [80.57196495601826]
Masked AutoEncoder (MAE) has revolutionized the field of self-supervised learning with its simple yet effective masking and reconstruction strategies.
We introduce a new empirical framework, called Local Contrastive MAE, to analyze both reconstructive and contrastive aspects of MAE.
arXiv Detail & Related papers (2023-10-03T12:08:15Z)
- CL-MAE: Curriculum-Learned Masked Autoencoders [49.24994655813455]
We propose a curriculum learning approach that updates the masking strategy to continually increase the complexity of the self-supervised reconstruction task.
We train our Curriculum-Learned Masked Autoencoder (CL-MAE) on ImageNet and show that it exhibits superior representation learning capabilities compared to MAE. A minimal scheduling sketch of this idea appears after this list.
arXiv Detail & Related papers (2023-08-31T09:13:30Z)
- i-MAE: Are Latent Representations in Masked Autoencoders Linearly Separable? [26.146459754995597]
Masked image modeling (MIM) has been recognized as a strong self-supervised pre-training approach in the vision domain.
This paper aims to explore an interactive Masked Autoencoders (i-MAE) framework to enhance the representation capability.
In addition to qualitatively analyzing the characteristics of the latent representations, we examine the existence of linear separability and the degree of semantics in the latent space.
arXiv Detail & Related papers (2022-10-20T17:59:54Z)
- Exploiting Shape Cues for Weakly Supervised Semantic Segmentation [15.791415215216029]
Weakly supervised semantic segmentation (WSSS) aims to produce pixel-wise class predictions with only image-level labels for training.
We propose to exploit shape information to supplement the texture-biased property of convolutional neural networks (CNNs).
We further refine the predictions in an online fashion with a novel refinement method that takes into account both the class and the color affinities.
arXiv Detail & Related papers (2022-08-08T17:25:31Z)
- Open-World Instance Segmentation: Exploiting Pseudo Ground Truth From Learned Pairwise Affinity [59.1823948436411]
We propose a novel approach for mask proposals, Generic Grouping Networks (GGNs).
Our approach combines a local measure of pixel affinity with instance-level mask supervision, producing a training regimen designed to make the model as generic as the data diversity allows.
arXiv Detail & Related papers (2022-04-12T22:37:49Z)
- Self-Supervised Visual Representations Learning by Contrastive Mask Prediction [129.25459808288025]
We propose a novel contrastive mask prediction (CMP) task for visual representation learning.
MaskCo contrasts region-level features instead of view-level features, which makes it possible to identify the positive sample without any assumptions.
We evaluate MaskCo on training datasets beyond ImageNet and compare its performance with MoCo V2.
arXiv Detail & Related papers (2021-08-18T02:50:33Z)
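As referenced above, here is a minimal sketch of the curriculum-masking idea from CL-MAE, assuming a simple linear ramp of the mask ratio over training; the paper itself learns the masking strategy with a trainable module rather than following a fixed schedule, so this only illustrates the "increasing complexity" principle.

```python
# Hypothetical linear curriculum over the mask ratio; CL-MAE instead *learns*
# its masking strategy, so treat this schedule as an illustrative assumption.
import torch

def curriculum_mask(tokens: torch.Tensor, epoch: int, total_epochs: int,
                    start_ratio: float = 0.25, end_ratio: float = 0.75) -> torch.Tensor:
    """Return a boolean patch mask whose ratio grows as training progresses.

    tokens: (batch, num_patches, dim) patch embeddings.
    """
    t = epoch / max(total_epochs - 1, 1)
    ratio = start_ratio + t * (end_ratio - start_ratio)
    b, n, _ = tokens.shape
    num_masked = int(n * ratio)
    # Mask a random subset of patches per sample; the task gets harder
    # (more patches hidden) as the epoch index grows.
    idx = torch.rand(b, n).argsort(dim=1)[:, :num_masked]
    mask = torch.zeros(b, n, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    return mask

# Usage: early epochs hide 25% of patches, final epochs 75%.
mask = curriculum_mask(torch.randn(2, 196, 768), epoch=0, total_epochs=100)
```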
This list is automatically generated from the titles and abstracts of the papers on this site.