Uni4Eye: Unified 2D and 3D Self-supervised Pre-training via Masked Image
Modeling Transformer for Ophthalmic Image Classification
- URL: http://arxiv.org/abs/2203.04614v1
- Date: Wed, 9 Mar 2022 10:02:00 GMT
- Title: Uni4Eye: Unified 2D and 3D Self-supervised Pre-training via Masked Image
Modeling Transformer for Ophthalmic Image Classification
- Authors: Zhiyuan Cai and Huaqing He and Li Lin and Xiaoying Tang
- Abstract summary: We propose a universal self-supervised Transformer framework, named Uni4Eye, to capture domain-specific feature embedding in ophthalmic images.
Uni4Eye can serve as a global feature extractor, which builds its basis on a Masked Image Modeling task with a Vision Transformer architecture.
We employ a Unified Patch Embedding module to replace the original patch embedding module in ViT for jointly processing both 2D and 3D input images.
- Score: 1.2250035750661867
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A large-scale labeled dataset is a key factor for the success of supervised
deep learning in computer vision. However, annotated data are often scarce,
especially in ophthalmic image analysis, since manual annotation is
time-consuming and labor-intensive. Self-supervised learning (SSL) methods
offer great opportunities to better utilize unlabeled data, as they do not
require massive annotations. To exploit as many unlabeled ophthalmic images as
possible, it is necessary to break the dimension barrier and make use of both
2D and 3D images simultaneously. In this paper, we
propose a universal self-supervised Transformer framework, named Uni4Eye, to
discover the inherent image property and capture domain-specific feature
embedding in ophthalmic images. Uni4Eye can serve as a global feature
extractor, which builds its basis on a Masked Image Modeling task with a Vision
Transformer (ViT) architecture. We employ a Unified Patch Embedding module to
replace the original patch embedding module in ViT so that 2D and 3D input
images can be processed jointly. In addition, we design a dual-branch multitask
decoder module that simultaneously performs two reconstruction tasks, on the
input image and on its gradient map, yielding discriminative representations
for better convergence (a minimal sketch of both components follows this
abstract).
We evaluate the performance of our pre-trained Uni4Eye encoder by fine-tuning
it on six downstream ophthalmic image classification tasks. The superiority of
Uni4Eye is established through comparisons with other state-of-the-art SSL
pre-training methods.
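The two Uni4Eye-specific components described above lend themselves to a brief illustration. The following is a minimal PyTorch sketch, not the authors' released code: the class and parameter names (UnifiedPatchEmbedding, embed_dim, the patch sizes) and the Sobel operator used for the gradient-map target are assumptions chosen for illustration; the abstract only states that 2D and 3D inputs are patch-embedded for a shared ViT encoder and that a second decoder branch reconstructs a gradient map.

```python
# Hedged sketch of a unified 2D/3D patch embedding plus a gradient-map target.
# Names and hyperparameters are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnifiedPatchEmbedding(nn.Module):
    def __init__(self, embed_dim=768, patch_2d=16, patch_3d=(4, 16, 16)):
        super().__init__()
        # 2D branch: Conv2d patchify for single-channel 2D ophthalmic images
        self.proj_2d = nn.Conv2d(1, embed_dim, kernel_size=patch_2d, stride=patch_2d)
        # 3D branch: Conv3d patchify for volumetric scans (e.g. OCT volumes)
        self.proj_3d = nn.Conv3d(1, embed_dim, kernel_size=patch_3d, stride=patch_3d)

    def forward(self, x):
        # x: (B, 1, H, W) for 2D images or (B, 1, D, H, W) for 3D volumes
        if x.dim() == 4:
            tokens = self.proj_2d(x)            # (B, C, H/p, W/p)
        elif x.dim() == 5:
            tokens = self.proj_3d(x)            # (B, C, D/pd, H/p, W/p)
        else:
            raise ValueError("expected 4D (2D image) or 5D (3D volume) input")
        return tokens.flatten(2).transpose(1, 2)  # (B, N, C) token sequence


def sobel_gradient_map(img_2d):
    """Illustrative gradient-map target for the second decoder branch
    (the abstract does not specify the exact gradient operator)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(img_2d, kx, padding=1)
    gy = F.conv2d(img_2d, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)


# Usage: both dimensionalities yield token sequences of the same width,
# so a single ViT encoder can consume either.
upe = UnifiedPatchEmbedding()
t2d = upe(torch.randn(2, 1, 224, 224))      # -> (2, 196, 768)
t3d = upe(torch.randn(2, 1, 32, 224, 224))  # -> (2, 1568, 768)
```

Because both branches project into the same embedding width, one encoder (and one masked-reconstruction objective) can operate on 2D images and 3D volumes interchangeably, which is the dimension-barrier point made in the abstract.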
Related papers
- Learning Dense Feature Matching via Lifting Single 2D Image to 3D Space [17.603217168518356]
We propose a novel two-stage framework that lifts 2D images to 3D space, taking full advantage of large-scale and diverse single-view images.
In the first stage, we learn a 3D-aware feature encoder using a combination of multi-view image synthesis and 3D feature Gaussian representation.
In the second stage, a novel-view rendering strategy, combined with large-scale synthetic data generation from single-view images, is employed to learn a feature decoder.
arXiv Detail & Related papers (2025-07-01T03:07:21Z) - ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification [52.405499816861635]
The multiple instance learning (MIL)-based framework has become the mainstream approach for processing whole slide images (WSIs).
We propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
arXiv Detail & Related papers (2025-02-12T13:28:46Z) - Large Spatial Model: End-to-end Unposed Images to Semantic 3D [79.94479633598102]
Large Spatial Model (LSM) processes unposed RGB images directly into semantic radiance fields.
LSM simultaneously estimates geometry, appearance, and semantics in a single feed-forward operation.
It can generate versatile label maps by interacting with language at novel viewpoints.
arXiv Detail & Related papers (2024-10-24T17:54:42Z) - Exploring the Untouched Sweeps for Conflict-Aware 3D Segmentation Pretraining [41.145598142457686]
LiDAR-camera 3D representation pretraining has shown significant promise for 3D perception tasks and related applications.
We propose a novel Vision-Foundation-Model-driven sample exploring module to meticulously select LiDAR-Image pairs from unexplored frames.
Our method consistently outperforms existing state-of-the-art pretraining frameworks across three major public autonomous driving datasets.
arXiv Detail & Related papers (2024-07-10T08:46:29Z) - Cross-Modal Self-Training: Aligning Images and Pointclouds to Learn Classification without Labels [69.55622471172941]
Large-scale 2D vision-language models, such as CLIP, can be aligned with a 3D encoder to learn generalizable (open-vocabulary) 3D vision models.
We propose an optimization framework, Cross-MoST (Cross-Modal Self-Training), to improve the label-free classification performance of a zero-shot 3D vision model.
arXiv Detail & Related papers (2024-04-15T21:30:50Z) - Cross-domain and Cross-dimension Learning for Image-to-Graph
Transformers [50.576354045312115]
Direct image-to-graph transformation is a challenging task that solves object detection and relationship prediction in a single model.
We introduce a set of methods enabling cross-domain and cross-dimension transfer learning for image-to-graph transformers.
We demonstrate our method's utility in cross-domain and cross-dimension experiments, where we pretrain our models on 2D satellite images before applying them to vastly different target domains in 2D and 3D.
arXiv Detail & Related papers (2024-03-11T10:48:56Z) - TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding [28.112402580426174]
TriAdapter Multi-Modal Learning (TAMM) is a novel two-stage learning approach based on three synergistic adapters.
TAMM consistently enhances 3D representations for a wide range of 3D encoder architectures, pre-training datasets, and downstream tasks.
arXiv Detail & Related papers (2024-02-28T17:18:38Z) - MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used to mitigate the data-hungry nature of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z) - CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View
Completion [20.121597331207276]
Masked Image Modeling (MIM) has recently been established as a potent pre-training paradigm.
In this paper we seek to learn representations that transfer well to a wide variety of 3D vision and lower-level geometric downstream tasks.
Our experiments show that our pretext task leads to significantly improved performance for monocular 3D vision downstream tasks.
arXiv Detail & Related papers (2022-10-19T16:50:36Z) - Multimodal Masked Autoencoders Learn Transferable Representations [127.35955819874063]
We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE).
M3AE learns a unified encoder for both vision and language data via masked token prediction.
We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks.
arXiv Detail & Related papers (2022-05-27T19:09:42Z) - Unified 2D and 3D Pre-training for Medical Image classification and
Segmentation [40.01443481859121]
We propose a Universal Self-Supervised Transformer (USST) framework based on the student-teacher paradigm.
USST aims to leverage a huge amount of unlabeled medical data with multiple dimensions to learn rich representations.
It provides promising results on six 2D/3D medical image classification and segmentation tasks.
arXiv Detail & Related papers (2021-12-17T07:27:23Z) - UFO: A UniFied TransfOrmer for Vision-Language Representation Learning [54.82482779792115]
We propose a single UniFied transfOrmer (UFO) capable of processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the concatenation of the image and the question) for vision-language (VL) representation learning.
Existing approaches typically design an individual network for each modality and/or a specific fusion network for multimodal tasks.
arXiv Detail & Related papers (2021-11-19T03:23:10Z) - ResT: An Efficient Transformer for Visual Recognition [5.807423409327807]
This paper presents an efficient multi-scale vision Transformer, called ResT, that capably serves as a general-purpose backbone for image recognition.
We show that the proposed ResT can outperform recent state-of-the-art backbones by a large margin, demonstrating the potential of ResT as a strong backbone.
arXiv Detail & Related papers (2021-05-28T08:53:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.