VISTA3D: Versatile Imaging SegmenTation and Annotation model for 3D Computed Tomography
- URL: http://arxiv.org/abs/2406.05285v1
- Date: Fri, 7 Jun 2024 22:41:39 GMT
- Title: VISTA3D: Versatile Imaging SegmenTation and Annotation model for 3D Computed Tomography
- Authors: Yufan He, Pengfei Guo, Yucheng Tang, Andriy Myronenko, Vishwesh Nath, Ziyue Xu, Dong Yang, Can Zhao, Benjamin Simon, Mason Belue, Stephanie Harmon, Baris Turkbey, Daguang Xu, Wenqi Li
- Abstract summary: We discuss what features a 3D CT segmentation foundation model should have, and introduce VISTA3D, Versatile Imaging SegmenTation and Annotation model.
The model is trained systematically on 11454 volumes encompassing 127 types of human anatomical structures and various lesions.
The model's design also achieves state-of-the-art zero-shot interactive segmentation in 3D.
- Score: 18.111368889931885
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Segmentation foundation models have attracted great interest; however, none of them is adequate for the use cases in 3D computed tomography (CT) images. Existing works finetune on medical images with 2D foundation models trained on natural images, but interactive segmentation, especially in 2D, is too time-consuming for 3D scans and less useful for large cohort analysis. Models that can perform out-of-the-box automatic segmentation are more desirable. However, models trained in this way lack the ability to segment unseen objects such as novel tumors. Thus, for 3D medical image analysis, an ideal segmentation solution should offer two features: accurate out-of-the-box performance covering major organ classes, and effective adaptation or zero-shot ability for novel structures. In this paper, we discuss what features a 3D CT segmentation foundation model should have, and introduce VISTA3D, Versatile Imaging SegmenTation and Annotation model. The model is trained systematically on 11454 volumes encompassing 127 types of human anatomical structures and various lesions and provides accurate out-of-the-box segmentation. The model's design also achieves state-of-the-art zero-shot interactive segmentation in 3D. The novel model design and training recipe represent a promising step toward developing a versatile medical image foundation model. Code and model weights will be released shortly. An early version of the online demo can be tried at https://build.nvidia.com/nvidia/vista-3d.
Related papers
- 3D-Fixup: Advancing Photo Editing with 3D Priors [32.83193513442457]
3D-Fixup is a new framework for editing 2D images guided by learned 3D priors.
We leverage a training-based approach that harnesses the generative power of diffusion models.
We show that 3D-Fixup effectively supports complex, identity coherent 3D-aware edits.
arXiv Detail & Related papers (2025-05-15T17:59:51Z)
- DINO in the Room: Leveraging 2D Foundation Models for 3D Segmentation [51.43837087865105]
Vision foundation models (VFMs) trained on large-scale image datasets provide high-quality features that have significantly advanced 2D visual recognition.
Their potential in 3D vision remains largely untapped, despite the common availability of 2D images alongside 3D point cloud datasets.
We introduce DITR, a simple yet effective approach that extracts 2D foundation model features, projects them to 3D, and finally injects them into a 3D point cloud segmentation model.
arXiv Detail & Related papers (2025-03-24T17:59:11Z)
- Unifying 2D and 3D Vision-Language Understanding [85.84054120018625]
We introduce UniVLG, a unified architecture for 2D and 3D vision-language learning.
UniVLG bridges the gap between existing 2D-centric models and the rich 3D sensory data available in embodied systems.
arXiv Detail & Related papers (2025-03-13T17:56:22Z)
- A Lesson in Splats: Teacher-Guided Diffusion for 3D Gaussian Splats Generation with 2D Supervision [65.33043028101471]
We present a novel framework for training 3D image-conditioned diffusion models using only 2D supervision.
Most existing 3D generative models rely on full 3D supervision, which is impractical due to the scarcity of large-scale 3D datasets.
arXiv Detail & Related papers (2024-12-01T00:29:57Z)
- ConDense: Consistent 2D/3D Pre-training for Dense and Sparse Features from Multi-View Images [47.682942867405224]
ConDense is a framework for 3D pre-training utilizing existing 2D networks and large-scale multi-view datasets.
We propose a novel 2D-3D joint training scheme to extract co-embedded 2D and 3D features in an end-to-end pipeline.
arXiv Detail & Related papers (2024-08-30T05:57:01Z)
- Improving 2D Feature Representations by 3D-Aware Fine-Tuning [17.01280751430423]
Current visual foundation models are trained purely on unstructured 2D data.
We show that fine-tuning on 3D-aware data improves the quality of emerging semantic features.
arXiv Detail & Related papers (2024-07-29T17:59:21Z)
- DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data [50.164670363633704]
We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets from text prompts.
Our model is directly trained on extensive noisy and unaligned 'in-the-wild' 3D assets.
We achieve state-of-the-art performance in both single-class generation and text-to-3D generation.
arXiv Detail & Related papers (2024-06-06T17:58:15Z)
- Probing the 3D Awareness of Visual Foundation Models [56.68380136809413]
We analyze the 3D awareness of visual foundation models.
We conduct experiments using task-specific probes and zero-shot inference procedures on frozen features.
arXiv Detail & Related papers (2024-04-12T17:58:04Z)
- PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
- Uni3D: Exploring Unified 3D Representation at Scale [66.26710717073372]
We present Uni3D, a 3D foundation model to explore the unified 3D representation at scale.
Uni3D uses a 2D ViT end-to-end pretrained to align the 3D point cloud features with the image-text aligned features.
We show that the strong Uni3D representation also enables applications such as 3D painting and retrieval in the wild.
arXiv Detail & Related papers (2023-10-10T16:49:21Z)
- GET3D: A Generative Model of High Quality 3D Textured Shapes Learned from Images [72.15855070133425]
We introduce GET3D, a Generative model that directly generates Explicit Textured 3D meshes with complex topology, rich geometric details, and high-fidelity textures.
GET3D is able to generate high-quality 3D textured meshes, ranging from cars, chairs, animals, motorbikes and human characters to buildings.
arXiv Detail & Related papers (2022-09-22T17:16:19Z)
- FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection [78.00922683083776]
It is non-trivial to make a general adapted 2D detector work in this 3D task.
In this technical report, we study this problem with a practice built on a fully convolutional single-stage detector.
Our solution achieves 1st place out of all the vision-only methods in the nuScenes 3D detection challenge of NeurIPS 2020.
arXiv Detail & Related papers (2021-04-22T09:35:35Z)
- A Convolutional Architecture for 3D Model Embedding [1.3858051019755282]
We propose a deep learning architecture to handle 3D models as an input.
We show that the embedding representation conveys semantic information that helps to deal with the similarity assessment of 3D objects.
arXiv Detail & Related papers (2021-03-05T15:46:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.