Transformer-Based Visual Segmentation: A Survey
- URL: http://arxiv.org/abs/2304.09854v4
- Date: Sun, 4 Aug 2024 04:30:45 GMT
- Title: Transformer-Based Visual Segmentation: A Survey
- Authors: Xiangtai Li, Henghui Ding, Haobo Yuan, Wenwei Zhang, Jiangmiao Pang, Guangliang Cheng, Kai Chen, Ziwei Liu, Chen Change Loy
- Abstract summary: Visual segmentation seeks to partition images, video frames, or point clouds into multiple segments or groups.
Transformers are a type of neural network based on self-attention originally designed for natural language processing.
Transformers offer robust, unified, and even simpler solutions for various segmentation tasks.
- Score: 118.01564082499948
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual segmentation seeks to partition images, video frames, or point clouds into multiple segments or groups. This technique has numerous real-world applications, such as autonomous driving, image editing, robot sensing, and medical analysis. Over the past decade, deep learning-based methods have made remarkable strides in this area. Recently, transformers, a type of neural network based on self-attention originally designed for natural language processing, have considerably surpassed previous convolutional or recurrent approaches in various vision processing tasks. Specifically, vision transformers offer robust, unified, and even simpler solutions for various segmentation tasks. This survey provides a thorough overview of transformer-based visual segmentation, summarizing recent advancements. We first review the background, encompassing problem definitions, datasets, and prior convolutional methods. Next, we summarize a meta-architecture that unifies all recent transformer-based approaches. Based on this meta-architecture, we examine various method designs, including modifications to the meta-architecture and associated applications. We also present several closely related settings, including 3D point cloud segmentation, foundation model tuning, domain-aware segmentation, efficient segmentation, and medical segmentation. Additionally, we compile and re-evaluate the reviewed methods on several well-established datasets. Finally, we identify open challenges in this field and propose directions for future research. The project page can be found at https://github.com/lxtGH/Awesome-Segmentation-With-Transformer. We will also continually monitor developments in this rapidly evolving field.
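To make the meta-architecture concrete, here is a minimal sketch of the query-based mask-classification design that unifies many of the surveyed methods: a feature extractor, a pixel decoder producing per-pixel embeddings, and a transformer decoder whose learnable object queries each predict a class and a mask. All module names, sizes, and the stand-in convolutional backbone are illustrative assumptions, not the survey's reference implementation.

```python
import torch
import torch.nn as nn

class QueryBasedSegmenter(nn.Module):
    """Minimal sketch of the query-based meta-architecture: backbone
    features -> pixel decoder -> transformer decoder with learnable
    object queries -> per-query class logits and mask logits."""
    def __init__(self, num_queries=100, num_classes=150, dim=256):
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # stand-in feature extractor
        self.pixel_decoder = nn.Conv2d(dim, dim, kernel_size=1)       # per-pixel embeddings
        self.queries = nn.Embedding(num_queries, dim)                 # learnable object queries
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.class_head = nn.Linear(dim, num_classes + 1)             # +1 for "no object"
        self.mask_head = nn.Linear(dim, dim)

    def forward(self, images):                       # images: (B, 3, H, W)
        feats = self.backbone(images)                # (B, C, h, w)
        pixels = self.pixel_decoder(feats)           # (B, C, h, w)
        memory = feats.flatten(2).transpose(1, 2)    # (B, h*w, C) tokens
        q = self.queries.weight.unsqueeze(0).expand(images.shape[0], -1, -1)
        q = self.decoder(q, memory)                  # queries attend to image features
        logits = self.class_head(q)                  # (B, N, num_classes+1)
        masks = torch.einsum('bnc,bchw->bnhw', self.mask_head(q), pixels)
        return logits, masks                         # mask logits per query
```

Semantic, instance, and panoptic segmentation then differ mainly in how these per-query class/mask pairs are matched to ground truth and post-processed.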
Related papers
- Image Segmentation in Foundation Model Era: A Survey [99.19456390358211]
Current research in image segmentation lacks a detailed analysis of the distinct characteristics, challenges, and solutions associated with foundation model (FM) advancements.
This survey seeks to fill this gap by providing a thorough review of cutting-edge research centered around FM-driven image segmentation.
An exhaustive overview of over 300 segmentation approaches is provided to encapsulate the breadth of current research efforts.
arXiv Detail & Related papers (2024-08-23T10:07:59Z)
- AgileFormer: Spatially Agile Transformer UNet for Medical Image Segmentation [1.657223496316251]
We argue that the current design of vision transformer-based UNet (ViT-UNet) segmentation models may not effectively handle the heterogeneous appearance of target objects in medical images.
We present a structured approach to introduce spatially dynamic components to the ViT-UNet.
This adaptation enables the model to effectively capture features of target objects with diverse appearances.
arXiv Detail & Related papers (2024-03-29T19:25:09Z)
- Understanding Video Transformers for Segmentation: A Survey of Application and Interpretability [10.180033230324561]
Recently, approaches in this research area shifted from concentrating on ConvNet-based to transformer-based models.
Various interpretability approaches have appeared for transformer models and video temporal dynamics.
arXiv Detail & Related papers (2023-10-18T19:58:25Z)
- Meta-Transformer: A Unified Framework for Multimodal Learning [105.77219833997962]
Multimodal learning aims to build models that process and relate information from multiple modalities.
Despite years of development in this field, it remains challenging to design a unified network for processing various modalities.
We propose a framework, named Meta-Transformer, that leverages a frozen encoder to perform multimodal perception.
arXiv Detail & Related papers (2023-07-20T12:10:29Z)
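As a rough illustration of the frozen-encoder idea above (not Meta-Transformer's actual code; the tokenizer shape, head size, and pooling are assumptions):

```python
import torch
import torch.nn as nn

# Sketch: a modality-specific tokenizer maps inputs into a shared token
# space, a frozen transformer encoder embeds them, and only the lightweight
# tokenizer/head are trained.
dim = 768
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
shared_encoder = nn.TransformerEncoder(layer, num_layers=12)
for p in shared_encoder.parameters():
    p.requires_grad = False                                     # encoder stays frozen

image_tokenizer = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # patches -> tokens
task_head = nn.Linear(dim, 1000)                                # trainable task head (assumed)

images = torch.randn(2, 3, 224, 224)
tokens = image_tokenizer(images).flatten(2).transpose(1, 2)     # (B, N, dim)
logits = task_head(shared_encoder(tokens).mean(dim=1))          # pooled prediction
```

Other modalities (point clouds, audio, text) would swap in their own tokenizer while reusing the same frozen encoder.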
- Interactive Image Segmentation with Cross-Modality Vision Transformers [18.075338835513993]
Cross-modality vision transformers exploit mutual information to better guide the learning process.
The stability of our method in terms of avoiding failure cases shows its potential as a practical annotation tool.
arXiv Detail & Related papers (2023-07-05T13:29:05Z)
- Semantic Segmentation using Vision Transformers: A survey [0.0]
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) provide the principal architectures for semantic segmentation.
While ViTs have proven successful in image classification, they cannot be directly applied to dense prediction tasks such as image segmentation and object detection.
This survey aims to review and compare the performances of ViT architectures designed for semantic segmentation using benchmarking datasets.
arXiv Detail & Related papers (2023-05-05T04:11:00Z)
- AttEntropy: Segmenting Unknown Objects in Complex Scenes using the Spatial Attention Entropy of Semantic Segmentation Transformers [99.22536338338011]
We study the spatial attentions of different backbone layers of semantic segmentation transformers.
We exploit this by extracting heatmaps that can be used to segment unknown objects within diverse backgrounds.
Our method is training-free, and its computational overhead is negligible.
arXiv Detail & Related papers (2022-12-29T18:07:56Z)
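A minimal sketch of the attention-entropy signal: compute the Shannon entropy of each token's spatial attention distribution and read the result as a heatmap. Which layers to use and how to aggregate and threshold are choices this sketch assumes, not details taken from the paper.

```python
import torch

def attention_entropy(attn, eps=1e-8):
    # attn: (num_heads, N, N) row-normalized attention over N patch tokens.
    # Returns the entropy of each token's attention distribution,
    # averaged over heads, as an (N,) heatmap.
    ent = -(attn * (attn + eps).log()).sum(dim=-1)   # (num_heads, N)
    return ent.mean(dim=0)                           # (N,)

# Usage sketch: fold the per-token scores back onto the patch grid and
# threshold to obtain a binary mask for unknown objects (threshold assumed).
# heatmap = attention_entropy(attn).reshape(h, w)
# mask = heatmap > heatmap.mean()
```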
- Segmenter: Transformer for Semantic Segmentation [79.9887988699159]
We introduce Segmenter, a transformer model for semantic segmentation.
We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation.
It outperforms the state of the art on the challenging ADE20K dataset and performs on par with it on Pascal Context and Cityscapes.
arXiv Detail & Related papers (2021-05-12T13:01:44Z)
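In that spirit, a minimal sketch of the simplest way a ViT extends to segmentation, matching Segmenter's linear-decoder baseline: classify each patch token, reshape to the patch grid, and bilinearly upsample to pixel resolution (the paper's stronger variant replaces the linear layer with a mask transformer, omitted here).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearDecoderHead(nn.Module):
    """Per-patch linear classification on ViT tokens, upsampled to pixels."""
    def __init__(self, dim=768, num_classes=150, patch=16):
        super().__init__()
        self.patch = patch
        self.classify = nn.Linear(dim, num_classes)

    def forward(self, tokens, img_hw):
        # tokens: (B, N, dim) patch tokens from a ViT encoder (CLS removed)
        H, W = img_hw
        h, w = H // self.patch, W // self.patch          # patch grid, N == h*w
        logits = self.classify(tokens)                   # (B, N, K)
        B, N, K = logits.shape
        logits = logits.transpose(1, 2).reshape(B, K, h, w)
        return F.interpolate(logits, size=(H, W), mode="bilinear",
                             align_corners=False)        # (B, K, H, W)
```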
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long-range dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
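For readers new to the mechanism, a minimal single-head scaled dot-product self-attention illustrating both properties above: every element attends to every other in one parallel step, and the operation treats its input as a set (it is permutation-equivariant). Weight shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def self_attention(x, wq, wk, wv):
    # x: (N, d) sequence; each output mixes information from all N elements,
    # so arbitrarily long-range dependencies are modeled in one step.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / k.shape[-1] ** 0.5    # (N, N) pairwise similarities
    return F.softmax(scores, dim=-1) @ v     # attention-weighted mixture

x = torch.randn(196, 64)                     # e.g., 14x14 ViT patch tokens
wq, wk, wv = (torch.randn(64, 64) for _ in range(3))
out = self_attention(x, wq, wk, wv)          # (196, 64)
```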
This list is automatically generated from the titles and abstracts of the papers on this site.