Transformer-Based Visual Segmentation: A Survey
- URL: http://arxiv.org/abs/2304.09854v3
- Date: Wed, 20 Dec 2023 05:21:20 GMT
- Title: Transformer-Based Visual Segmentation: A Survey
- Authors: Xiangtai Li, Henghui Ding, Haobo Yuan, Wenwei Zhang, Jiangmiao Pang,
Guangliang Cheng, Kai Chen, Ziwei Liu, Chen Change Loy
- Abstract summary: Visual segmentation seeks to partition images, video frames, or point clouds into multiple segments or groups.
Transformers are a type of neural network based on self-attention originally designed for natural language processing.
Transformers offer robust, unified, and even simpler solutions for various segmentation tasks.
- Score: 122.45372317618309
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual segmentation seeks to partition images, video frames, or point clouds
into multiple segments or groups. This technique has numerous real-world
applications, such as autonomous driving, image editing, robot sensing, and
medical analysis. Over the past decade, deep learning-based methods have made
remarkable strides in this area. Recently, transformers, a type of neural
network based on self-attention originally designed for natural language
processing, have considerably surpassed previous convolutional or recurrent
approaches in various vision processing tasks. Specifically, vision
transformers offer robust, unified, and even simpler solutions for various
segmentation tasks. This survey provides a thorough overview of
transformer-based visual segmentation, summarizing recent advancements. We
first review the background, encompassing problem definitions, datasets, and
prior convolutional methods. Next, we summarize a meta-architecture that
unifies all recent transformer-based approaches. Based on this
meta-architecture, we examine various method designs, including modifications
to the meta-architecture and associated applications. We also present several
closely related settings, including 3D point cloud segmentation, foundation
model tuning, domain-aware segmentation, efficient segmentation, and medical
segmentation. Additionally, we compile and re-evaluate the reviewed methods on
several well-established datasets. Finally, we identify open challenges in this
field and propose directions for future research. The project page can be found
at https://github.com/lxtGH/Awesome-Segmentation-With-Transformer. We will also
continually monitor developments in this rapidly evolving field.
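The abstract characterizes transformers as neural networks built on self-attention. As a quick illustration of that core operation, here is a minimal NumPy sketch of single-head scaled dot-product self-attention; the shapes, projection matrices, and example data are illustrative assumptions, not taken from any surveyed model.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (n_tokens, d_model) input sequence (e.g. image patches or points).
    w_q, w_k, w_v: (d_model, d_head) learned projection matrices.
    Returns: (n_tokens, d_head) attended features.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])       # (n_tokens, n_tokens)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 32))                     # 16 tokens, d_model=32
w_q, w_k, w_v = (rng.normal(size=(32, 8)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (16, 8)
```

Because every token attends to every other token, each output row mixes information from the full input, which is the global-context property the surveyed segmentation models build on.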
Related papers
- AgileFormer: Spatially Agile Transformer UNet for Medical Image Segmentation [1.657223496316251]
We argue that the current design of vision transformer-based UNet (ViT-UNet) segmentation models may not effectively handle the heterogeneous appearance of target objects.
We present a structured approach to introduce spatially dynamic components to the ViT-UNet.
This adaptation enables the model to effectively capture features of target objects with diverse appearances.
arXiv Detail & Related papers (2024-03-29T19:25:09Z)
- Understanding Video Transformers for Segmentation: A Survey of Application and Interpretability [10.180033230324561]
Recently, approaches in this research area have shifted from ConvNet-based to transformer-based models.
Various interpretability approaches have appeared for transformer models and video temporal dynamics.
arXiv Detail & Related papers (2023-10-18T19:58:25Z)
- Meta-Transformer: A Unified Framework for Multimodal Learning [105.77219833997962]
Multimodal learning aims to build models that process and relate information from multiple modalities.
Despite years of development in this field, it still remains challenging to design a unified network for processing various modalities.
We propose a framework, named Meta-Transformer, that leverages a frozen encoder to perform multimodal perception.
arXiv Detail & Related papers (2023-07-20T12:10:29Z)
- Interactive Image Segmentation with Cross-Modality Vision Transformers [18.075338835513993]
Cross-modality vision transformers exploit mutual information to better guide the learning process.
The stability of our method in terms of avoiding failure cases shows its potential as a practical annotation tool.
arXiv Detail & Related papers (2023-07-05T13:29:05Z)
- Semantic Segmentation using Vision Transformers: A survey [0.0]
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) are the dominant architectures for semantic segmentation.
Although ViTs have proven successful in image classification, they cannot be directly applied to dense prediction tasks such as image segmentation and object detection.
This survey aims to review and compare the performances of ViT architectures designed for semantic segmentation using benchmarking datasets.
arXiv Detail & Related papers (2023-05-05T04:11:00Z)
- AttEntropy: Segmenting Unknown Objects in Complex Scenes using the Spatial Attention Entropy of Semantic Segmentation Transformers [99.22536338338011]
We study the spatial attentions of different backbone layers of semantic segmentation transformers.
We exploit this by extracting heatmaps that can be used to segment unknown objects within diverse backgrounds.
Our method is training-free, and its computational overhead is negligible.
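The training-free idea described above can be sketched in a few lines: compute the entropy of each query's spatial attention distribution and average it across heads to obtain a heatmap. This is an illustrative sketch only; the tensor shapes and the entropy-over-keys convention are assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def attention_entropy(attn):
    """Per-query spatial attention entropy, averaged over heads.

    attn: (n_heads, n_queries, n_keys), each row a distribution over keys.
    Returns: (n_queries,) entropy heatmap. Low entropy means focused
    attention, which can flag object-like regions against clutter.
    """
    p = np.clip(attn, 1e-12, 1.0)            # avoid log(0)
    ent = -(p * np.log(p)).sum(axis=-1)      # (n_heads, n_queries)
    return ent.mean(axis=0)                  # average over heads

# Uniform attention attains the maximum entropy log(n_keys);
# a one-hot (perfectly focused) row has entropy near 0.
uniform = np.full((1, 4, 64), 1.0 / 64)
one_hot = np.zeros((1, 1, 64))
one_hot[0, 0, 0] = 1.0
print(attention_entropy(uniform))  # ~log(64) ≈ 4.159 for each query
print(attention_entropy(one_hot))  # ~0
```

Thresholding such a heatmap (or combining heatmaps from several backbone layers) is one plausible way to obtain unknown-object segments without any training.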
arXiv Detail & Related papers (2022-12-29T18:07:56Z)
- Vision Transformers: From Semantic Segmentation to Dense Prediction [144.38869017091199]
The success of vision transformers (ViTs) in image classification has shifted the methodology of visual representation learning.
In this work, we explore the global context learning potentials of ViTs for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
- Segmenter: Transformer for Semantic Segmentation [79.9887988699159]
We introduce Segmenter, a transformer model for semantic segmentation.
We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation.
It outperforms the state of the art on the challenging ADE20K dataset and performs on par on Pascal Context and Cityscapes.
arXiv Detail & Related papers (2021-05-12T13:01:44Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long-range dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.