Transformer-Based Visual Segmentation: A Survey
- URL: http://arxiv.org/abs/2304.09854v4
- Date: Sun, 4 Aug 2024 04:30:45 GMT
- Title: Transformer-Based Visual Segmentation: A Survey
- Authors: Xiangtai Li, Henghui Ding, Haobo Yuan, Wenwei Zhang, Jiangmiao Pang, Guangliang Cheng, Kai Chen, Ziwei Liu, Chen Change Loy
- Abstract summary: Visual segmentation seeks to partition images, video frames, or point clouds into multiple segments or groups.
Transformers are a type of neural network based on self-attention originally designed for natural language processing.
Transformers offer robust, unified, and even simpler solutions for various segmentation tasks.
- Score: 118.01564082499948
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual segmentation seeks to partition images, video frames, or point clouds into multiple segments or groups. This technique has numerous real-world applications, such as autonomous driving, image editing, robot sensing, and medical analysis. Over the past decade, deep learning-based methods have made remarkable strides in this area. Recently, transformers, a type of neural network based on self-attention originally designed for natural language processing, have considerably surpassed previous convolutional or recurrent approaches in various vision processing tasks. Specifically, vision transformers offer robust, unified, and even simpler solutions for various segmentation tasks. This survey provides a thorough overview of transformer-based visual segmentation, summarizing recent advancements. We first review the background, encompassing problem definitions, datasets, and prior convolutional methods. Next, we summarize a meta-architecture that unifies all recent transformer-based approaches. Based on this meta-architecture, we examine various method designs, including modifications to the meta-architecture and associated applications. We also present several closely related settings, including 3D point cloud segmentation, foundation model tuning, domain-aware segmentation, efficient segmentation, and medical segmentation. Additionally, we compile and re-evaluate the reviewed methods on several well-established datasets. Finally, we identify open challenges in this field and propose directions for future research. The project page can be found at https://github.com/lxtGH/Awesome-Segmentation-With-Transformer. We will also continually monitor developments in this rapidly evolving field.
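To make the meta-architecture concrete, here is a minimal sketch of the query-based mask-classification design that unifies many of the surveyed methods: a feature extractor, a pixel decoder producing per-pixel embeddings, and a transformer decoder whose learnable object queries each predict a class and a mask. All module names, sizes, and the stand-in convolutional backbone are illustrative assumptions, not the survey's reference implementation.

```python
import torch
import torch.nn as nn

class QueryBasedSegmenter(nn.Module):
    """Minimal sketch of the query-based meta-architecture: backbone
    features -> pixel decoder -> transformer decoder with learnable
    object queries -> per-query class logits and mask logits."""
    def __init__(self, num_queries=100, num_classes=150, dim=256):
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # stand-in feature extractor
        self.pixel_decoder = nn.Conv2d(dim, dim, kernel_size=1)       # per-pixel embeddings
        self.queries = nn.Embedding(num_queries, dim)                 # learnable object queries
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.class_head = nn.Linear(dim, num_classes + 1)             # +1 for "no object"
        self.mask_head = nn.Linear(dim, dim)

    def forward(self, images):                       # images: (B, 3, H, W)
        feats = self.backbone(images)                # (B, C, h, w)
        pixels = self.pixel_decoder(feats)           # (B, C, h, w)
        memory = feats.flatten(2).transpose(1, 2)    # (B, h*w, C) tokens
        q = self.queries.weight.unsqueeze(0).expand(images.shape[0], -1, -1)
        q = self.decoder(q, memory)                  # queries attend to image features
        logits = self.class_head(q)                  # (B, N, num_classes+1)
        masks = torch.einsum('bnc,bchw->bnhw', self.mask_head(q), pixels)
        return logits, masks                         # mask logits per query
```

Semantic, instance, and panoptic segmentation then differ mainly in how these per-query class/mask pairs are matched to ground truth and post-processed.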
Related papers
- Image Segmentation in Foundation Model Era: A Survey [99.19456390358211]
Current research in image segmentation lacks a detailed analysis of the distinct characteristics, challenges, and solutions associated with foundation model (FM) advancements.
This survey seeks to fill this gap by providing a thorough review of cutting-edge research centered around FM-driven image segmentation.
An exhaustive overview of over 300 segmentation approaches is provided to encapsulate the breadth of current research efforts.
arXiv Detail & Related papers (2024-08-23T10:07:59Z)
- AgileFormer: Spatially Agile Transformer UNet for Medical Image Segmentation [1.657223496316251]
We argue that the current design of vision transformer-based UNet (ViT-UNet) segmentation models may not effectively handle the heterogeneous appearance of target objects in medical images.
We present a structured approach to introduce spatially dynamic components to the ViT-UNet.
This adaptation enables the model to effectively capture features of target objects with diverse appearances.
arXiv Detail & Related papers (2024-03-29T19:25:09Z)
- Understanding Video Transformers for Segmentation: A Survey of Application and Interpretability [10.180033230324561]
Recently, approaches in this research area shifted from concentrating on ConvNet-based to transformer-based models.
Various interpretability approaches have appeared for transformer models and video temporal dynamics.
arXiv Detail & Related papers (2023-10-18T19:58:25Z)
- Meta-Transformer: A Unified Framework for Multimodal Learning [105.77219833997962]
Multimodal learning aims to build models that process and relate information from multiple modalities.
Despite years of development in this field, it remains challenging to design a unified network for processing various modalities.
We propose a framework, named Meta-Transformer, that leverages a frozen encoder to perform multimodal perception.
arXiv Detail & Related papers (2023-07-20T12:10:29Z)
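As a rough illustration of the frozen-encoder idea above (not Meta-Transformer's actual code; the tokenizer shape, head size, and pooling are assumptions):

```python
import torch
import torch.nn as nn

# Sketch: a modality-specific tokenizer maps inputs into a shared token
# space, a frozen transformer encoder embeds them, and only the lightweight
# tokenizer/head are trained.
dim = 768
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
shared_encoder = nn.TransformerEncoder(layer, num_layers=12)
for p in shared_encoder.parameters():
    p.requires_grad = False                                     # encoder stays frozen

image_tokenizer = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # patches -> tokens
task_head = nn.Linear(dim, 1000)                                # trainable task head (assumed)

images = torch.randn(2, 3, 224, 224)
tokens = image_tokenizer(images).flatten(2).transpose(1, 2)     # (B, N, dim)
logits = task_head(shared_encoder(tokens).mean(dim=1))          # pooled prediction
```

Other modalities (point clouds, audio, text) would swap in their own tokenizer while reusing the same frozen encoder.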
- Interactive Image Segmentation with Cross-Modality Vision Transformers [18.075338835513993]
Cross-modality vision transformers exploit mutual information to better guide the learning process.
The stability of our method in terms of avoiding failure cases shows its potential as a practical annotation tool.
arXiv Detail & Related papers (2023-07-05T13:29:05Z)
- Semantic Segmentation using Vision Transformers: A survey [0.0]
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) provide the principal architectures for semantic segmentation.
While ViTs have proven successful in image classification, they cannot be directly applied to dense prediction tasks such as image segmentation and object detection.
This survey aims to review and compare the performances of ViT architectures designed for semantic segmentation using benchmarking datasets.
arXiv Detail & Related papers (2023-05-05T04:11:00Z)
- AttEntropy: Segmenting Unknown Objects in Complex Scenes using the Spatial Attention Entropy of Semantic Segmentation Transformers [99.22536338338011]
We study the spatial attentions of different backbone layers of semantic segmentation transformers.
We exploit this by extracting heatmaps that can be used to segment unknown objects within diverse backgrounds.
Our method is training-free, and its computational overhead is negligible.
arXiv Detail & Related papers (2022-12-29T18:07:56Z)
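A minimal sketch of the attention-entropy signal: compute the Shannon entropy of each token's spatial attention distribution and read the result as a heatmap. Which layers to use and how to aggregate and threshold are choices this sketch assumes, not details taken from the paper.

```python
import torch

def attention_entropy(attn, eps=1e-8):
    # attn: (num_heads, N, N) row-normalized attention over N patch tokens.
    # Returns the entropy of each token's attention distribution,
    # averaged over heads, as an (N,) heatmap.
    ent = -(attn * (attn + eps).log()).sum(dim=-1)   # (num_heads, N)
    return ent.mean(dim=0)                           # (N,)

# Usage sketch: fold the per-token scores back onto the patch grid and
# threshold to obtain a binary mask for unknown objects (threshold assumed).
# heatmap = attention_entropy(attn).reshape(h, w)
# mask = heatmap > heatmap.mean()
```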
- Segmenter: Transformer for Semantic Segmentation [79.9887988699159]
We introduce Segmenter, a transformer model for semantic segmentation.
We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation.
It outperforms the state of the art on the challenging ADE20K dataset and performs on par with it on Pascal Context and Cityscapes.
arXiv Detail & Related papers (2021-05-12T13:01:44Z)
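In that spirit, a minimal sketch of the simplest way a ViT extends to segmentation, matching Segmenter's linear-decoder baseline: classify each patch token, reshape to the patch grid, and bilinearly upsample to pixel resolution (the paper's stronger variant replaces the linear layer with a mask transformer, omitted here).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearDecoderHead(nn.Module):
    """Per-patch linear classification on ViT tokens, upsampled to pixels."""
    def __init__(self, dim=768, num_classes=150, patch=16):
        super().__init__()
        self.patch = patch
        self.classify = nn.Linear(dim, num_classes)

    def forward(self, tokens, img_hw):
        # tokens: (B, N, dim) patch tokens from a ViT encoder (CLS removed)
        H, W = img_hw
        h, w = H // self.patch, W // self.patch          # patch grid, N == h*w
        logits = self.classify(tokens)                   # (B, N, K)
        B, N, K = logits.shape
        logits = logits.transpose(1, 2).reshape(B, K, h, w)
        return F.interpolate(logits, size=(H, W), mode="bilinear",
                             align_corners=False)        # (B, K, H, W)
```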
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long-range dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
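For readers new to the mechanism, a minimal single-head scaled dot-product self-attention illustrating both properties above: every element attends to every other in one parallel step, and the operation treats its input as a set (it is permutation-equivariant). Weight shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def self_attention(x, wq, wk, wv):
    # x: (N, d) sequence; each output mixes information from all N elements,
    # so arbitrarily long-range dependencies are modeled in one step.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / k.shape[-1] ** 0.5    # (N, N) pairwise similarities
    return F.softmax(scores, dim=-1) @ v     # attention-weighted mixture

x = torch.randn(196, 64)                     # e.g., 14x14 ViT patch tokens
wq, wk, wv = (torch.randn(64, 64) for _ in range(3))
out = self_attention(x, wq, wk, wv)          # (196, 64)
```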
This list is automatically generated from the titles and abstracts of the papers on this site.