GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group
Propagation
- URL: http://arxiv.org/abs/2212.06795v2
- Date: Tue, 25 Apr 2023 09:08:55 GMT
- Title: GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group
Propagation
- Authors: Chenhongyi Yang, Jiarui Xu, Shalini De Mello, Elliot J. Crowley,
Xiaolong Wang
- Abstract summary: We present a novel non-hierarchical (i.e. non-pyramidal) transformer model for general visual recognition with high-resolution features.
We evaluate GPViT on a variety of visual recognition tasks including image classification, semantic segmentation, object detection, and instance segmentation.
- Score: 25.689520892609213
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present the Group Propagation Vision Transformer (GPViT): a novel
non-hierarchical (i.e. non-pyramidal) transformer model designed for general
visual recognition with high-resolution features. High-resolution features (or
tokens) are a natural fit for tasks that involve perceiving fine-grained
details such as detection and segmentation, but exchanging global information
between these features is expensive in memory and computation because of the
way self-attention scales. We provide a highly efficient alternative Group
Propagation Block (GP Block) to exchange global information. In each GP Block,
features are first grouped together by a fixed number of learnable group
tokens; we then perform Group Propagation where global information is exchanged
between the grouped features; finally, global information in the updated
grouped features is returned to the image features through a transformer
decoder. We evaluate GPViT on a variety of visual recognition tasks including
image classification, semantic segmentation, object detection, and instance
segmentation. Our method achieves significant performance gains over previous
works across all tasks, especially on tasks that require high-resolution
outputs, for example, our GPViT-L3 outperforms Swin Transformer-B by 2.0 mIoU
on ADE20K semantic segmentation with only half as many parameters. Project
page: chenhongyiyang.com/projects/GPViT/GPViT
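The GP Block description in the abstract maps onto three stages: grouping, group propagation, and ungrouping. The following is a minimal, illustrative PyTorch sketch of that idea only, not the authors' reference implementation; the module name GPBlockSketch, the default of 64 group tokens, and the use of a standard encoder layer for the propagation step are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class GPBlockSketch(nn.Module):
    """Sketch of a Group Propagation Block: group image tokens with a fixed
    set of learnable group tokens, exchange global information among the
    groups, then return it to the image tokens via cross-attention."""

    def __init__(self, dim: int, num_groups: int = 64, num_heads: int = 8):
        super().__init__()
        # Fixed number of learnable group tokens (M groups, M << N image tokens).
        self.group_tokens = nn.Parameter(torch.randn(1, num_groups, dim) * 0.02)
        # Grouping: group tokens attend to image tokens (cross-attention).
        self.group_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Group Propagation: exchange global information among the M groups.
        self.propagate = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                                    batch_first=True)
        # Ungrouping: image tokens query the updated groups (decoder-style return).
        self.ungroup_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) high-resolution image tokens.
        g = self.group_tokens.expand(x.shape[0], -1, -1)        # (B, M, C)
        g, _ = self.group_attn(query=g, key=x, value=x)         # grouping, O(N*M)
        g = self.propagate(g)                                    # propagation, O(M^2)
        out, _ = self.ungroup_attn(query=x, key=g, value=g)     # ungrouping, O(N*M)
        return x + out                                           # residual update


# Example: 56x56 = 3136 high-resolution tokens, 64 group tokens.
tokens = torch.randn(2, 56 * 56, 256)
print(GPBlockSketch(dim=256)(tokens).shape)  # torch.Size([2, 3136, 256])
```

With the number of group tokens M fixed and small, the grouping and ungrouping cross-attentions cost O(N*M) and the propagation step O(M^2), which avoids the O(N^2) cost of full self-attention over N high-resolution image tokens.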
Related papers
- HiFiSeg: High-Frequency Information Enhanced Polyp Segmentation with Global-Local Vision Transformer [5.96521715927858]
HiFiSeg is a novel network for colon polyp segmentation that enhances high-frequency information processing.
GLIM employs a parallel structure to fuse global and local information at multiple scales, effectively capturing fine-grained features.
SAM selectively integrates boundary details from low-level features with semantic information from high-level features, significantly improving the model's ability to accurately detect and segment polyps.
arXiv Detail & Related papers (2024-10-03T14:36:22Z) - Other Tokens Matter: Exploring Global and Local Features of Vision Transformers for Object Re-Identification [63.147482497821166]
We first explore the influence of global and local features of ViT and then propose a novel Global-Local Transformer (GLTrans) for high-performance object Re-ID.
Our proposed method achieves superior performance on four object Re-ID benchmarks.
arXiv Detail & Related papers (2024-04-23T12:42:07Z) - GRA: Detecting Oriented Objects through Group-wise Rotating and Attention [64.21917568525764]
Group-wise Rotating and Attention (GRA) module is proposed to replace the convolution operations in backbone networks for oriented object detection.
GRA adaptively captures fine-grained features of objects with diverse orientations and comprises two key components: Group-wise Rotating and Group-wise Attention.
GRA achieves a new state-of-the-art (SOTA) on the DOTA-v2.0 benchmark while reducing parameters by nearly 50% compared to the previous SOTA method.
arXiv Detail & Related papers (2024-03-17T07:29:32Z) - Leveraging Swin Transformer for Local-to-Global Weakly Supervised
Semantic Segmentation [12.103012959947055]
This work explores the use of Swin Transformer by proposing "SWTformer" to enhance the accuracy of the initial seed CAMs.
SWTformer-V1 achieves 0.98% higher localization accuracy (mAP), outperforming state-of-the-art models.
SWTformer-V2 incorporates a multi-scale feature fusion mechanism to extract additional information.
arXiv Detail & Related papers (2024-01-31T13:41:17Z) - Group Multi-View Transformer for 3D Shape Analysis with Spatial Encoding [81.1943823985213]
In recent years, the results of view-based 3D shape recognition methods have saturated, and models with excellent performance cannot be deployed on memory-limited devices.
We introduce a compression method based on knowledge distillation for this field, which greatly reduces the number of parameters while preserving model performance as much as possible.
Specifically, to enhance the capabilities of smaller models, we design a high-performing large model called Group Multi-view Vision Transformer (GMViT)
The large model GMViT achieves excellent 3D classification and retrieval results on the benchmark datasets ModelNet, ShapeNetCore55, and MCB.
arXiv Detail & Related papers (2023-12-27T08:52:41Z) - ZJU ReLER Submission for EPIC-KITCHEN Challenge 2023: Semi-Supervised
Video Object Segmentation [62.98078087018469]
We introduce MSDeAOT, a variant of the AOT framework that incorporates transformers at multiple feature scales.
MSDeAOT efficiently propagates object masks from previous frames to the current frame using a feature scale with a stride of 16.
We also employ GPM in a more refined feature scale with a stride of 8, leading to improved accuracy in detecting and tracking small objects.
arXiv Detail & Related papers (2023-07-05T03:43:15Z) - Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z) - Transformer Scale Gate for Semantic Segmentation [53.27673119360868]
Transformer Scale Gate (TSG) exploits cues in the self- and cross-attention of Vision Transformers for scale selection.
Our experiments on the Pascal Context and ADE20K datasets demonstrate that our feature selection strategy achieves consistent gains.
arXiv Detail & Related papers (2022-05-14T13:11:39Z) - GroupTransNet: Group Transformer Network for RGB-D Salient Object
Detection [5.876499671899904]
We propose a novel Group Transformer Network (GroupTransNet) for RGB-D salient object detection.
GroupTransNet is good at learning the long-range dependencies of cross layer features.
Experiments demonstrate that GroupTransNet outperforms comparison models.
arXiv Detail & Related papers (2022-03-21T08:00:16Z) - RGBT Tracking via Multi-Adapter Network with Hierarchical Divergence
Loss [37.99375824040946]
We propose a novel multi-adapter network to jointly perform modality-shared, modality-specific and instance-aware target representation learning.
Experiments on two RGBT tracking benchmark datasets demonstrate the outstanding performance of the proposed tracker.
arXiv Detail & Related papers (2020-11-14T01:50:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information (including all listed details) and is not responsible for any consequences arising from its use.