Group Multi-View Transformer for 3D Shape Analysis with Spatial Encoding
- URL: http://arxiv.org/abs/2312.16477v3
- Date: Mon, 5 Aug 2024 05:51:21 GMT
- Title: Group Multi-View Transformer for 3D Shape Analysis with Spatial Encoding
- Authors: Lixiang Xu, Qingzhe Cui, Richang Hong, Wei Xu, Enhong Chen, Xin Yuan, Chenglong Li, Yuanyan Tang
- Abstract summary: In recent years, the results of view-based 3D shape recognition methods have saturated, and models with excellent performance cannot be deployed on memory-limited devices.
We introduce a compression method based on knowledge distillation for this field, which substantially reduces the number of parameters while preserving model performance as much as possible.
Specifically, to enhance the capabilities of smaller models, we design a high-performing large model called Group Multi-view Vision Transformer (GMViT).
The large model GMViT achieves excellent 3D classification and retrieval results on the benchmark datasets ModelNet, ShapeNetCore55, and MCB.
- Score: 81.1943823985213
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, the results of view-based 3D shape recognition methods have saturated, and models with excellent performance cannot be deployed on memory-limited devices due to their huge parameter sizes. To address this problem, we introduce a compression method based on knowledge distillation for this field, which substantially reduces the number of parameters while preserving model performance as much as possible. Specifically, to enhance the capabilities of smaller models, we design a high-performing large model called Group Multi-view Vision Transformer (GMViT). In GMViT, the view-level ViT first establishes relationships between view-level features. Additionally, to capture deeper features, we employ the grouping module to enhance view-level features into group-level features. Finally, the group-level ViT aggregates group-level features into complete, well-formed 3D shape descriptors. Notably, in both ViTs, we introduce spatial encoding of camera coordinates as innovative position embeddings. Furthermore, we propose two compressed versions based on GMViT, namely GMViT-simple and GMViT-mini. To enhance the training effectiveness of the small models, we introduce a knowledge distillation method throughout the GMViT process, where the key outputs of each GMViT component serve as distillation targets. Extensive experiments demonstrate the efficacy of the proposed method. The large model GMViT achieves excellent 3D classification and retrieval results on the benchmark datasets ModelNet, ShapeNetCore55, and MCB. The smaller models, GMViT-simple and GMViT-mini, reduce the number of parameters by 8 and 17.6 times, respectively, and improve shape recognition speed by 1.5 times on average, while preserving at least 90% of the classification and retrieval performance. The code is available at https://github.com/bigdata-graph/GMViT.
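The pipeline described in the abstract (a view-level ViT with camera-coordinate position embeddings, a grouping module, a group-level ViT, and distillation on the key outputs of each component) can be sketched roughly as below. This is a minimal PyTorch sketch under stated assumptions: the per-view encoder, the embedding dimension, the average-pooling grouping rule, and the KL/MSE distillation losses are illustrative stand-ins, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialEncoding(nn.Module):
    """Camera-coordinate position embedding (assumed here to be a small MLP)."""

    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, cam_xyz):          # cam_xyz: (B, V, 3) camera positions
        return self.mlp(cam_xyz)         # (B, V, dim)


class GMViTSketch(nn.Module):
    """Toy stand-in for GMViT: view-level ViT -> grouping -> group-level ViT."""

    def __init__(self, dim=256, n_groups=4, n_classes=40):
        super().__init__()
        self.n_groups = n_groups
        # Stand-in view encoder; the paper uses a CNN backbone on each rendered view.
        self.view_encoder = nn.Sequential(nn.Flatten(2), nn.Linear(3 * 64 * 64, dim))
        self.pos_enc = SpatialEncoding(dim)
        self.view_vit = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.group_vit = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, views, cam_xyz):
        # views: (B, V, 3, 64, 64) rendered views; cam_xyz: (B, V, 3)
        B, V = views.shape[:2]
        view_feats = self.view_encoder(views)                    # (B, V, dim)
        # View-level ViT with spatial encoding of camera coordinates as position embeddings.
        view_feats = self.view_vit(view_feats + self.pos_enc(cam_xyz))
        # Grouping module, simplified here to average pooling over fixed chunks of views;
        # the paper's grouping is more elaborate.
        g = view_feats.reshape(B, self.n_groups, V // self.n_groups, -1).mean(2)
        g_xyz = cam_xyz.reshape(B, self.n_groups, V // self.n_groups, 3).mean(2)
        # Group-level ViT, again with camera-coordinate position embeddings.
        group_feats = self.group_vit(g + self.pos_enc(g_xyz))
        descriptor = group_feats.mean(1)                         # 3D shape descriptor
        return self.head(descriptor), view_feats, group_feats, descriptor


def distillation_loss(student_out, teacher_out, tau=4.0):
    """Match the student's key component outputs to the teacher's.
    Assumed losses: temperature-scaled KL on logits, MSE on intermediate features;
    feature dimensions are assumed to agree for simplicity."""
    s_logits, *s_feats = student_out
    t_logits, *t_feats = teacher_out
    loss = F.kl_div(F.log_softmax(s_logits / tau, dim=-1),
                    F.softmax(t_logits / tau, dim=-1),
                    reduction="batchmean") * tau * tau
    loss = loss + sum(F.mse_loss(s, t.detach()) for s, t in zip(s_feats, t_feats))
    return loss


# Example: distil a student (here identically sized) from a frozen teacher.
views, cams = torch.randn(2, 20, 3, 64, 64), torch.randn(2, 20, 3)
teacher, student = GMViTSketch(), GMViTSketch()
with torch.no_grad():
    teacher_out = teacher(views, cams)
loss = distillation_loss(student(views, cams), teacher_out)
```

In the actual method, GMViT-simple and GMViT-mini are much smaller than the teacher rather than identically sized, which is where the reported 8x and 17.6x parameter reductions come from.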
Related papers
- Spatially Optimized Compact Deep Metric Learning Model for Similarity Search [1.0015171648915433]
Similarity search is a crucial task where spatial features decide an important output.
This study demonstrates that utilizing a single layer of involution feature extractor alongside a compact convolution model significantly enhances the performance of similarity search.
arXiv Detail & Related papers (2024-04-09T19:49:01Z) - Minimalist and High-Performance Semantic Segmentation with Plain Vision Transformers [10.72362704573323]
We introduce PlainSeg, a model comprising only three 3$\times$3 convolutions in addition to the transformer layers.
We also present PlainSeg-Hier, which makes use of hierarchical features.
arXiv Detail & Related papers (2023-10-19T14:01:40Z) - Generic-to-Specific Distillation of Masked Autoencoders [119.21281960831651]
We propose generic-to-specific distillation (G2SD) to tap the potential of small ViT models under the supervision of large models pre-trained by masked autoencoders.
With G2SD, the vanilla ViT-Small model achieves 98.7%, 98.1%, and 99.3% of the performance of its teacher on image classification, object detection, and semantic segmentation, respectively.
arXiv Detail & Related papers (2023-02-28T17:13:14Z) - GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation [25.689520892609213]
We present a novel non-hierarchical (i.e., non-pyramidal) transformer model for general visual recognition with high-resolution features.
We evaluate GPViT on a variety of visual recognition tasks including image classification, semantic segmentation, object detection, and instance segmentation.
arXiv Detail & Related papers (2022-12-13T18:26:00Z) - ViTPose++: Vision Transformer for Generic Body Pose Estimation [70.86760562151163]
We show the surprisingly good properties of plain vision transformers for body pose estimation from various aspects.
ViTPose employs the plain and non-hierarchical vision transformer as an encoder to encode features and a lightweight decoder to decode body keypoints.
We empirically demonstrate that the knowledge of large ViTPose models can be easily transferred to small ones via a simple knowledge token.
arXiv Detail & Related papers (2022-12-07T12:33:28Z) - Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection [39.37861288287621]
A MIM pre-trained vanilla ViT can work surprisingly well in the challenging object-level recognition scenario.
A random compact convolutional stem supplants the pre-trained large kernel patchify stem.
The proposed detector, named MIMDet, enables a MIM pre-trained vanilla ViT to outperform a hierarchical Swin Transformer by 2.3 box AP and 2.5 mask AP on COCO.
arXiv Detail & Related papers (2022-04-06T17:59:04Z) - ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance: 88.5% Top-1 accuracy on the ImageNet validation set and 91.2% Top-1 accuracy on the ImageNet Real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z) - Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z) - Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)