Group Multi-View Transformer for 3D Shape Analysis with Spatial Encoding
- URL: http://arxiv.org/abs/2312.16477v2
- Date: Sat, 30 Dec 2023 08:21:36 GMT
- Title: Group Multi-View Transformer for 3D Shape Analysis with Spatial Encoding
- Authors: Lixiang Xu, Qingzhe Cui, Richang Hong, Wei Xu, Enhong Chen, Xin Yuan,
Chenglong Li, Yuanyan Tang
- Abstract summary: In recent years, the results of view-based 3D shape recognition methods have saturated, and models with excellent performance cannot be deployed on memory-limited devices.
We introduce a compression method based on knowledge distillation for this field, which largely reduces the number of parameters while preserving model performance as much as possible.
Specifically, to enhance the capabilities of smaller models, we design a high-performing large model called Group Multi-view Vision Transformer (GMViT).
The large model GMViT achieves excellent 3D classification and retrieval results on the benchmark datasets ModelNet, ShapeNetCore55, and MCB.
- Score: 84.69144118699766
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, the results of view-based 3D shape recognition methods have
saturated, and models with excellent performance cannot be deployed on
memory-limited devices due to their huge parameter counts. To address this
problem, we introduce a compression method based on knowledge distillation for
this field, which greatly reduces the number of parameters while preserving
model performance as much as possible. Specifically, to enhance the
capabilities of smaller models, we design a high-performing large model called
Group Multi-view Vision Transformer (GMViT). In GMViT, the view-level ViT first
establishes relationships between view-level features. Additionally, to capture
deeper features, we employ the grouping module to enhance view-level features
into group-level features. Finally, the group-level ViT aggregates group-level
features into complete, well-formed 3D shape descriptors. Notably, in both
ViTs, we introduce spatial encoding of camera coordinates as innovative
position embeddings. Furthermore, we propose two compressed versions based on
GMViT, namely GMViT-simple and GMViT-mini. To enhance the training
effectiveness of the small models, we introduce a knowledge distillation method
throughout the GMViT process, where the key outputs of each GMViT component
serve as distillation targets. Extensive experiments demonstrate the efficacy
of the proposed method. The large model GMViT achieves excellent 3D
classification and retrieval results on the benchmark datasets ModelNet,
ShapeNetCore55, and MCB. The smaller models, GMViT-simple and GMViT-mini,
reduce the parameter count by 8 and 17.6 times, respectively, and speed up shape
recognition by 1.5 times on average, while preserving at least 90% of the
classification and retrieval performance.
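The pipeline the abstract describes (a view-level ViT with camera-coordinate position embeddings, a grouping module, a group-level ViT, and per-component distillation) can be pictured with the hedged PyTorch sketch below. All names, dimensions, the soft-grouping rule, and the loss weighting are assumptions inferred from the abstract, not the paper's actual code.

```python
# Hypothetical sketch of a GMViT-style pipeline; shapes, the grouping rule,
# and the distillation targets are assumptions based only on the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMViTSketch(nn.Module):
    def __init__(self, d_model=256, n_groups=4, n_classes=40):
        super().__init__()
        # Spatial encoding: embed each view's 3-D camera coordinate and use it
        # as the position embedding instead of a learned index embedding.
        self.pos_mlp = nn.Sequential(
            nn.Linear(3, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
        layer = lambda: nn.TransformerEncoderLayer(
            d_model, nhead=8, batch_first=True)
        self.view_vit = nn.TransformerEncoder(layer(), num_layers=2)
        self.group_vit = nn.TransformerEncoder(layer(), num_layers=2)
        self.assign = nn.Linear(d_model, n_groups)  # stand-in grouping module
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, view_feats, cam_xyz):
        # view_feats: (B, V, d) per-view features from a 2-D backbone
        # cam_xyz:    (B, V, 3) camera coordinates of the rendered views
        x = self.view_vit(view_feats + self.pos_mlp(cam_xyz))
        w = self.assign(x).softmax(dim=-1)                # (B, V, G)
        w = w / w.sum(dim=1, keepdim=True).clamp_min(1e-6)
        g = torch.einsum('bvg,bvd->bgd', w, x)            # group-level features
        g_xyz = torch.einsum('bvg,bvc->bgc', w, cam_xyz)  # group "camera" centers
        g = self.group_vit(g + self.pos_mlp(g_xyz))
        desc = g.mean(dim=1)                              # 3-D shape descriptor
        return self.head(desc), (x, g, desc)              # logits + stage outputs

def distill_loss(student_out, teacher_out, T=4.0):
    # Distill every stage: KL on softened logits plus MSE on each component's
    # output. Assumes matched feature widths; a real compressed student would
    # need learned projections to align its smaller dimensions.
    s_logits, s_feats = student_out
    t_logits, t_feats = teacher_out
    loss = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                    F.softmax(t_logits / T, dim=-1),
                    reduction='batchmean') * T * T
    for s, t in zip(s_feats, t_feats):
        loss = loss + F.mse_loss(s, t.detach())
    return loss
```

Under this reading, a GMViT-simple/-mini student would reuse the same forward pass with a smaller width and fewer layers, trained against the frozen large model's stage outputs.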
Related papers
- Spatially Optimized Compact Deep Metric Learning Model for Similarity Search [1.0015171648915433]
Similarity search is a crucial task in which spatial features determine the output.
This study demonstrates that using a single involution layer as the feature extractor alongside a compact convolutional model significantly enhances similarity-search performance.
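For reference, a minimal involution layer (the operation this summary refers to) might look like the sketch below; the kernel size, groups, and reduction ratio are illustrative assumptions, and this is not that paper's actual model.

```python
# Minimal involution layer: the kernel is generated from the input itself,
# shared across channels within a group, and distinct at every pixel.
import torch
import torch.nn as nn

class Involution2d(nn.Module):
    def __init__(self, channels, kernel_size=7, groups=4, reduction=4):
        super().__init__()
        self.k, self.g = kernel_size, groups
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.BatchNorm2d(channels // reduction),
            nn.ReLU(inplace=True))
        self.span = nn.Conv2d(channels // reduction,
                              kernel_size * kernel_size * groups, 1)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        kernel = self.span(self.reduce(x))           # (B, K*K*G, H, W)
        kernel = kernel.view(b, self.g, 1, self.k * self.k, h, w)
        patches = self.unfold(x).view(b, self.g, c // self.g,
                                      self.k * self.k, h, w)
        return (kernel * patches).sum(dim=3).view(b, c, h, w)

feats = Involution2d(64)(torch.randn(2, 64, 32, 32))  # -> (2, 64, 32, 32)
```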
arXiv Detail & Related papers (2024-04-09T19:49:01Z)
- Minimalist and High-Performance Semantic Segmentation with Plain Vision Transformers [10.72362704573323]
We introduce PlainSeg, a model comprising only three 3×3 convolutions in addition to the transformer layers.
We also present PlainSeg-Hier, which makes use of hierarchical features.
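Taken at face value, "only three 3×3 convolutions in addition to the transformer layers" suggests a decoder head along the lines below; the channel widths and upsampling scheme are guesses, not PlainSeg's actual implementation.

```python
# Sketch of a PlainSeg-style head: exactly three 3x3 convolutions applied to
# plain (non-hierarchical) ViT patch tokens. Widths/upsampling are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PlainHeadSketch(nn.Module):
    def __init__(self, in_dim=768, hidden=256, n_classes=150):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_dim, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, n_classes, 3, padding=1))

    def forward(self, tokens, grid_hw, out_hw):
        # tokens: (B, N, C) patch tokens; grid_hw: token grid, e.g. (H/16, W/16)
        b, n, c = tokens.shape
        feat = tokens.transpose(1, 2).reshape(b, c, *grid_hw)
        return F.interpolate(self.convs(feat), size=out_hw,
                             mode='bilinear', align_corners=False)

logits = PlainHeadSketch()(torch.randn(1, 196, 768), (14, 14), (224, 224))
```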
arXiv Detail & Related papers (2023-10-19T14:01:40Z)
- Generic-to-Specific Distillation of Masked Autoencoders [119.21281960831651]
We propose generic-to-specific distillation (G2SD) to tap the potential of small ViT models under the supervision of large models pre-trained by masked autoencoders.
With G2SD, the vanilla ViT-Small model achieves 98.7%, 98.1%, and 99.3% of its teacher's performance on image classification, object detection, and semantic segmentation, respectively.
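The summary names the recipe but not its mechanics; a generic teacher-to-student step consistent with "supervision of large models pre-trained by masked autoencoders" could look like this hedged sketch, where the projection and loss choices are assumptions.

```python
# Hedged sketch of generic-to-specific style distillation: a frozen
# MAE-pretrained teacher supervises a small ViT's features and logits.
# Both models are assumed to return (logits, features).
import torch
import torch.nn as nn
import torch.nn.functional as F

def distill_step(student, teacher, proj, images, labels, alpha=1.0):
    # proj: e.g. nn.Linear(student_dim, teacher_dim), to align hidden widths.
    with torch.no_grad():
        _, t_feat = teacher(images)                   # teacher stays frozen
    s_logits, s_feat = student(images)
    task = F.cross_entropy(s_logits, labels)          # task-specific supervision
    generic = F.smooth_l1_loss(proj(s_feat), t_feat)  # generic feature mimicry
    return task + alpha * generic
```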
arXiv Detail & Related papers (2023-02-28T17:13:14Z)
- GOHSP: A Unified Framework of Graph and Optimization-based Heterogeneous Structured Pruning for Vision Transformer [76.2625311630021]
Vision transformers (ViTs) have shown very impressive empirical performance in various computer vision tasks.
To mitigate their heavy computation and memory costs, which make deployment challenging, structured pruning is a promising solution to compress model size and enable practical efficiency.
We propose GOHSP, a unified framework of Graph and Optimization-based Structured Pruning for ViT models.
arXiv Detail & Related papers (2023-01-13T00:40:24Z)
- GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation [25.689520892609213]
We present a novel non-hierarchical (i.e., non-pyramidal) transformer model for general visual recognition with high-resolution features.
We evaluate GPViT on a variety of visual recognition tasks including image classification, semantic segmentation, object detection, and instance segmentation.
arXiv Detail & Related papers (2022-12-13T18:26:00Z)
- ViTPose++: Vision Transformer for Generic Body Pose Estimation [70.86760562151163]
We show the surprisingly good properties of plain vision transformers for body pose estimation from various aspects.
ViTPose employs the plain and non-hierarchical vision transformer as an encoder to encode features and a lightweight decoder to decode body keypoints.
We empirically demonstrate that the knowledge of large ViTPose models can be easily transferred to small ones via a simple knowledge token.
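The "simple knowledge token" is only named here, not specified; one plausible, purely illustrative reading is an extra learnable token whose output in the large model supervises its counterpart in the small one:

```python
# Illustrative reading of a "knowledge token": append one learnable token to
# the input sequence and distill its output from teacher to student.
# This is an assumption, not ViTPose++'s documented design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WithKnowledgeToken(nn.Module):
    def __init__(self, encoder, dim):
        super().__init__()
        self.encoder = encoder                     # any (B, N, D) -> (B, N, D)
        self.token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, tokens):
        t = self.token.expand(tokens.size(0), -1, -1)
        out = self.encoder(torch.cat([t, tokens], dim=1))
        return out[:, 0], out[:, 1:]               # token output, patch tokens

def token_transfer_loss(student_tok, teacher_tok):
    # The student's knowledge-token output mimics the (detached) teacher's.
    return F.mse_loss(student_tok, teacher_tok.detach())
```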
arXiv Detail & Related papers (2022-12-07T12:33:28Z)
- Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection [39.37861288287621]
A MIM pre-trained vanilla ViT can work surprisingly well in the challenging object-level recognition scenario.
A randomly initialized compact convolutional stem supplants the pre-trained large-kernel patchify stem.
The proposed detector, named MIMDet, enables a MIM pre-trained vanilla ViT to outperform the hierarchical Swin Transformer by 2.3 box AP and 2.5 mask AP on COCO.
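A "random compact convolutional stem" standing in for the large-kernel patchify stem can be pictured as follows; the layer count and widths are assumptions, chosen only so the output matches a 16×16 patch grid.

```python
# Sketch: a randomly initialized conv stem that downsamples 16x in four
# stride-2 steps, replacing a ViT's single 16x16 patchify convolution.
# Widths and depth are illustrative assumptions.
import torch
import torch.nn as nn

def compact_conv_stem(out_dim=768):
    dims = [3, 64, 128, 256, out_dim]
    layers = []
    for i in range(4):
        layers += [nn.Conv2d(dims[i], dims[i + 1], 3, stride=2, padding=1),
                   nn.BatchNorm2d(dims[i + 1]), nn.GELU()]
    return nn.Sequential(*layers)

# (B, 3, 224, 224) -> (B, 196, 768): the same token grid as 16x16 patchify.
tokens = compact_conv_stem()(torch.randn(1, 3, 224, 224)).flatten(2).transpose(1, 2)
```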
arXiv Detail & Related papers (2022-04-06T17:59:04Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set.
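The "spatial pyramid reduction modules" are summarized only at a high level; a hedged sketch of the idea, downsampling while gathering multi-scale context through parallel dilated convolutions, follows, with the dilation rates and fusion step being assumptions.

```python
# Sketch of a spatial-pyramid reduction cell: parallel dilated convolutions
# downsample the input while collecting multi-scale context.
import torch
import torch.nn as nn

class PyramidReduction(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=r, dilation=r)
            for r in rates)
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        # Every branch yields the same spatial size, so channel-concat works.
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

feats = PyramidReduction(3, 64)(torch.randn(1, 3, 224, 224))  # (1, 64, 112, 112)
```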
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)