How Many Tokens Do 3D Point Cloud Transformer Architectures Really Need?
- URL: http://arxiv.org/abs/2511.05449v1
- Date: Fri, 07 Nov 2025 17:38:01 GMT
- Title: How Many Tokens Do 3D Point Cloud Transformer Architectures Really Need?
- Authors: Tuan Anh Tran, Duy M. H. Nguyen, Hoai-Chau Tran, Michael Barz, Khoa D. Doan, Roger Wattenhofer, Ngo Anh Vien, Mathias Niepert, Daniel Sonntag, Paul Swoboda
- Abstract summary: We present the finding that tokens are remarkably redundant, leading to substantial inefficiency. We introduce gitmerge3D, a globally informed graph token merging method that can reduce the token count by up to 90-95%. This work is the first to assess redundancy in large-scale 3D transformer models, providing insights into the development of more efficient 3D foundation architectures.
- Score: 56.09721366421187
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in 3D point cloud transformers have led to state-of-the-art results in tasks such as semantic segmentation and reconstruction. However, these models typically rely on dense token representations, incurring high computational and memory costs during training and inference. In this work, we present the finding that tokens are remarkably redundant, leading to substantial inefficiency. We introduce gitmerge3D, a globally informed graph token merging method that can reduce the token count by up to 90-95% while maintaining competitive performance. This finding challenges the prevailing assumption that more tokens inherently yield better performance and highlights that many current models are over-tokenized and under-optimized for scalability. We validate our method across multiple 3D vision tasks and show consistent improvements in computational efficiency. This work is the first to assess redundancy in large-scale 3D transformer models, providing insights into the development of more efficient 3D foundation architectures. Our code and checkpoints are publicly available at https://gitmerge3d.github.io
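The abstract describes merging redundant tokens to cut the token count by up to 90-95%. The following is a minimal illustrative sketch of similarity-based token merging in numpy; it is not the paper's gitmerge3D algorithm (which is globally informed and graph-based), and the anchor-selection and assignment strategies here are simplifying assumptions chosen for brevity.

```python
import numpy as np

def merge_tokens(tokens: np.ndarray, keep_ratio: float = 0.1) -> np.ndarray:
    """Toy similarity-based token merging (illustration only, not gitmerge3D).

    Assigns each token to its most cosine-similar anchor, then averages
    each group into a single merged token, shrinking N tokens to ~N*keep_ratio.
    """
    n, d = tokens.shape
    k = max(1, int(n * keep_ratio))                          # merged tokens to keep
    anchors = tokens[np.linspace(0, n - 1, k).astype(int)]   # evenly spaced anchors
    # cosine similarity between every token and every anchor
    t = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    a = anchors / (np.linalg.norm(anchors, axis=1, keepdims=True) + 1e-8)
    assign = (t @ a.T).argmax(axis=1)                        # nearest anchor per token
    # average each group into one merged token; keep the anchor if a group is empty
    merged = np.stack([tokens[assign == i].mean(axis=0) if (assign == i).any()
                       else anchors[i] for i in range(k)])
    return merged

tokens = np.random.default_rng(0).normal(size=(1000, 32))
merged = merge_tokens(tokens, keep_ratio=0.05)
print(merged.shape)  # (50, 32) -- a 95% reduction in token count
```

Downstream attention then operates on the merged tokens, so attention cost, which is quadratic in token count, drops by roughly `(1/keep_ratio)^2` under this sketch's assumptions.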
Related papers
- Token-UNet: A New Case for Transformers Integration in Efficient and Interpretable 3D UNets for Brain Imaging Segmentation [0.04117494580521492]
Transformers have enabled global interactions among input elements in medical imaging, but current computational challenges hinder their deployment on common hardware. We present Token-UNet, adopting the TokenLearner and TokenFuser modules to encase Transformers into UNets. We show this tokenization effectively encodes task-relevant information, yielding naturally interpretable attention maps.
arXiv Detail & Related papers (2026-02-23T16:15:38Z) - PatchAlign3D: Local Feature Alignment for Dense 3D Shape understanding [67.15800065888887]
Current foundation models for 3D shapes excel at global tasks (retrieval, classification) but transfer poorly to local part-level reasoning. We introduce an encoder-only 3D model that produces language-aligned patch-level features directly from point clouds. Our 3D encoder achieves zero-shot 3D part segmentation with fast single-pass inference without any test-time multi-view rendering.
arXiv Detail & Related papers (2026-01-05T18:55:45Z) - H$_{2}$OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers [124.11648300910444]
We present a hierarchical plug-and-play pruning-and-recovering framework, called Hierarchical Hourglass Tokenizer (H$_2$OT). Our method is general-purpose: it can be easily incorporated into common VPT models on both seq2seq and seq2frame pipelines.
arXiv Detail & Related papers (2025-09-08T17:59:59Z) - FastVGGT: Training-Free Acceleration of Visual Geometry Transformer [83.67766078575782]
VGGT is a state-of-the-art feed-forward visual geometry model. We propose FastVGGT, which leverages token merging in the 3D domain through a training-free mechanism for accelerating VGGT. With 1000 input images, FastVGGT achieves a 4x speedup over VGGT while mitigating error accumulation in long-sequence scenarios.
arXiv Detail & Related papers (2025-09-02T17:54:21Z) - Fast3D: Accelerating 3D Multi-modal Large Language Models for Efficient 3D Scene Understanding [24.964149224068027]
We propose Fast3D, a plug-and-play visual token pruning framework for 3D MLLMs. Global Attention Prediction (GAP) learns to predict the global attention distributions of the target model, enabling efficient token importance estimation. SAP introduces dynamic token budgets through attention-based complexity assessment, automatically adjusting layer-wise pruning ratios.
arXiv Detail & Related papers (2025-07-12T16:29:02Z) - Pts3D-LLM: Studying the Impact of Token Structure for 3D Scene Understanding With Large Language Models [9.658828841170472]
This work presents a rigorous study of 3D token structures, systematically comparing video-based and point-based representations. We propose a novel approach that enriches visual tokens by incorporating 3D point cloud features from a Sonata pretrained Point Transformer V3 encoder.
arXiv Detail & Related papers (2025-06-06T02:35:26Z) - Principles of Visual Tokens for Efficient Video Understanding [36.05950369461622]
We propose a lightweight video model, LITE, that can select a small number of tokens effectively. We show that LITE generalizes across datasets and even other tasks without the need for retraining.
arXiv Detail & Related papers (2024-11-20T14:09:47Z) - Efficient Point Transformer with Dynamic Token Aggregating for LiDAR Point Cloud Processing [19.73918716354272]
LiDAR point cloud processing and analysis have made great progress due to the development of 3D Transformers. However, existing 3D Transformer methods are usually computationally expensive and inefficient due to their huge and redundant attention maps. We propose an efficient point TransFormer with Dynamic Token Aggregating (DTA-Former) for point cloud representation and processing.
arXiv Detail & Related papers (2024-05-23T20:50:50Z) - CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
We set transformers in this work and incorporate them into a hierarchical framework for shape classification and part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z) - Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $\times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z) - Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer [91.49837514935051]
We propose a novel Vision Transformer, called Token Clustering Transformer (TCFormer).
TCFormer merges tokens by progressive clustering, where the tokens can be merged from different locations with flexible shapes and sizes.
Experiments show that TCFormer consistently outperforms its counterparts on different challenging human-centric tasks and datasets.
arXiv Detail & Related papers (2022-04-19T05:38:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.