Token Adaptation via Side Graph Convolution for Temporally and Spatially Efficient Fine-tuning of 3D Point Cloud Transformers
- URL: http://arxiv.org/abs/2502.14142v1
- Date: Wed, 19 Feb 2025 22:58:56 GMT
- Title: Token Adaptation via Side Graph Convolution for Temporally and Spatially Efficient Fine-tuning of 3D Point Cloud Transformers
- Authors: Takahiko Furuya,
- Abstract summary: This paper proposes a novel PEFT algorithm for 3D point cloud Transformers, called Side Token Adaptation on a neighborhood Graph (STAG)
STAG employs a graph convolutional side network that operates in parallel with a frozen backbone Transformer to adapt tokens to downstream tasks.
We present Point Cloud Classification 13 (PCC13), a new benchmark comprising diverse publicly available 3D point cloud datasets.
- Score: 1.19658449368018
- License:
- Abstract: Parameter-efficient fine-tuning (PEFT) of pre-trained 3D point cloud Transformers has emerged as a promising technique for 3D point cloud analysis. While existing PEFT methods attempt to minimize the number of tunable parameters, they still suffer from high temporal and spatial computational costs during fine-tuning. This paper proposes a novel PEFT algorithm for 3D point cloud Transformers, called Side Token Adaptation on a neighborhood Graph (STAG), to achieve superior temporal and spatial efficiency. STAG employs a graph convolutional side network that operates in parallel with a frozen backbone Transformer to adapt tokens to downstream tasks. STAG's side network realizes high efficiency through three key components: connection with the backbone that enables reduced gradient computation, parameter sharing framework, and efficient graph convolution. Furthermore, we present Point Cloud Classification 13 (PCC13), a new benchmark comprising diverse publicly available 3D point cloud datasets, enabling comprehensive evaluation of PEFT methods. Extensive experiments using multiple pre-trained models and PCC13 demonstrates the effectiveness of STAG. Specifically, STAG maintains classification accuracy comparable to existing methods while reducing tunable parameters to only 0.43M and achieving significant reductions in both computational time and memory consumption for fine-tuning. Code and benchmark will be available at: https://github.com/takahikof/STAG
Related papers
- Sparse-Tuning: Adapting Vision Transformers with Efficient Fine-tuning and Inference [14.030836300221756]
textbfSparse-Tuning is a novel PEFT method that accounts for the information redundancy in images and videos.
Sparse-Tuning minimizes the quantity of tokens processed at each layer, leading to a quadratic reduction in computational and memory overhead.
Our results show that our Sparse-Tuning reduces GFLOPs to textbf62%-70% of the original ViT-B while achieving state-of-the-art performance.
arXiv Detail & Related papers (2024-05-23T15:34:53Z) - Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis [51.14136878142034]
Point cloud analysis has achieved outstanding performance by transferring point cloud pre-trained models.
Existing methods for model adaptation usually update all model parameters, which is inefficient as it relies on high computational costs.
In this paper, we aim to study parameter-efficient transfer learning for point cloud analysis with an ideal trade-off between task performance and parameter efficiency.
arXiv Detail & Related papers (2024-03-03T08:25:04Z) - Adaptive Point Transformer [88.28498667506165]
Adaptive Point Cloud Transformer (AdaPT) is a standard PT model augmented by an adaptive token selection mechanism.
AdaPT dynamically reduces the number of tokens during inference, enabling efficient processing of large point clouds.
arXiv Detail & Related papers (2024-01-26T13:24:45Z) - Point-PEFT: Parameter-Efficient Fine-Tuning for 3D Pre-trained Models [46.42092771753465]
We introduce Point-PEFT, a novel framework for adapting point cloud pre-trained models with minimal learnable parameters.
Specifically, for a pre-trained 3D model, we freeze most of its parameters, and only tune the newly added PEFT modules on downstream tasks.
arXiv Detail & Related papers (2023-10-04T16:49:36Z) - DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets [95.84755169585492]
We present Dynamic Sparse Voxel Transformer (DSVT), a single-stride window-based voxel Transformer backbone for outdoor 3D perception.
Our model achieves state-of-the-art performance with a broad range of 3D perception tasks.
arXiv Detail & Related papers (2023-01-15T09:31:58Z) - AdaPoinTr: Diverse Point Cloud Completion with Adaptive Geometry-Aware
Transformers [94.11915008006483]
We present a new method that reformulates point cloud completion as a set-to-set translation problem.
We design a new model, called PoinTr, which adopts a Transformer encoder-decoder architecture for point cloud completion.
Our method attains 6.53 CD on PCN, 0.81 CD on ShapeNet-55 and 0.392 MMD on real-world KITTI.
arXiv Detail & Related papers (2023-01-11T16:14:12Z) - EPCL: Frozen CLIP Transformer is An Efficient Point Cloud Encoder [60.52613206271329]
This paper introduces textbfEfficient textbfPoint textbfCloud textbfLearning (EPCL) for training high-quality point cloud models with a frozen CLIP transformer.
Our EPCL connects the 2D and 3D modalities by semantically aligning the image features and point cloud features without paired 2D-3D data.
arXiv Detail & Related papers (2022-12-08T06:27:11Z) - CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point
Cloud Learning [81.85951026033787]
We set transformers in this work and incorporate them into a hierarchical framework for shape classification and part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with the previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z) - Point-Voxel Transformer: An Efficient Approach To 3D Deep Learning [5.236787242129767]
We present a novel 3D Transformer, called Point-Voxel Transformer (PVT) that leverages self-attention computation in points to gather global context features.
Our method fully exploits the potentials of Transformer architecture, paving the road to efficient and accurate recognition results.
arXiv Detail & Related papers (2021-08-13T06:07:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.