TrackAny3D: Transferring Pretrained 3D Models for Category-unified 3D Point Cloud Tracking
- URL: http://arxiv.org/abs/2507.19908v1
- Date: Sat, 26 Jul 2025 10:41:55 GMT
- Title: TrackAny3D: Transferring Pretrained 3D Models for Category-unified 3D Point Cloud Tracking
- Authors: Mengmeng Wang, Haonan Wang, Yulong Li, Xiangjie Kong, Jiaxin Du, Guojiang Shen, Feng Xia
- Abstract summary: TrackAny3D is the first framework to transfer large-scale pretrained 3D models for category-agnostic 3D SOT. The MoGE architecture adaptively activates specialized subnetworks based on distinct geometric characteristics. Experiments show that TrackAny3D establishes new state-of-the-art performance on category-agnostic 3D SOT.
- Score: 25.788917457593673
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D LiDAR-based single object tracking (SOT) relies on sparse and irregular point clouds, posing challenges from geometric variations in scale, motion patterns, and structural complexity across object categories. Current category-specific approaches achieve good accuracy but are impractical for real-world use, requiring separate models for each category and showing limited generalization. To tackle these issues, we propose TrackAny3D, the first framework to transfer large-scale pretrained 3D models for category-agnostic 3D SOT. We first integrate parameter-efficient adapters to bridge the gap between pretraining and tracking tasks while preserving geometric priors. Then, we introduce a Mixture-of-Geometry-Experts (MoGE) architecture that adaptively activates specialized subnetworks based on distinct geometric characteristics. Additionally, we design a temporal context optimization strategy that incorporates learnable temporal tokens and a dynamic mask weighting module to propagate historical information and mitigate temporal drift. Experiments on three commonly-used benchmarks show that TrackAny3D establishes new state-of-the-art performance on category-agnostic 3D SOT, demonstrating strong generalization and competitiveness. We hope this work will enlighten the community on the importance of unified models and further expand the use of large-scale pretrained models in this field.
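The paper itself ships no code here, but the Mixture-of-Geometry-Experts idea in the abstract can be pictured with a minimal PyTorch sketch: a lightweight router scores a set of expert MLPs per point token and blends the top-k outputs, so objects with different geometric characteristics activate different subnetworks. All names (`MoGELayer`, `num_experts`, `top_k`) and design details below are illustrative assumptions, not the authors' implementation; the residual connection is in the spirit of the parameter-efficient adapters the abstract mentions.

```python
import torch
import torch.nn as nn

class MoGELayer(nn.Module):
    """Illustrative Mixture-of-Geometry-Experts layer (hypothetical names).

    A router scores each point-cloud token against a set of expert MLPs
    and blends the top-k experts' outputs, so differently shaped objects
    are routed to different specialized subnetworks.
    """

    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(dim, num_experts)  # per-token expert scores
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_points, dim) point-cloud token features
        scores = self.router(x)                         # (B, N, E)
        weights, idx = scores.topk(self.top_k, dim=-1)  # keep top-k experts
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                 # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return x + out  # residual keeps the frozen pretrained backbone intact

layer = MoGELayer(dim=256)
feats = torch.randn(2, 1024, 256)  # two point clouds, 1024 tokens each
print(layer(feats).shape)          # torch.Size([2, 1024, 256])
```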
Related papers
- SeqAffordSplat: Scene-level Sequential Affordance Reasoning on 3D Gaussian Splatting [85.87902260102652]
We introduce the novel task of Sequential 3D Gaussian Affordance Reasoning. We then propose SeqSplatNet, an end-to-end framework that directly maps an instruction to a sequence of 3D affordance masks. Our method sets a new state-of-the-art on our challenging benchmark, effectively advancing affordance reasoning from single-step interactions to complex, sequential tasks at the scene level.
arXiv Detail & Related papers (2025-07-31T17:56:55Z)
- UniPre3D: Unified Pre-training of 3D Point Cloud Models with Cross-Modal Gaussian Splatting [64.31900521467362]
No existing pre-training method is equally effective for both object- and scene-level point clouds. We introduce UniPre3D, the first unified pre-training method that can be seamlessly applied to point clouds of any scale and 3D models of any architecture.
arXiv Detail & Related papers (2025-06-11T17:23:21Z)
- On Geometry-Enhanced Parameter-Efficient Fine-Tuning for 3D Scene Segmentation [52.96632954620623]
We introduce a novel geometry-aware PEFT module specifically designed for 3D point cloud transformers. Our approach sets a new benchmark for efficient, scalable, and geometry-aware fine-tuning of large-scale 3D point cloud models.
arXiv Detail & Related papers (2025-05-28T15:08:36Z)
- Proto-FG3D: Prototype-based Interpretable Fine-Grained 3D Shape Classification [59.68055837500357]
We propose the first prototype-based framework named Proto-FG3D for fine-grained 3D shape classification. Proto-FG3D establishes joint multi-view and multi-category representation learning via Prototype Association. Proto-FG3D surpasses state-of-the-art methods in accuracy, transparent predictions, and ad-hoc interpretability with visualizations.
arXiv Detail & Related papers (2025-05-23T09:31:02Z)
- MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion [118.74385965694694]
We present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes. By simply estimating a pointmap for each timestep, we can effectively adapt DUSt3R's representation, previously used only for static scenes, to dynamic scenes. We show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics.
arXiv Detail & Related papers (2024-10-04T18:00:07Z)
- Towards Category Unification of 3D Single Object Tracking on Point Clouds [10.64650098374183]
Category-specific models have proven valuable in 3D single object tracking (SOT) under both Siamese and motion-centric paradigms.
This paper first introduces unified models that can simultaneously track objects across all categories using a single network with shared model parameters.
arXiv Detail & Related papers (2024-01-20T10:38:28Z)
- Cylindrical and Asymmetrical 3D Convolution Networks for LiDAR-based Perception [122.53774221136193]
State-of-the-art methods for driving-scene LiDAR-based perception often project the point clouds to 2D space and then process them via 2D convolution.
A natural remedy is to use 3D voxelization and 3D convolutional networks. We propose a new framework for outdoor LiDAR segmentation, in which cylindrical partition and asymmetrical 3D convolution networks are designed to explore the 3D geometric pattern (a minimal sketch of the partition step follows).
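The cylindrical partition can be pictured with a few lines of NumPy: points are mapped from Cartesian (x, y, z) to cylindrical (ρ, φ, z) coordinates before voxelization, so bins roughly follow the radially varying density of LiDAR sweeps. Grid sizes and coordinate ranges below are made-up examples, not the paper's settings.

```python
import numpy as np

def cylindrical_voxel_indices(points, grid=(480, 360, 32),
                              rho_range=(0.0, 50.0), z_range=(-4.0, 2.0)):
    """Assign each LiDAR point to a cylindrical (rho, phi, z) voxel.

    points: (N, 3) array of Cartesian x, y, z coordinates.
    Returns an (N, 3) integer array of voxel indices. Grid resolution
    and ranges here are illustrative assumptions.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rho = np.sqrt(x ** 2 + y ** 2)  # radial distance from the sensor
    phi = np.arctan2(y, x)          # azimuth angle in [-pi, pi)

    # Normalize each coordinate into [0, 1), then scale to the grid resolution.
    rho_n = (rho - rho_range[0]) / (rho_range[1] - rho_range[0])
    phi_n = (phi + np.pi) / (2 * np.pi)
    z_n = (z - z_range[0]) / (z_range[1] - z_range[0])

    norm = np.stack([rho_n, phi_n, z_n], axis=1)
    return np.clip((norm * np.array(grid)).astype(np.int64),
                   0, np.array(grid) - 1)

# Example: five random points around the sensor.
pts = np.random.uniform([-20, -20, -3], [20, 20, 1], size=(5, 3))
print(cylindrical_voxel_indices(pts))
```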
arXiv Detail & Related papers (2021-09-12T06:25:11Z)
- BundleTrack: 6D Pose Tracking for Novel Objects without Instance or Category-Level 3D Models [1.14219428942199]
This work proposes BundleTrack, a general framework for 6D pose tracking of objects.
An efficient implementation of the framework achieves real-time performance of 10 Hz end-to-end.
arXiv Detail & Related papers (2021-08-01T18:14:46Z)
- Learning Compositional Shape Priors for Few-Shot 3D Reconstruction [36.40776735291117]
We show that complex encoder-decoder architectures succeed largely by exploiting large amounts of per-category data.
We propose three ways to learn a class-specific global shape prior, directly from data.
Experiments on the popular ShapeNet dataset show that our method outperforms a zero-shot baseline by over 40%.
arXiv Detail & Related papers (2021-06-11T14:55:49Z)
- Monocular 3D Detection with Geometric Constraints Embedding and Semi-supervised Training [3.8073142980733]
We propose a novel framework for monocular 3D object detection using only RGB images, called KM3D-Net. We design a fully convolutional model to predict object keypoints, dimensions, and orientation, and then combine these estimates with perspective geometry constraints to compute the position attribute (a sketch of this step follows).
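The "perspective geometry constraints" step can be sketched as a small least-squares problem, a common formulation in keypoint-based monocular 3D detectors; the exact KM3D-Net procedure may differ, and all conventions below (KITTI-style corner layout, y-down camera frame) are assumptions. Given predicted 2D corner keypoints, box dimensions, yaw, and camera intrinsics, the box center is the translation whose reprojected corners best match the keypoints.

```python
import numpy as np

def box_corners(dims, yaw):
    """8 corner offsets of a 3D box (object frame, KITTI-style), rotated by yaw."""
    h, w, l = dims
    x = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2
    y = np.array([ 0,  0,  0,  0, -h, -h, -h, -h], dtype=float)
    z = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])  # rotation about the vertical axis
    return np.stack([x, y, z], axis=1) @ R.T          # (8, 3)

def solve_center(kpts_2d, dims, yaw, K):
    """Least-squares translation T such that K @ (corner_i + T) reprojects to kpts_2d."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    A, b = [], []
    for (u, v), (X, Y, Z) in zip(kpts_2d, box_corners(dims, yaw)):
        # u = (fx*(X+Tx) + cx*(Z+Tz)) / (Z+Tz), rearranged to be linear in T
        A.append([fx, 0.0, cx - u])
        b.append(u * Z - fx * X - cx * Z)
        A.append([0.0, fy, cy - v])
        b.append(v * Z - fy * Y - cy * Z)
    T, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return T  # (Tx, Ty, Tz) in camera coordinates

# Synthetic check: a car-sized box 15 m ahead is recovered from its projections.
K = np.array([[721.5, 0.0, 609.6], [0.0, 721.5, 172.8], [0.0, 0.0, 1.0]])
dims, yaw, T_true = (1.5, 1.6, 3.9), 0.3, np.array([2.0, 1.0, 15.0])
cam = box_corners(dims, yaw) + T_true                 # corners in camera frame
uv = cam @ K.T
uv = uv[:, :2] / uv[:, 2:3]                           # perspective projection
print(np.allclose(solve_center(uv, dims, yaw, K), T_true))  # True
```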
arXiv Detail & Related papers (2020-09-02T00:51:51Z)
- Few-Shot Single-View 3-D Object Reconstruction with Compositional Priors [30.262308825799167]
We show that complex encoder-decoder architectures perform similarly to nearest-neighbor baselines in standard benchmarks.
We propose three approaches that efficiently integrate a class prior into a 3D reconstruction model.
arXiv Detail & Related papers (2020-04-14T04:53:34Z)