PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models
- URL: http://arxiv.org/abs/2603.00412v1
- Date: Sat, 28 Feb 2026 02:17:46 GMT
- Title: PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models
- Authors: Yuanhao Su, Shaofeng Zhang, Xiaosong Jia, Qi Fan,
- Abstract summary: Existing methods rely solely on next-token prediction loss, using only language tokens for supervision. PointAlign explicitly supervises intermediate point cloud tokens to preserve fine-grained 3D geometric-semantic information. Experiments on the ModelNet40 and Objaverse datasets demonstrate that our method achieves a 2.08 pp improvement on average for classification tasks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The development of 3D Vision-Language Models (VLMs), crucial for applications in robotics, autonomous driving, and augmented reality, is severely constrained by the scarcity of paired 3D-text data. Existing methods rely solely on next-token prediction loss, using only language tokens for supervision. This results in inefficient utilization of limited 3D data and leads to a significant degradation and loss of valuable geometric information in intermediate representations. To address these limitations, we propose PointAlign, a novel feature-level alignment regularization method. PointAlign explicitly supervises intermediate point cloud tokens to preserve fine-grained 3D geometric-semantic information throughout the language modeling process. Specifically, we constrain the intermediate point cloud tokens within the LLM to align with the visual input tokens via a consistency loss. By training only a lightweight alignment projector and LoRA adapters, PointAlign achieves explicit feature-level supervision with minimal computational overhead, effectively preventing geometric degradation. Extensive experiments on the ModelNet40 and Objaverse datasets demonstrate that our method achieves a 2.08 pp improvement on average for classification tasks, with a substantial 7.50 pp gain on the challenging open-vocabulary Objaverse classification task and a 4.88 pp improvement on 3D object captioning evaluated by Qwen2-72B-Instruct, validating the effectiveness of PointAlign. Code is publicly available at https://github.com/yharoldsu0627/PointAlign.
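The mechanism the abstract describes (a consistency loss tying the LLM's intermediate point cloud tokens back to the visual input tokens, with only a lightweight alignment projector and LoRA adapters trained) can be sketched in a few lines. The following PyTorch sketch uses assumed names and shapes (AlignmentProjector, consistency_loss, lambda_align, and the cosine form of the loss); it illustrates the idea and is not the authors' released implementation.

```python
# Minimal sketch of PointAlign-style feature-level alignment regularization.
# All names, shapes, and the cosine form of the loss are assumptions drawn
# from the abstract, not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentProjector(nn.Module):
    """Lightweight projector mapping intermediate LLM point-cloud tokens
    back into the visual-encoder feature space (hypothetical design)."""
    def __init__(self, llm_dim: int, vis_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, vis_dim),
            nn.GELU(),
            nn.Linear(vis_dim, vis_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

def consistency_loss(pc_hidden: torch.Tensor,
                     vis_tokens: torch.Tensor,
                     projector: AlignmentProjector) -> torch.Tensor:
    """Cosine consistency between projected intermediate point-cloud tokens
    [B, N, llm_dim] and the visual input tokens [B, N, vis_dim]."""
    pred = F.normalize(projector(pc_hidden), dim=-1)
    target = F.normalize(vis_tokens.detach(), dim=-1)  # stop-gradient target
    return (1.0 - (pred * target).sum(dim=-1)).mean()

def total_loss(logits: torch.Tensor, labels: torch.Tensor,
               pc_hidden: torch.Tensor, vis_tokens: torch.Tensor,
               projector: AlignmentProjector,
               lambda_align: float = 0.5) -> torch.Tensor:
    """Next-token prediction plus the alignment regularizer.
    lambda_align is an assumed weighting hyperparameter."""
    ntp = F.cross_entropy(logits[:, :-1].flatten(0, 1),
                          labels[:, 1:].flatten())
    return ntp + lambda_align * consistency_loss(pc_hidden, vis_tokens, projector)
```

In such a setup, only the projector and LoRA adapter weights would receive gradients, and the stop-gradient on the visual tokens keeps the regularizer from degrading its own target.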
Related papers
- Unified Unsupervised and Sparsely-Supervised 3D Object Detection by Semantic Pseudo-Labeling and Prototype Learning [0.0]
3D object detection is essential for autonomous driving and robotic perception. To reduce annotation dependency, unsupervised and sparsely-supervised paradigms have emerged. This paper proposes SPL, a unified training framework for both Unsupervised and Sparsely-Supervised 3D Object Detection.
arXiv Detail & Related papers (2026-02-25T01:26:34Z) - Point Linguist Model: Segment Any Object via Bridged Large 3D-Language Model [51.02616473941499]
3D object segmentation with Large Language Models (LLMs) has become a prevailing paradigm due to its broad semantics, task flexibility, and strong generalization. However, this paradigm is hindered by representation misalignment: LLMs process high-level semantic tokens, whereas 3D point clouds convey only dense geometric structures. We present the Point Linguist Model (PLM), a general framework that bridges the representation gap between LLMs and dense 3D point clouds.
arXiv Detail & Related papers (2025-09-09T15:01:28Z) - Beyond BEV: Optimizing Point-Level Tokens for Collaborative Perception [17.654858416126093]
Collaborative perception allows agents to enhance their perceptual capabilities by exchanging intermediate features. Existing methods typically organize these intermediate features as 2D bird's-eye-view (BEV) representations. We present CoPLOT, a novel Collaborative perception framework that utilizes Point-Level Optimized Tokens.
arXiv Detail & Related papers (2025-08-27T07:27:42Z) - OccLE: Label-Efficient 3D Semantic Occupancy Prediction [68.60633561134571]
OccLE is a label-efficient 3D semantic occupancy prediction framework. It takes images and LiDAR as inputs and maintains high performance with limited voxel annotations. Experiments show that OccLE achieves competitive performance with only 10% of voxel annotations.
arXiv Detail & Related papers (2025-05-27T01:41:28Z) - Self-Supervised and Generalizable Tokenization for CLIP-Based 3D Understanding [87.68271178167373]
We present a universal 3D tokenizer designed for scale-invariant representation learning with a frozen CLIP backbone. S4Token is a tokenization pipeline that produces semantically-informed tokens regardless of scene scale.
arXiv Detail & Related papers (2025-05-24T18:26:30Z) - Rethinking End-to-End 2D to 3D Scene Segmentation in Gaussian Splatting [86.15347226865826]
We design a new end-to-end object-aware lifting approach, named Unified-Lift. We augment each Gaussian point with an additional Gaussian-level feature learned using a contrastive loss to encode instance information (a hedged sketch of such a contrastive term appears after this list). We conduct experiments on three benchmarks: LERF-Masked, Replica, and Messy Rooms.
arXiv Detail & Related papers (2025-03-18T08:42:23Z) - NeuraLoc: Visual Localization in Neural Implicit Map with Dual Complementary Features [50.212836834889146]
We propose an efficient and novel visual localization approach based on the neural implicit map with complementary features. Specifically, to enforce geometric constraints and reduce storage requirements, we implicitly learn a 3D keypoint descriptor field. To further address the semantic ambiguity of descriptors, we introduce additional semantic contextual feature fields.
arXiv Detail & Related papers (2025-03-08T08:04:27Z) - Open-Vocabulary Octree-Graph for 3D Scene Understanding [54.11828083068082]
Octree-Graph is a novel scene representation for open-vocabulary 3D scene understanding.
An adaptive-octree structure is developed that stores semantics and adaptively represents an object's occupancy according to its shape.
arXiv Detail & Related papers (2024-11-25T10:14:10Z) - LESS: Label-Efficient and Single-Stage Referring 3D Segmentation [55.06002976797879]
Referring 3D segmentation is a vision-language task that segments all points of the object specified by a natural-language query from a 3D point cloud.
We propose a novel referring 3D pipeline, Label-Efficient and Single-Stage, dubbed LESS, which is supervised only by efficient binary masks.
We achieve state-of-the-art performance on the ScanRefer dataset, surpassing previous methods by about 3.7% mIoU using only binary labels.
arXiv Detail & Related papers (2024-10-17T07:47:41Z) - Bayesian Self-Training for Semi-Supervised 3D Segmentation [59.544558398992386]
3D segmentation is a core problem in computer vision.
However, densely labeling 3D point clouds for fully-supervised training remains too labor-intensive and expensive.
Semi-supervised training provides a more practical alternative, where only a small set of labeled data is given, accompanied by a larger unlabeled set.
arXiv Detail & Related papers (2024-09-12T14:54:31Z) - 3D-AVS: LiDAR-based 3D Auto-Vocabulary Segmentation [20.7179907935644]
3D-AVS is a method for auto-vocabulary segmentation of 3D point clouds, where the vocabulary is unknown and auto-generated for each input at runtime. 3D-AVS first recognizes semantic entities from image or point cloud data and then segments all points with the automatically generated vocabulary. Our method incorporates both image-based and point-based recognition, enhancing robustness under challenging lighting conditions.
arXiv Detail & Related papers (2024-06-13T13:59:47Z) - EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding [4.447173454116189]
3D visual grounding aims to locate, within a point cloud, the object mentioned by a free-form natural language description with rich semantic cues.
We present EDA, which Explicitly Decouples the textual attributes in a sentence.
We further introduce a new visual grounding task, locating objects without object names, which can thoroughly evaluate the model's dense alignment capacity.
arXiv Detail & Related papers (2022-09-29T17:00:22Z)
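As noted in the Unified-Lift entry above, a Gaussian-level contrastive term could take the following supervised-contrastive form. This is a hedged sketch: the function name, the temperature value, and the assumption that per-Gaussian instance labels are available are illustrative, not taken from the paper.

```python
# Hypothetical sketch of a Gaussian-level contrastive loss in the spirit of
# Unified-Lift: per-Gaussian features of the same instance are pulled
# together, features of different instances pushed apart. Names assumed.
import torch
import torch.nn.functional as F

def instance_contrastive_loss(features: torch.Tensor,
                              instance_ids: torch.Tensor,
                              temperature: float = 0.1) -> torch.Tensor:
    """features: [N, D] learnable per-Gaussian features.
    instance_ids: [N] instance label per Gaussian (source assumed).
    Assumes each instance appears at least twice in the batch."""
    f = F.normalize(features, dim=-1)
    sim = (f @ f.t()) / temperature                       # [N, N] similarities
    not_self = ~torch.eye(f.size(0), dtype=torch.bool, device=f.device)
    # Log-softmax over all non-self pairs for each anchor row.
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(~not_self, float("-inf")), dim=1, keepdim=True)
    positives = (instance_ids[:, None] == instance_ids[None, :]) & not_self
    return -log_prob[positives].mean()
```

Minimizing such a term would make the extra per-Gaussian feature behave like an instance embedding that later segmentation or lifting steps could exploit.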