PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models
- URL: http://arxiv.org/abs/2603.00412v1
- Date: Sat, 28 Feb 2026 02:17:46 GMT
- Title: PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models
- Authors: Yuanhao Su, Shaofeng Zhang, Xiaosong Jia, Qi Fan,
- Abstract summary: Existing methods rely solely on next-token prediction loss, using only language tokens for supervision. PointAlign explicitly supervises intermediate point cloud tokens to preserve fine-grained 3D geometric-semantic information. Experiments on the ModelNet40 and Objaverse datasets demonstrate that our method achieves a 2.08 pp improvement on average for classification tasks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The development of 3D Vision-Language Models (VLMs), crucial for applications in robotics, autonomous driving, and augmented reality, is severely constrained by the scarcity of paired 3D-text data. Existing methods rely solely on next-token prediction loss, using only language tokens for supervision. This results in inefficient utilization of limited 3D data and leads to a significant degradation and loss of valuable geometric information in intermediate representations. To address these limitations, we propose PointAlign, a novel feature-level alignment regularization method. PointAlign explicitly supervises intermediate point cloud tokens to preserve fine-grained 3D geometric-semantic information throughout the language modeling process. Specifically, we constrain the intermediate point cloud tokens within the LLM to align with the visual input tokens via a consistency loss. By training only a lightweight alignment projector and LoRA adapters, PointAlign achieves explicit feature-level supervision with minimal computational overhead, effectively preventing geometric degradation. Extensive experiments on the ModelNet40 and Objaverse datasets demonstrate that our method achieves a 2.08 pp improvement on average for classification tasks, with a substantial 7.50 pp gain on the challenging open-vocabulary Objaverse classification task and a 4.88 pp improvement on 3D object captioning evaluated by Qwen2-72B-Instruct, validating the effectiveness of PointAlign. Code is publicly available at https://github.com/yharoldsu0627/PointAlign.
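The mechanism the abstract describes (a consistency loss tying the LLM's intermediate point cloud tokens back to the visual input tokens, with only a lightweight alignment projector and LoRA adapters trained) can be sketched in a few lines. The following PyTorch sketch uses assumed names and shapes (AlignmentProjector, consistency_loss, lambda_align, and the cosine form of the loss); it illustrates the idea and is not the authors' released implementation.

```python
# Minimal sketch of PointAlign-style feature-level alignment regularization.
# All names, shapes, and the cosine form of the loss are assumptions drawn
# from the abstract, not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentProjector(nn.Module):
    """Lightweight projector mapping intermediate LLM point-cloud tokens
    back into the visual-encoder feature space (hypothetical design)."""
    def __init__(self, llm_dim: int, vis_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, vis_dim),
            nn.GELU(),
            nn.Linear(vis_dim, vis_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

def consistency_loss(pc_hidden: torch.Tensor,
                     vis_tokens: torch.Tensor,
                     projector: AlignmentProjector) -> torch.Tensor:
    """Cosine consistency between projected intermediate point-cloud tokens
    [B, N, llm_dim] and the visual input tokens [B, N, vis_dim]."""
    pred = F.normalize(projector(pc_hidden), dim=-1)
    target = F.normalize(vis_tokens.detach(), dim=-1)  # stop-gradient target
    return (1.0 - (pred * target).sum(dim=-1)).mean()

def total_loss(logits: torch.Tensor, labels: torch.Tensor,
               pc_hidden: torch.Tensor, vis_tokens: torch.Tensor,
               projector: AlignmentProjector,
               lambda_align: float = 0.5) -> torch.Tensor:
    """Next-token prediction plus the alignment regularizer.
    lambda_align is an assumed weighting hyperparameter."""
    ntp = F.cross_entropy(logits[:, :-1].flatten(0, 1),
                          labels[:, 1:].flatten())
    return ntp + lambda_align * consistency_loss(pc_hidden, vis_tokens, projector)
```

In such a setup, only the projector and LoRA adapter weights would receive gradients, and the stop-gradient on the visual tokens keeps the regularizer from degrading its own target.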
Related papers
- Unified Unsupervised and Sparsely-Supervised 3D Object Detection by Semantic Pseudo-Labeling and Prototype Learning [0.0]
3D object detection is essential for autonomous driving and robotic perception. To reduce annotation dependency, unsupervised and sparsely-supervised paradigms have emerged. This paper proposes SPL, a unified training framework for both Unsupervised and Sparsely-Supervised 3D Object Detection.
arXiv Detail & Related papers (2026-02-25T01:26:34Z) - Point Linguist Model: Segment Any Object via Bridged Large 3D-Language Model [51.02616473941499]
3D object segmentation with Large Language Models (LLMs) has become a prevailing paradigm due to its broad semantics, task flexibility, and strong generalization. However, this paradigm is hindered by representation misalignment: LLMs process high-level semantic tokens, whereas 3D point clouds convey only dense geometric structures. We present the Point Linguist Model (PLM), a general framework that bridges the representation gap between LLMs and dense 3D point clouds.
arXiv Detail & Related papers (2025-09-09T15:01:28Z) - Beyond BEV: Optimizing Point-Level Tokens for Collaborative Perception [17.654858416126093]
Collaborative perception allows agents to enhance their perceptual capabilities by exchanging intermediate features. Existing methods typically organize these intermediate features as 2D bird's-eye-view (BEV) representations. We present CoPLOT, a novel Collaborative perception framework that utilizes Point-Level Optimized Tokens.
arXiv Detail & Related papers (2025-08-27T07:27:42Z) - OccLE: Label-Efficient 3D Semantic Occupancy Prediction [68.60633561134571]
OccLE is a label-efficient 3D semantic occupancy prediction framework. It takes images and LiDAR as inputs and maintains high performance with limited voxel annotations. Experiments show that OccLE achieves competitive performance with only 10% of voxel annotations.
arXiv Detail & Related papers (2025-05-27T01:41:28Z) - Self-Supervised and Generalizable Tokenization for CLIP-Based 3D Understanding [87.68271178167373]
We present a universal 3D tokenizer designed for scale-invariant representation learning with a frozen CLIP backbone. S4Token is a tokenization pipeline that produces semantically-informed tokens regardless of scene scale.
arXiv Detail & Related papers (2025-05-24T18:26:30Z) - Rethinking End-to-End 2D to 3D Scene Segmentation in Gaussian Splatting [86.15347226865826]
We design a new end-to-end object-aware lifting approach, named Unified-Lift. We augment each Gaussian point with an additional Gaussian-level feature learned using a contrastive loss to encode instance information (a hedged sketch of such a contrastive term appears after this list). We conduct experiments on three benchmarks: LERF-Masked, Replica, and Messy Rooms.
arXiv Detail & Related papers (2025-03-18T08:42:23Z) - NeuraLoc: Visual Localization in Neural Implicit Map with Dual Complementary Features [50.212836834889146]
We propose an efficient and novel visual localization approach based on the neural implicit map with complementary features. Specifically, to enforce geometric constraints and reduce storage requirements, we implicitly learn a 3D keypoint descriptor field. To further address the semantic ambiguity of descriptors, we introduce additional semantic contextual feature fields.
arXiv Detail & Related papers (2025-03-08T08:04:27Z) - Open-Vocabulary Octree-Graph for 3D Scene Understanding [54.11828083068082]
Octree-Graph is a novel scene representation for open-vocabulary 3D scene understanding.
An adaptive-octree structure is developed that stores semantics and adaptively represents an object's occupancy according to its shape.
arXiv Detail & Related papers (2024-11-25T10:14:10Z) - LESS: Label-Efficient and Single-Stage Referring 3D Segmentation [55.06002976797879]
Referring 3D segmentation is a vision-language task that segments all points of the object specified by a natural-language query from a 3D point cloud.
We propose a novel referring 3D pipeline, Label-Efficient and Single-Stage, dubbed LESS, which is supervised only by efficient binary masks.
We achieve state-of-the-art performance on the ScanRefer dataset, surpassing previous methods by about 3.7% mIoU using only binary labels.
arXiv Detail & Related papers (2024-10-17T07:47:41Z) - Bayesian Self-Training for Semi-Supervised 3D Segmentation [59.544558398992386]
3D segmentation is a core problem in computer vision.
However, densely labeling 3D point clouds for fully-supervised training remains too labor-intensive and expensive.
Semi-supervised training provides a more practical alternative, where only a small set of labeled data is given, accompanied by a larger unlabeled set.
arXiv Detail & Related papers (2024-09-12T14:54:31Z) - 3D-AVS: LiDAR-based 3D Auto-Vocabulary Segmentation [20.7179907935644]
3D-AVS is a method for auto-vocabulary segmentation of 3D point clouds, where the vocabulary is unknown and auto-generated for each input at runtime. 3D-AVS first recognizes semantic entities from image or point cloud data and then segments all points with the automatically generated vocabulary. Our method incorporates both image-based and point-based recognition, enhancing robustness under challenging lighting conditions.
arXiv Detail & Related papers (2024-06-13T13:59:47Z) - EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding [4.447173454116189]
3D visual grounding aims to locate, within a point cloud, the object mentioned by a free-form natural language description with rich semantic cues.
We present EDA, which Explicitly Decouples the textual attributes in a sentence.
We further introduce a new visual grounding task, locating objects without object names, which can thoroughly evaluate the model's dense alignment capacity.
arXiv Detail & Related papers (2022-09-29T17:00:22Z)
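As noted in the Unified-Lift entry above, a Gaussian-level contrastive term could take the following supervised-contrastive form. This is a hedged sketch: the function name, the temperature value, and the assumption that per-Gaussian instance labels are available are illustrative, not taken from the paper.

```python
# Hypothetical sketch of a Gaussian-level contrastive loss in the spirit of
# Unified-Lift: per-Gaussian features of the same instance are pulled
# together, features of different instances pushed apart. Names assumed.
import torch
import torch.nn.functional as F

def instance_contrastive_loss(features: torch.Tensor,
                              instance_ids: torch.Tensor,
                              temperature: float = 0.1) -> torch.Tensor:
    """features: [N, D] learnable per-Gaussian features.
    instance_ids: [N] instance label per Gaussian (source assumed).
    Assumes each instance appears at least twice in the batch."""
    f = F.normalize(features, dim=-1)
    sim = (f @ f.t()) / temperature                       # [N, N] similarities
    not_self = ~torch.eye(f.size(0), dtype=torch.bool, device=f.device)
    # Log-softmax over all non-self pairs for each anchor row.
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(~not_self, float("-inf")), dim=1, keepdim=True)
    positives = (instance_ids[:, None] == instance_ids[None, :]) & not_self
    return -log_prob[positives].mean()
```

Minimizing such a term would make the extra per-Gaussian feature behave like an instance embedding that later segmentation or lifting steps could exploit.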