3DAlign-DAER: Dynamic Attention Policy and Efficient Retrieval Strategy for Fine-grained 3D-Text Alignment at Scale
- URL: http://arxiv.org/abs/2511.13211v1
- Date: Mon, 17 Nov 2025 10:23:29 GMT
- Title: 3DAlign-DAER: Dynamic Attention Policy and Efficient Retrieval Strategy for Fine-grained 3D-Text Alignment at Scale
- Authors: Yijia Fan, Jusheng Zhang, Kaitong Cai, Jing Yang, Jian Wang, Keze Wang
- Abstract summary: 3DAlign-DAER is a framework designed to align text and 3D geometry via the proposed dynamic attention policy and the efficient retrieval strategy. To facilitate text-3D alignment research and train our 3DAlign-DAER, we construct Align3D-2M, a large-scale dataset featuring 2M text-3D pairs.
- Score: 13.561331612635044
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite recent advancements in 3D-text cross-modal alignment, existing state-of-the-art methods still struggle to align fine-grained textual semantics with detailed geometric structures, and their alignment performance degrades significantly when scaling to large-scale 3D databases. To overcome this limitation, we introduce 3DAlign-DAER, a unified framework designed to align text and 3D geometry via the proposed dynamic attention policy and the efficient retrieval strategy, capturing subtle correspondences for diverse cross-modal retrieval and classification tasks. Specifically, during training, our proposed dynamic attention policy (DAP) employs the Hierarchical Attention Fusion (HAF) module to represent the alignment as learnable fine-grained token-to-point attentions. To optimize these attentions across different tasks and geometric hierarchies, our DAP further uses Monte Carlo tree search to dynamically calibrate HAF attention weights via a hybrid reward signal, further enhancing the alignment between textual descriptions and local 3D geometry. During inference, our 3DAlign-DAER introduces an Efficient Retrieval Strategy (ERS) that performs efficient hierarchical search in large-scale embedding spaces, outperforming traditional methods (e.g., KNN) in accuracy and efficiency. Furthermore, to facilitate text-3D alignment research and train our 3DAlign-DAER, we construct Align3D-2M, a large-scale dataset featuring 2M text-3D pairs, to provide sufficient fine-grained cross-modal annotations. Extensive experiments demonstrate the superior performance of our 3DAlign-DAER on diverse benchmarks. We will release our code, models, and datasets.
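The abstract contrasts hierarchical search in the embedding space with exhaustive KNN but does not spell out how ERS works. The following is only a minimal sketch of a generic coarse-to-fine (inverted-file style) search, not the authors' ERS: the function names, the random choice of centroids, and the `n_probe` parameter are all assumptions made for illustration. Embeddings are stand-in random unit vectors rather than real text or 3D features.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def brute_force_top_k(query, db, k):
    """Exhaustive baseline: score every database vector against the query."""
    q = normalize(query)
    return sorted(range(len(db)), key=lambda i: -dot(q, db[i]))[:k]

def build_index(db, n_clusters, seed=0):
    """Coarse level (assumed design): sample centroids from the database
    and bucket every item under its most similar centroid."""
    rng = random.Random(seed)
    centroids = [db[i] for i in rng.sample(range(len(db)), n_clusters)]
    buckets = [[] for _ in range(n_clusters)]
    for i, v in enumerate(db):
        best = max(range(n_clusters), key=lambda c: dot(v, centroids[c]))
        buckets[best].append(i)
    return centroids, buckets

def hierarchical_top_k(query, db, centroids, buckets, k, n_probe):
    """Fine level: rank centroids first, then score only the items
    stored in the n_probe best-matching buckets."""
    q = normalize(query)
    order = sorted(range(len(centroids)), key=lambda c: -dot(q, centroids[c]))
    candidates = [i for c in order[:n_probe] for i in buckets[c]]
    candidates.sort(key=lambda i: -dot(q, db[i]))
    return candidates[:k]

# Toy data: 200 random 8-dimensional unit vectors standing in for 3D-shape embeddings.
rng = random.Random(42)
db = [normalize([rng.gauss(0, 1) for _ in range(8)]) for _ in range(200)]
centroids, buckets = build_index(db, n_clusters=10)

query = db[7]  # pretend a text embedding lands exactly on item 7
exact = brute_force_top_k(query, db, k=5)
approx = hierarchical_top_k(query, db, centroids, buckets, k=5, n_probe=3)
print("exact :", exact)
print("approx:", approx)
```

With `n_probe` equal to the number of clusters the two functions score the same candidates, so the results coincide; smaller values trade recall for fewer similarity computations. A generic scheme like this is usually only faster than exhaustive search, not more accurate, so it does not by itself explain the accuracy gain the abstract reports for ERS.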
Related papers
- SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs [21.891285551179365]
We introduce Spherical Coordinate-based Positional Embedding (SoPE). Our method maps point-cloud token indices into a 3D spherical coordinate space, enabling unified modeling of spatial locations and directional angles. This formulation preserves the inherent geometric structure of point-cloud data, enhances spatial awareness, and yields more consistent and expressive geometric representations for multimodal learning.
arXiv Detail & Related papers (2026-02-26T07:42:15Z)
- TIGaussian: Disentangle Gaussians for Spatial-Awared Text-Image-3D Alignment [58.46706158310462]
TIGaussian harnesses 3D Gaussian Splatting (3DGS) characteristics to strengthen cross-modality alignment. Our multi-branch 3DGS tokenizer decouples the intrinsic properties of 3DGS structures into compact latent representations. A text-3D projection module adaptively maps 3D features to text embedding space for better text-3D alignment.
arXiv Detail & Related papers (2026-01-27T06:30:32Z)
- Boosting Multi-View Indoor 3D Object Detection via Adaptive 3D Volume Construction [10.569056109735735]
This work presents SGCDet, a novel multi-view indoor 3D object detection framework based on adaptive 3D volume construction. We introduce a geometry and context aware aggregation module to integrate geometric and contextual information within adaptive regions in each image. We show that SGCDet achieves state-of-the-art performance on the ScanNet, ScanNet200 and ARKitScenes datasets.
arXiv Detail & Related papers (2025-07-24T11:58:01Z)
- econSG: Efficient and Multi-view Consistent Open-Vocabulary 3D Semantic Gaussians [56.85804719947]
We propose econSG for open-vocabulary semantic segmentation with 3DGS. Our econSG shows state-of-the-art performance on four benchmark datasets compared to the existing methods.
arXiv Detail & Related papers (2025-04-08T13:12:31Z)
- Enhanced Cross-modal 3D Retrieval via Tri-modal Reconstruction [4.820576346277399]
Cross-modal 3D retrieval is a critical yet challenging task, aiming to achieve bi-directional retrieval between 3D and text modalities. We propose to adopt multi-view images and point clouds to jointly represent 3D shapes, facilitating tri-modal alignment. Our method significantly outperforms previous state-of-the-art methods in both shape-to-text and text-to-shape retrieval tasks.
arXiv Detail & Related papers (2025-04-02T08:29:42Z)
- Hierarchical Cross-Modal Alignment for Open-Vocabulary 3D Object Detection [45.68105299990119]
Open-vocabulary 3D object detection (OV-3DOD) aims at localizing and classifying novel objects beyond closed sets. We propose a hierarchical framework, named HCMA, to simultaneously learn local object and global scene information for OV-3DOD.
arXiv Detail & Related papers (2025-03-10T17:55:22Z)
- Escaping Plato's Cave: Towards the Alignment of 3D and Text Latent Spaces [52.237827968294766]
We show that naive post-training feature alignment of uni-modal text and 3D encoders results in limited performance. We then focus on extracting subspaces of the corresponding feature spaces and discover that by projecting learned representations onto well-chosen lower-dimensional subspaces the quality of alignment becomes significantly higher. Ours is the first work that helps to establish a baseline for post-training alignment of 3D uni-modal and text feature spaces.
arXiv Detail & Related papers (2025-03-07T09:51:56Z)
- GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency [50.11520458252128]
Existing 3D affordance learning methods struggle with generalization and robustness due to limited annotated data. We propose GEAL, a novel framework designed to enhance the generalization and robustness of 3D affordance learning by leveraging large-scale pre-trained 2D models. GEAL consistently outperforms existing methods across seen and novel object categories, as well as corrupted data.
arXiv Detail & Related papers (2024-12-12T17:59:03Z)
- SeMv-3D: Towards Concurrency of Semantic and Multi-view Consistency in General Text-to-3D Generation [122.47961178994456]
SeMv-3D is a novel framework that jointly enhances semantic alignment and multi-view consistency in GT23D generation. At its core, we introduce Triplane Prior Learning (TPL), which effectively learns triplane priors. We also present Prior-based Semantic Aligning in Triplanes (SAT), which enables consistent any-view synthesis.
arXiv Detail & Related papers (2024-10-10T07:02:06Z)
- MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds MMScan, the largest multi-modal 3D scene dataset and benchmark to date with hierarchical grounded language annotations. The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z)
- ALSTER: A Local Spatio-Temporal Expert for Online 3D Semantic Reconstruction [62.599588577671796]
We propose an online 3D semantic segmentation method that incrementally reconstructs a 3D semantic map from a stream of RGB-D frames.
Unlike offline methods, ours is directly applicable to scenarios with real-time constraints, such as robotics or mixed reality.
arXiv Detail & Related papers (2023-11-29T20:30:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.