3D CoCa v2: Contrastive Learners with Test-Time Search for Generalizable Spatial Intelligence
- URL: http://arxiv.org/abs/2601.06496v1
- Date: Sat, 10 Jan 2026 09:13:10 GMT
- Title: 3D CoCa v2: Contrastive Learners with Test-Time Search for Generalizable Spatial Intelligence
- Authors: Hao Tang, Ting Huang, Zeyu Zhang,
- Abstract summary: 3D captioning aims to describe 3D scenes in natural language.
We propose 3D CoCa v2, a generalizable 3D captioning framework.
We show improvements over 3D CoCa of +1.50 CIDEr@0.5IoU on ScanRefer and +1.61 CIDEr@0.5IoU on Nr3D.
- Score: 15.064925965953122
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Spatial intelligence refers to the ability to perceive, reason about, and describe objects and their relationships within three-dimensional environments, forming a foundation for embodied perception and scene understanding. 3D captioning aims to describe 3D scenes in natural language; however, it remains challenging due to the sparsity and irregularity of point clouds and, more critically, the weak grounding and limited out-of-distribution (OOD) generalization of existing captioners across drastically different environments, including indoor and outdoor 3D scenes. To address this challenge, we propose 3D CoCa v2, a generalizable 3D captioning framework that unifies contrastive vision-language learning with 3D caption generation and further improves robustness via test-time search (TTS) without updating the captioner parameters. 3D CoCa v2 builds on a frozen CLIP-based semantic prior, a spatially-aware 3D scene encoder for geometry, and a multimodal decoder jointly optimized with contrastive and captioning objectives, avoiding external detectors or handcrafted proposals. At inference, TTS produces diverse caption candidates and performs reward-guided selection using a compact scene summary. Experiments show improvements over 3D CoCa of +1.50 CIDEr@0.5IoU on ScanRefer and +1.61 CIDEr@0.5IoU on Nr3D, and +3.8 CIDEr@0.25 in zero-shot OOD evaluation on TOD3Cap. Code will be released at https://github.com/AIGeeksGroup/3DCoCav2.
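To make the test-time search (TTS) step concrete, below is a minimal sketch of reward-guided candidate selection. It assumes a captioner exposing a stochastic `sample` method, a frozen CLIP-style text encoder, and a compact textual scene summary; all identifiers (`captioner.sample`, `text_encoder`, `tts_select`) are illustrative, not taken from the released code.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of reward-guided test-time search (TTS), assuming:
#  - captioner.sample(...) draws one diverse caption candidate (e.g. nucleus sampling),
#  - text_encoder(...) is a frozen CLIP-style text encoder returning embeddings,
#  - scene_summary is a compact textual summary of the 3D scene.
# Names and shapes are illustrative; the paper's released code may differ.

@torch.no_grad()  # no captioner parameters are updated at inference
def tts_select(captioner, text_encoder, scene_feats, scene_summary, num_candidates=16):
    # 1. Generate diverse caption candidates for the encoded scene.
    candidates = [captioner.sample(scene_feats) for _ in range(num_candidates)]

    # 2. Embed the scene summary and every candidate in a shared text space.
    summary_emb = F.normalize(text_encoder([scene_summary]), dim=-1)  # (1, d)
    cand_embs = F.normalize(text_encoder(candidates), dim=-1)         # (k, d)

    # 3. Reward each candidate by cosine similarity to the scene summary
    #    and return the highest-scoring caption.
    rewards = cand_embs @ summary_emb.T                               # (k, 1)
    return candidates[rewards.argmax().item()]
```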
Related papers
- LabelAny3D: Label Any Object 3D in the Wild [18.044792932630752]
COCO3D is a new benchmark for open-vocabulary monocular 3D detection, derived from the MS-COCO dataset.
We introduce LabelAny3D, an analysis-by-synthesis framework that reconstructs holistic 3D scenes from 2D images to efficiently produce high-quality 3D bounding box annotations.
arXiv Detail & Related papers (2026-01-04T22:03:45Z)
- LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight [105.9472902251177]
We present a VLM-native recipe that casts 3D detection as a next-token prediction problem.
Our model achieves state-of-the-art results with 49.89 AP_3D, surpassing the previous best by an absolute +15.51.
arXiv Detail & Related papers (2025-11-25T18:59:45Z)
- Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs [72.11701578308804]
This paper categorizes recent 3D Vision-Language Models into 3D object-centric, 2D image-based, and 3D scene-centric approaches.
Despite their architectural similarity to 2D counterparts, 3D scene-centric VLMs exhibit lower performance than the latest 3D object-centric and 2D image-based approaches.
Our investigation suggests that while these models possess cross-modal alignment capabilities, they tend to over-rely on linguistic cues and overfit to frequent answer distributions.
arXiv Detail & Related papers (2025-06-05T17:56:12Z)
- 3D CoCa: Contrastive Learners are 3D Captioners [10.132943642539828]
3D captioning, which aims to describe the content of 3D scenes in natural language, remains highly challenging.
We propose 3D CoCa, a novel unified framework that seamlessly combines contrastive vision-language learning with 3D caption generation (see the joint-objective sketch after this list).
arXiv Detail & Related papers (2025-04-13T11:10:47Z)
- View Selection for 3D Captioning via Diffusion Ranking [54.78058803763221]
The Cap3D method renders 3D objects into 2D views for captioning with pre-trained models.
Some rendered views of 3D objects are atypical, deviating from the training data of standard image captioning models and causing hallucinations.
We present DiffuRank, a method that leverages a pre-trained text-to-3D model to assess the alignment between 3D objects and their 2D rendered views.
arXiv Detail & Related papers (2024-04-11T17:58:11Z)
- Model2Scene: Learning 3D Scene Representation via Contrastive Language-CAD Models Pre-training [105.3421541518582]
Current successful methods for 3D scene perception rely on large-scale annotated point clouds.
We propose Model2Scene, a novel paradigm that learns free 3D scene representations from Computer-Aided Design (CAD) models and language.
Model2Scene yields impressive label-free 3D object salient detection with an average mAP of 46.08% and 55.49% on the ScanNet and S3DIS datasets, respectively.
arXiv Detail & Related papers (2023-09-29T03:51:26Z)
- Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing the object in a 3D scene that is referred to by a natural-language description.
We propose a dense 3D grounding network, featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z)
- Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore.
We propose a novel 3D pre-training Vision-Language method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations.
arXiv Detail & Related papers (2023-06-04T11:08:53Z)
- Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds [20.172702468478057]
Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding.
We propose a transformer-based encoder-decoder architecture, namely SpaCap3D, to transform objects into descriptions.
Our proposed SpaCap3D outperforms the baseline method Scan2Cap by 4.94% and 9.61% in CIDEr@0.5IoU on ScanRefer and Nr3D, respectively.
arXiv Detail & Related papers (2022-04-22T13:07:37Z)
- D3Net: A Speaker-Listener Architecture for Semi-supervised Dense Captioning and Visual Grounding in RGB-D Scans [12.217810313293883]
We present D3Net, an end-to-end neural speaker-listener architecture that can detect, describe and discriminate.
Our method outperforms SOTA methods in both tasks on the ScanRefer dataset.
arXiv Detail & Related papers (2021-12-02T19:00:06Z)
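Several entries above, 3D CoCa in particular and Multi-CLIP in spirit, hinge on jointly optimizing a contrastive alignment objective with a caption-generation objective. The sketch below shows one common way such a CoCa-style joint loss is written; it is a hypothetical composite under the stated shape assumptions, not the exact recipe of any paper listed here.

```python
import torch
import torch.nn.functional as F

def coca_style_loss(scene_emb, text_emb, caption_logits, caption_tokens,
                    temperature=0.07, caption_weight=1.0):
    """Sketch of a CoCa-style joint objective: symmetric InfoNCE alignment
    between scene and text embeddings, plus teacher-forced captioning
    cross-entropy. Shapes and weighting are assumptions, not the paper's recipe.

    scene_emb:      (B, d)    pooled 3D scene embeddings
    text_emb:       (B, d)    pooled caption embeddings
    caption_logits: (B, T, V) decoder logits for each caption token
    caption_tokens: (B, T)    ground-truth next-token ids
    """
    # Symmetric InfoNCE: matching scene/text pairs lie on the diagonal.
    scene_emb = F.normalize(scene_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = scene_emb @ text_emb.T / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.T, targets)) / 2

    # Captioning loss: standard next-token cross-entropy over all positions.
    captioning = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_tokens.reshape(-1))

    return contrastive + caption_weight * captioning
```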
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.