3D CoCa v2: Contrastive Learners with Test-Time Search for Generalizable Spatial Intelligence
- URL: http://arxiv.org/abs/2601.06496v1
- Date: Sat, 10 Jan 2026 09:13:10 GMT
- Title: 3D CoCa v2: Contrastive Learners with Test-Time Search for Generalizable Spatial Intelligence
- Authors: Hao Tang, Ting Huang, Zeyu Zhang,
- Abstract summary: 3D captioning aims to describe 3D scenes in natural language.
We propose 3D CoCa v2, a generalizable 3D captioning framework.
We show improvements over 3D CoCa of +1.50 CIDEr@0.5IoU on ScanRefer and +1.61 CIDEr@0.5IoU on Nr3D.
- Score: 15.064925965953122
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Spatial intelligence refers to the ability to perceive, reason about, and describe objects and their relationships within three-dimensional environments, forming a foundation for embodied perception and scene understanding. 3D captioning aims to describe 3D scenes in natural language; however, it remains challenging due to the sparsity and irregularity of point clouds and, more critically, the weak grounding and limited out-of-distribution (OOD) generalization of existing captioners across drastically different environments, including indoor and outdoor 3D scenes. To address this challenge, we propose 3D CoCa v2, a generalizable 3D captioning framework that unifies contrastive vision-language learning with 3D caption generation and further improves robustness via test-time search (TTS) without updating the captioner parameters. 3D CoCa v2 builds on a frozen CLIP-based semantic prior, a spatially-aware 3D scene encoder for geometry, and a multimodal decoder jointly optimized with contrastive and captioning objectives, avoiding external detectors or handcrafted proposals. At inference, TTS produces diverse caption candidates and performs reward-guided selection using a compact scene summary. Experiments show improvements over 3D CoCa of +1.50 CIDEr@0.5IoU on ScanRefer and +1.61 CIDEr@0.5IoU on Nr3D, and +3.8 CIDEr@0.25 in zero-shot OOD evaluation on TOD3Cap. Code will be released at https://github.com/AIGeeksGroup/3DCoCav2.
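To make the test-time search (TTS) step concrete, below is a minimal sketch of reward-guided candidate selection. It assumes a captioner exposing a stochastic `sample` method, a frozen CLIP-style text encoder, and a compact textual scene summary; all identifiers (`captioner.sample`, `text_encoder`, `tts_select`) are illustrative, not taken from the released code.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of reward-guided test-time search (TTS), assuming:
#  - captioner.sample(...) draws one diverse caption candidate (e.g. nucleus sampling),
#  - text_encoder(...) is a frozen CLIP-style text encoder returning embeddings,
#  - scene_summary is a compact textual summary of the 3D scene.
# Names and shapes are illustrative; the paper's released code may differ.

@torch.no_grad()  # no captioner parameters are updated at inference
def tts_select(captioner, text_encoder, scene_feats, scene_summary, num_candidates=16):
    # 1. Generate diverse caption candidates for the encoded scene.
    candidates = [captioner.sample(scene_feats) for _ in range(num_candidates)]

    # 2. Embed the scene summary and every candidate in a shared text space.
    summary_emb = F.normalize(text_encoder([scene_summary]), dim=-1)  # (1, d)
    cand_embs = F.normalize(text_encoder(candidates), dim=-1)         # (k, d)

    # 3. Reward each candidate by cosine similarity to the scene summary
    #    and return the highest-scoring caption.
    rewards = cand_embs @ summary_emb.T                               # (k, 1)
    return candidates[rewards.argmax().item()]
```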
Related papers
- LabelAny3D: Label Any Object 3D in the Wild [18.044792932630752]
COCO3D is a new benchmark for open-vocabulary monocular 3D detection, derived from the MS-COCO dataset.
We introduce LabelAny3D, an analysis-by-synthesis framework that reconstructs holistic 3D scenes from 2D images to efficiently produce high-quality 3D bounding box annotations.
arXiv Detail & Related papers (2026-01-04T22:03:45Z)
- LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight [105.9472902251177]
We present a VLM-native recipe that casts 3D detection as a next-token prediction problem.
Our model achieves state-of-the-art results with 49.89 AP_3D, surpassing the previous best by an absolute +15.51.
arXiv Detail & Related papers (2025-11-25T18:59:45Z)
- Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs [72.11701578308804]
This paper categorizes recent 3D Vision-Language Models into 3D object-centric, 2D image-based, and 3D scene-centric approaches.
Despite their architectural similarity to 2D counterparts, 3D scene-centric VLMs exhibit lower performance than the latest 3D object-centric and 2D image-based approaches.
Our investigation suggests that while these models possess cross-modal alignment capabilities, they tend to over-rely on linguistic cues and overfit to frequent answer distributions.
arXiv Detail & Related papers (2025-06-05T17:56:12Z)
- 3D CoCa: Contrastive Learners are 3D Captioners [10.132943642539828]
3D captioning, which aims to describe the content of 3D scenes in natural language, remains highly challenging.
We propose 3D CoCa, a novel unified framework that seamlessly combines contrastive vision-language learning with 3D caption generation (see the joint-objective sketch after this list).
arXiv Detail & Related papers (2025-04-13T11:10:47Z)
- View Selection for 3D Captioning via Diffusion Ranking [54.78058803763221]
The Cap3D method renders 3D objects into 2D views for captioning with pre-trained models.
Some rendered views of 3D objects are atypical, deviating from the training data of standard image captioning models and causing hallucinations.
We present DiffuRank, a method that leverages a pre-trained text-to-3D model to assess the alignment between 3D objects and their 2D rendered views.
arXiv Detail & Related papers (2024-04-11T17:58:11Z)
- Model2Scene: Learning 3D Scene Representation via Contrastive Language-CAD Models Pre-training [105.3421541518582]
Current successful methods for 3D scene perception rely on large-scale annotated point clouds.
We propose Model2Scene, a novel paradigm that learns free 3D scene representations from Computer-Aided Design (CAD) models and language.
Model2Scene yields impressive label-free 3D object salient detection with an average mAP of 46.08% and 55.49% on the ScanNet and S3DIS datasets, respectively.
arXiv Detail & Related papers (2023-09-29T03:51:26Z)
- Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing the object in a 3D scene that is referred to by a natural-language description.
We propose a dense 3D grounding network, featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z)
- Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore.
We propose a novel 3D pre-training Vision-Language method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations.
arXiv Detail & Related papers (2023-06-04T11:08:53Z)
- Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds [20.172702468478057]
Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding.
We propose a transformer-based encoder-decoder architecture, namely SpaCap3D, to transform objects into descriptions.
Our proposed SpaCap3D outperforms the baseline method Scan2Cap by 4.94% and 9.61% in CIDEr@0.5IoU on ScanRefer and Nr3D, respectively.
arXiv Detail & Related papers (2022-04-22T13:07:37Z)
- D3Net: A Speaker-Listener Architecture for Semi-supervised Dense Captioning and Visual Grounding in RGB-D Scans [12.217810313293883]
We present D3Net, an end-to-end neural speaker-listener architecture that can detect, describe and discriminate.
Our method outperforms SOTA methods in both tasks on the ScanRefer dataset.
arXiv Detail & Related papers (2021-12-02T19:00:06Z)
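Several entries above, 3D CoCa in particular and Multi-CLIP in spirit, hinge on jointly optimizing a contrastive alignment objective with a caption-generation objective. The sketch below shows one common way such a CoCa-style joint loss is written; it is a hypothetical composite under the stated shape assumptions, not the exact recipe of any paper listed here.

```python
import torch
import torch.nn.functional as F

def coca_style_loss(scene_emb, text_emb, caption_logits, caption_tokens,
                    temperature=0.07, caption_weight=1.0):
    """Sketch of a CoCa-style joint objective: symmetric InfoNCE alignment
    between scene and text embeddings, plus teacher-forced captioning
    cross-entropy. Shapes and weighting are assumptions, not the paper's recipe.

    scene_emb:      (B, d)    pooled 3D scene embeddings
    text_emb:       (B, d)    pooled caption embeddings
    caption_logits: (B, T, V) decoder logits for each caption token
    caption_tokens: (B, T)    ground-truth next-token ids
    """
    # Symmetric InfoNCE: matching scene/text pairs lie on the diagonal.
    scene_emb = F.normalize(scene_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = scene_emb @ text_emb.T / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.T, targets)) / 2

    # Captioning loss: standard next-token cross-entropy over all positions.
    captioning = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_tokens.reshape(-1))

    return contrastive + caption_weight * captioning
```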
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.