ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail
- URL: http://arxiv.org/abs/2503.17044v1
- Date: Fri, 21 Mar 2025 11:00:12 GMT
- Title: ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail
- Authors: Chandan Yeshwanth, David Rozenberszki, Angela Dai
- Abstract summary: We present ExCap3D, an expressive 3D captioning model which takes as input a 3D scan. For each detected object in the scan, ExCap3D generates a fine-grained collective description of the parts of the object. Our experiments show that the object- and part-level-of-detail captions generated by ExCap3D are of higher quality than those produced by state-of-the-art methods.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating text descriptions of objects in 3D indoor scenes is an important building block of embodied understanding. Existing methods do this by describing objects at a single level of detail, which often does not capture fine-grained details such as varying textures, materials, and shapes of the parts of objects. We propose the task of expressive 3D captioning: given an input 3D scene, describe objects at multiple levels of detail: a high-level object description, and a low-level description of the properties of its parts. To produce such captions, we present ExCap3D, an expressive 3D captioning model which takes as input a 3D scan, and for each detected object in the scan, generates a fine-grained collective description of the parts of the object, along with an object-level description conditioned on the part-level description. We design ExCap3D to encourage semantic consistency between the generated text descriptions, as well as textual similarity in the latent space, to further increase the quality of the generated captions. To enable this task, we generated the ExCap3D Dataset by leveraging a visual-language model (VLM) for multi-view captioning. The ExCap3D Dataset contains captions on the ScanNet++ dataset with varying levels of detail, comprising 190k text descriptions of 34k 3D objects in 947 indoor scenes. Our experiments show that the object- and part-level-of-detail captions generated by ExCap3D are of higher quality than those produced by state-of-the-art methods, with CIDEr score improvements of 17% and 124% for object- and part-level details, respectively. Our code, dataset and models will be made publicly available.
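The abstract describes a two-stage flow: detect objects in the scan, generate a part-level caption per object, then generate an object-level caption conditioned on that part-level caption, with a consistency term in the text latent space. Below is a minimal Python sketch of that flow based only on the abstract; the class and function names (detector, part_captioner, object_captioner, text_encoder) are illustrative assumptions, not the authors' actual API.

```python
# Hypothetical sketch of ExCap3D's two-level captioning, inferred from the abstract.
# All module names and signatures below are assumptions for illustration only.
import torch
import torch.nn.functional as F

def caption_scene(scan_points, detector, part_captioner, object_captioner, text_encoder):
    """Detect objects, caption their parts, then caption each object
    conditioned on its part-level description."""
    results = []
    for obj_feat in detector(scan_points):           # per-object 3D features (assumed interface)
        # Low level: a collective description of the object's parts (textures, materials, shapes).
        part_text = part_captioner(obj_feat)
        # High level: an object description conditioned on the part-level caption.
        obj_text = object_captioner(obj_feat, part_text)
        # Latent-space similarity term encouraging consistency between the two captions.
        consistency_loss = 1.0 - F.cosine_similarity(
            text_encoder(part_text), text_encoder(obj_text), dim=-1
        )
        results.append({
            "object_caption": obj_text,
            "part_caption": part_text,
            "consistency_loss": consistency_loss,
        })
    return results
```

This is only a sketch of the conditioning and consistency ideas named in the abstract; the paper's actual architecture, losses, and training procedure may differ.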
Related papers
- View Selection for 3D Captioning via Diffusion Ranking [54.78058803763221]
The Cap3D method renders 3D objects into 2D views for captioning using pre-trained models.
Some rendered views of 3D objects are atypical, deviating from the training data of standard image captioning models and causing hallucinations.
We present DiffuRank, a method that leverages a pre-trained text-to-3D model to assess the alignment between 3D objects and their 2D rendered views.
arXiv Detail & Related papers (2024-04-11T17:58:11Z)
- Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
- TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes [67.5351491691866]
We present a novel framework, dubbed TeMO, to parse multi-object 3D scenes and edit their styles.
Our method can synthesize high-quality stylized content and outperform the existing methods over a wide range of multi-object 3D meshes.
arXiv Detail & Related papers (2023-12-07T12:10:05Z)
- Explore and Tell: Embodied Visual Captioning in 3D Environments [83.00553567094998]
In real-world scenarios, a single image may not offer a good viewpoint, hindering fine-grained scene understanding.
We propose a novel task called Embodied Captioning, which equips visual captioning models with navigation capabilities.
We propose a Cascade Embodied Captioning model (CaBOT), which comprises a navigator and a captioner, to tackle this task.
arXiv Detail & Related papers (2023-08-21T03:46:04Z)
- Scalable 3D Captioning with Pretrained Models [63.16604472745202]
Cap3D is an automatic approach for generating descriptive text for 3D objects.
We apply Cap3D to the recently introduced large-scale 3D dataset, Objaverse.
Our evaluation, conducted using 41k human annotations from the same dataset, demonstrates that Cap3D surpasses human descriptions in terms of quality, cost, and speed.
arXiv Detail & Related papers (2023-06-12T17:59:03Z)
- 3D-TOGO: Towards Text-Guided Cross-Category 3D Object Generation [107.46972849241168]
The 3D-TOGO model generates 3D objects in the form of neural radiance fields with good textures.
Experiments on the largest 3D object dataset (i.e., ABO) verify that 3D-TOGO can better generate high-quality 3D objects.
arXiv Detail & Related papers (2022-12-02T11:31:49Z)
- Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds [20.172702468478057]
Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding.
We propose a transformer-based encoder-decoder architecture, namely SpaCap3D, to transform objects into descriptions.
Our proposed SpaCap3D outperforms the baseline method Scan2Cap by 4.94% and 9.61% in CIDEr@0.5IoU on two benchmark datasets, respectively.
arXiv Detail & Related papers (2022-04-22T13:07:37Z)
- Scan2Cap: Context-aware Dense Captioning in RGB-D Scans [10.688467522949082]
We introduce the task of dense captioning in 3D scans from commodity RGB-D sensors.
We propose Scan2Cap, an end-to-end trained method, to detect objects in the input scene and describe them in natural language.
Our method can effectively localize and describe 3D objects in scenes from the ScanRefer dataset.
arXiv Detail & Related papers (2020-12-03T19:00:05Z)