Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds
- URL: http://arxiv.org/abs/2204.10688v1
- Date: Fri, 22 Apr 2022 13:07:37 GMT
- Title: Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds
- Authors: Heng Wang, Chaoyi Zhang, Jianhui Yu, Weidong Cai
- Abstract summary: Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding.
We propose a transformer-based encoder-decoder architecture, namely SpaCap3D, to transform objects into descriptions.
Our proposed SpaCap3D outperforms the baseline method Scan2Cap by 4.94% and 9.61% in CIDEr@0.5IoU on ScanRefer and ReferIt3D, respectively.
- Score: 20.172702468478057
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dense captioning in 3D point clouds is an emerging vision-and-language task
involving object-level 3D scene understanding. Apart from coarse semantic class
prediction and bounding box regression as in traditional 3D object detection,
3D dense captioning aims to produce a finer, instance-level natural language
description of the visual appearance and spatial relations of each scene
object of interest. To detect and describe objects in a scene,
following the spirit of neural machine translation, we propose a
transformer-based encoder-decoder architecture, namely SpaCap3D, to transform
objects into descriptions, where we especially investigate the relative
spatiality of objects in 3D scenes and design a spatiality-guided encoder via a
token-to-token spatial relation learning objective and an object-centric
decoder for precise and spatiality-enhanced object caption generation.
Evaluated on two benchmark datasets, ScanRefer and ReferIt3D, our proposed
SpaCap3D outperforms the baseline method Scan2Cap by 4.94% and 9.61% in
CIDEr@0.5IoU, respectively. Our project page with source code and supplementary
files is available at https://SpaCap3D.github.io/.
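To make the architecture described in the abstract concrete, below is a minimal PyTorch sketch of how a spatiality-guided encoder with a token-to-token relation head and an object-centric caption decoder could be wired together. The layer counts, dimensions, relation vocabulary size, and the way the target object token conditions the decoder are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only; module sizes and the conditioning scheme are
# assumptions, not SpaCap3D's exact design.
import torch
import torch.nn as nn

class SpatialityGuidedCaptioner(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=4,
                 vocab_size=3000, n_relations=6):
        super().__init__()
        # Encoder: self-attention over object proposal tokens from a 3D detector.
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Auxiliary token-to-token head: predicts a spatial relation class
        # (e.g., left/right/above/below) for every ordered pair of tokens.
        self.relation_head = nn.Linear(2 * d_model, n_relations)
        # Object-centric decoder: cross-attends to the encoded scene tokens.
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.word_embed = nn.Embedding(vocab_size, d_model)
        self.word_head = nn.Linear(d_model, vocab_size)

    def forward(self, object_tokens, target_idx, caption_in):
        # object_tokens: (B, N, d) proposal features; caption_in: (B, T) word ids.
        memory = self.encoder(object_tokens)
        B, N, d = memory.shape
        # Pairwise token features for the spatial relation learning objective.
        pairs = torch.cat([memory.unsqueeze(2).expand(B, N, N, d),
                           memory.unsqueeze(1).expand(B, N, N, d)], dim=-1)
        rel_logits = self.relation_head(pairs)               # (B, N, N, n_rel)
        # Prepend the target object's token so generation is object-centric.
        tgt_token = memory[torch.arange(B), target_idx].unsqueeze(1)
        tgt = torch.cat([tgt_token, self.word_embed(caption_in)], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(
            tgt.size(1)).to(tgt.device)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.word_head(out[:, 1:]), rel_logits        # per-step word logits
```

Training such a model would combine the usual cross-entropy caption loss with a classification loss on rel_logits against pairwise spatial labels derived from box coordinates, which is the rough shape of a token-to-token spatial relation objective.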
Related papers
- Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
- Object2Scene: Putting Objects in Context for Open-Vocabulary 3D Detection [24.871590175483096]
Point cloud-based open-vocabulary 3D object detection aims to detect 3D categories that do not have ground-truth annotations in the training set.
Previous approaches leverage large-scale richly-annotated image datasets as a bridge between 3D and category semantics.
We propose Object2Scene, the first approach that leverages large-scale large-vocabulary 3D object datasets to augment existing 3D scene datasets for open-vocabulary 3D object detection.
arXiv Detail & Related papers (2023-09-18T03:31:53Z)
- Generating Visual Spatial Description via Holistic 3D Scene Understanding [88.99773815159345]
Visual spatial description (VSD) aims to generate texts that describe the spatial relations of the given objects within images.
With an external 3D scene extractor, we obtain the 3D objects and scene features for input images.
We construct a target object-centered 3D spatial scene graph (Go3D-S2G) to model the spatial semantics of target objects within the holistic 3D scenes.
arXiv Detail & Related papers (2023-05-19T15:53:56Z)
- CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose the Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
arXiv Detail & Related papers (2022-09-13T05:26:09Z)
- HyperDet3D: Learning a Scene-conditioned 3D Object Detector [154.84798451437032]
We propose HyperDet3D to explore scene-conditioned prior knowledge for 3D object detection.
Our HyperDet3D achieves state-of-the-art results on the 3D object detection benchmark of the ScanNet and SUN RGB-D datasets.
arXiv Detail & Related papers (2022-04-12T07:57:58Z)
- Point2Seq: Detecting 3D Objects as Sequences [58.63662049729309]
We present a simple and effective framework, named Point2Seq, for 3D object detection from point clouds.
We view each 3D object as a sequence of words and reformulate the 3D object detection task as decoding words from 3D scenes in an auto-regressive manner.
arXiv Detail & Related papers (2022-03-25T00:20:31Z)
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)
- Scan2Cap: Context-aware Dense Captioning in RGB-D Scans [10.688467522949082]
We introduce the task of dense captioning in 3D scans from commodity RGB-D sensors.
We propose Scan2Cap, an end-to-end trained method, to detect objects in the input scene and describe them in natural language.
Our method can effectively localize and describe 3D objects in scenes from the ScanRefer dataset.
arXiv Detail & Related papers (2020-12-03T19:00:05Z)
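For readers comparing the CIDEr@0.5IoU numbers above, the metric couples captioning quality with detection quality: a generated caption earns credit only when its predicted box overlaps the ground-truth box with an IoU of at least 0.5. A hedged sketch of this protocol follows; matching predictions to ground truth by a shared object id, and the iou/cider_score callables, are simplified placeholders for the standard implementations.

```python
# Simplified sketch of a CIDEr@0.5IoU-style score; real evaluations match
# detections to ground-truth boxes rather than assuming shared object ids.
def caption_metric_at_iou(predictions, ground_truth, cider_score, iou, k=0.5):
    """predictions / ground_truth map object_id -> (box, caption)."""
    total = 0.0
    for obj_id, (gt_box, gt_caption) in ground_truth.items():
        pred = predictions.get(obj_id)
        if pred is not None and iou(pred[0], gt_box) >= k:
            total += cider_score(pred[1], gt_caption)  # matched: full credit
        # missed or poorly localized objects contribute zero
    return total / len(ground_truth)
```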
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.