Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End
3D Dense Captioning
- URL: http://arxiv.org/abs/2309.02999v1
- Date: Wed, 6 Sep 2023 13:43:27 GMT
- Title: Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End
3D Dense Captioning
- Authors: Sijin Chen, Hongyuan Zhu, Mingsheng Li, Xin Chen, Peng Guo, Yinjie
Lei, Gang Yu, Taihao Li, and Tao Chen
- Abstract summary: 3D dense captioning requires a model to translate its understanding of an input 3D scene into several captions associated with different object regions.
Existing methods adopt a sophisticated "detect-then-describe" pipeline, which builds explicit relation modules upon a 3D detector with numerous hand-crafted components.
We first propose Vote2Cap-DETR, a simple-yet-effective transformer framework that decouples the decoding process of caption generation and object localization through parallel decoding.
- Score: 37.44886367452029
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 3D dense captioning requires a model to translate its understanding of an
input 3D scene into several captions associated with different object regions.
Existing methods adopt a sophisticated "detect-then-describe" pipeline, which
builds explicit relation modules upon a 3D detector with numerous hand-crafted
components. While these methods have achieved initial success, the cascade
pipeline tends to accumulate errors because of duplicated and inaccurate box
estimations and messy 3D scenes. In this paper, we first propose Vote2Cap-DETR,
a simple-yet-effective transformer framework that decouples the decoding
process of caption generation and object localization through parallel
decoding. Moreover, we argue that object localization and description
generation require different levels of scene understanding, which could be
challenging for a shared set of queries to capture. To this end, we propose an
advanced version, Vote2Cap-DETR++, which decouples the queries into
localization and caption queries to capture task-specific features.
Additionally, we introduce the iterative spatial refinement strategy to vote
queries for faster convergence and better localization performance. We also
inject additional spatial information into the caption head for more accurate
descriptions. Without bells and whistles, extensive experiments on two commonly
used datasets, ScanRefer and Nr3D, demonstrate Vote2Cap-DETR and
Vote2Cap-DETR++ surpass conventional "detect-then-describe" methods by a large
margin. Code will be made available at
https://github.com/ch3cook-fdu/Vote2Cap-DETR.
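The abstract describes two architectural ideas: task-specific localization and caption query sets decoded in parallel, and iterative spatial refinement that "votes" query positions toward object centers. The authors' actual implementation is at the GitHub link above; the following is only a rough illustrative sketch of those two ideas in a PyTorch-style decoder, where all module names, dimensions, and the single-step word head are hypothetical assumptions rather than the paper's design.

```python
import torch
import torch.nn as nn


class DecoupledParallelDecoder(nn.Module):
    """Illustrative sketch: decoupled localization/caption queries with parallel decoding."""

    def __init__(self, d_model=256, num_queries=256, num_layers=4, vocab_size=3000):
        super().__init__()
        # Task-specific query sets (the "decoupling" described in the abstract).
        self.loc_queries = nn.Embedding(num_queries, d_model)
        self.cap_queries = nn.Embedding(num_queries, d_model)
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(num_layers)
        )
        # Per-layer offset heads: iterative spatial refinement of query positions.
        self.offset_heads = nn.ModuleList(
            nn.Linear(d_model, 3) for _ in range(num_layers)
        )
        self.size_head = nn.Linear(d_model, 3)          # 3D box size
        self.word_head = nn.Linear(d_model, vocab_size)  # toy stand-in for a caption decoder

    def forward(self, scene_tokens, query_xyz):
        # scene_tokens: (B, N, C) encoder output; query_xyz: (B, Q, 3) seed positions.
        batch = scene_tokens.size(0)
        loc_q = self.loc_queries.weight.unsqueeze(0).expand(batch, -1, -1)
        cap_q = self.cap_queries.weight.unsqueeze(0).expand(batch, -1, -1)
        for layer, offset_head in zip(self.layers, self.offset_heads):
            # Both query sets attend to the same scene tokens in parallel, so
            # captioning never waits on (or inherits errors from) a detection cascade.
            loc_q = layer(loc_q, scene_tokens)
            cap_q = layer(cap_q, scene_tokens)
            # "Vote" each localization query toward an object center, layer by layer.
            query_xyz = query_xyz + offset_head(loc_q)
        boxes = torch.cat([query_xyz, self.size_head(loc_q)], dim=-1)  # (B, Q, 6)
        word_logits = self.word_head(cap_q)
        return boxes, word_logits


# Toy usage with random tensors standing in for real encoder features.
scene_tokens = torch.randn(2, 1024, 256)
query_xyz = torch.rand(2, 256, 3)
boxes, word_logits = DecoupledParallelDecoder()(scene_tokens, query_xyz)
print(boxes.shape, word_logits.shape)  # torch.Size([2, 256, 6]) torch.Size([2, 256, 3000])
```

In the paper itself the caption queries feed a full caption decoder that generates a sentence per object (rather than a single word head), and the refined spatial information is also passed to the caption head; the sketch only mirrors the query decoupling and per-layer refinement described above.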
Related papers
- See It All: Contextualized Late Aggregation for 3D Dense Captioning [38.14179122810755]
3D dense captioning is a task to localize objects in a 3D scene and generate descriptive sentences for each object.
Recent approaches in 3D dense captioning have adopted transformer encoder-decoder frameworks from object detection to build an end-to-end pipeline without hand-crafted components.
We introduce SIA (See-It-All), a transformer pipeline that engages in 3D dense captioning with a novel paradigm called late aggregation.
arXiv Detail & Related papers (2024-08-14T16:19:18Z)
- View Selection for 3D Captioning via Diffusion Ranking [54.78058803763221]
The Cap3D method renders 3D objects into 2D views for captioning using pre-trained models.
Some rendered views of 3D objects are atypical, deviating from the training data of standard image captioning models and causing hallucinations.
We present DiffuRank, a method that leverages a pre-trained text-to-3D model to assess the alignment between 3D objects and their 2D rendered views.
arXiv Detail & Related papers (2024-04-11T17:58:11Z)
- Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
- End-to-End 3D Dense Captioning with Vote2Cap-DETR [45.18715911775949]
3D dense captioning aims to generate multiple captions localized with their associated object regions.
We propose Vote2Cap-DETR, a simple-yet-effective transformer framework based on the recently popular DEtection TRansformer (DETR).
Our framework is based on a full transformer encoder-decoder architecture with a learnable vote query driven object decoder, and a caption decoder that produces the dense captions in a set-prediction manner.
arXiv Detail & Related papers (2023-01-06T13:46:45Z)
- Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds [20.172702468478057]
Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding.
We propose a transformer-based encoder-decoder architecture, namely SpaCap3D, to transform objects into descriptions.
Our proposed SpaCap3D outperforms the baseline method Scan2Cap by 4.94% and 9.61% in CIDEr@0.5IoU on its two evaluation benchmarks, respectively.
arXiv Detail & Related papers (2022-04-22T13:07:37Z)
- MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection [61.89277940084792]
We introduce the first DETR framework for Monocular DEtection with a depth-guided TRansformer, named MonoDETR.
We formulate 3D object candidates as learnable queries and propose a depth-guided decoder to conduct object-scene depth interactions.
On KITTI benchmark with monocular images as input, MonoDETR achieves state-of-the-art performance and requires no extra dense depth annotations.
arXiv Detail & Related papers (2022-03-24T19:28:54Z)
- End-to-End Dense Video Captioning with Parallel Decoding [53.34238344647624]
We propose a simple yet effective framework for end-to-end dense video captioning with parallel decoding (PDVC).
PDVC precisely segments the video into a number of event pieces under the holistic understanding of the video content.
Experiments on ActivityNet Captions and YouCook2 show that PDVC is capable of producing high-quality captioning results.
arXiv Detail & Related papers (2021-08-17T17:39:15Z)
- Scan2Cap: Context-aware Dense Captioning in RGB-D Scans [10.688467522949082]
We introduce the task of dense captioning in 3D scans from commodity RGB-D sensors.
We propose Scan2Cap, an end-to-end trained method, to detect objects in the input scene and describe them in natural language.
Our method can effectively localize and describe 3D objects in scenes from the ScanRefer dataset.
arXiv Detail & Related papers (2020-12-03T19:00:05Z)