End-to-End 3D Dense Captioning with Vote2Cap-DETR
- URL: http://arxiv.org/abs/2301.02508v1
- Date: Fri, 6 Jan 2023 13:46:45 GMT
- Title: End-to-End 3D Dense Captioning with Vote2Cap-DETR
- Authors: Sijin Chen, Hongyuan Zhu, Xin Chen, Yinjie Lei, Tao Chen, Gang YU
- Abstract summary: 3D dense captioning aims to generate multiple captions localized with their associated object regions.
We propose a simple yet effective transformer framework, Vote2Cap-DETR, based on the recently popular DEtection TRansformer (DETR).
Our framework is built on a full transformer encoder-decoder architecture with a learnable, vote-query-driven object decoder and a caption decoder that produces the dense captions in a set-prediction manner.
- Score: 45.18715911775949
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 3D dense captioning aims to generate multiple captions localized with their associated object regions. Existing methods follow a sophisticated "detect-then-describe" pipeline equipped with numerous hand-crafted components. However, these hand-crafted components yield suboptimal performance given the cluttered object spatial and class distributions across different scenes. In this paper, we propose a simple yet effective transformer framework, Vote2Cap-DETR, based on the recently popular DEtection TRansformer (DETR). Compared with prior art, our framework has several appealing advantages: 1) without resorting to numerous hand-crafted components, our method is based on a full transformer encoder-decoder architecture with a learnable, vote-query-driven object decoder and a caption decoder that produces dense captions in a set-prediction manner; 2) in contrast to the two-stage scheme, our method performs detection and captioning in one stage; 3) without bells and whistles, extensive experiments on two commonly used datasets, ScanRefer and Nr3D, demonstrate that Vote2Cap-DETR surpasses the current state of the art by 11.13% and 7.11% in CIDEr@0.5IoU, respectively. Code will be released soon.
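To make the described one-stage design concrete, here is a minimal PyTorch-style sketch of a set-prediction dense captioner in the spirit of the abstract. Everything in it is an illustrative assumption rather than the authors' implementation: the queries are purely learnable (the paper's vote queries also carry spatial votes predicted from the encoded scene), and a single linear layer stands in for the actual transformer caption decoder.

```python
# Minimal sketch of a one-stage, set-prediction 3D dense captioner.
# All module names and sizes are illustrative, not the authors' code.
import torch
import torch.nn as nn

class OneStageDenseCaptioner(nn.Module):
    def __init__(self, d_model=256, num_queries=256, vocab_size=3000, max_len=32):
        super().__init__()
        # Scene encoder: the paper uses a point-cloud backbone plus a transformer
        # encoder; a plain TransformerEncoder stands in for both here.
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=3)
        # Learnable queries driving the object decoder (the paper's vote queries
        # additionally incorporate spatial votes predicted from the scene).
        self.queries = nn.Embedding(num_queries, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.object_decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
        # Parallel heads: one box and one caption per query, matched to ground
        # truth with a set-prediction (Hungarian) loss during training.
        self.box_head = nn.Linear(d_model, 6)                         # center (3) + size (3)
        self.caption_head = nn.Linear(d_model, max_len * vocab_size)  # stand-in for a caption decoder
        self.max_len, self.vocab_size = max_len, vocab_size

    def forward(self, scene_tokens):                 # scene_tokens: (B, N, d_model)
        memory = self.encoder(scene_tokens)
        q = self.queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        hs = self.object_decoder(q, memory)          # (B, num_queries, d_model)
        boxes = self.box_head(hs)                    # (B, num_queries, 6)
        captions = self.caption_head(hs).view(
            hs.size(0), hs.size(1), self.max_len, self.vocab_size)
        return boxes, captions
```

The point of the sketch is the decoding layout: every query yields a box and a caption in parallel, so detection and captioning happen in a single forward pass rather than in a detect-then-describe cascade.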
Related papers
- See It All: Contextualized Late Aggregation for 3D Dense Captioning [38.14179122810755]
3D dense captioning is the task of localizing objects in a 3D scene and generating a descriptive sentence for each object.
Recent approaches in 3D dense captioning have adopted transformer encoder-decoder frameworks from object detection to build an end-to-end pipeline without hand-crafted components.
We introduce SIA (See-It-All), a transformer pipeline that engages in 3D dense captioning with a novel paradigm called late aggregation.
arXiv Detail & Related papers (2024-08-14T16:19:18Z)
- Retrieval Enhanced Zero-Shot Video Captioning [69.96136689829778]
We bridge video and text using three key models: a general video understanding model XCLIP, a general image understanding model CLIP, and a text generation model GPT-2.
To bridge these frozen models, we propose using learnable tokens as a communication medium between the frozen GPT-2 and the frozen XCLIP; a minimal sketch of this prefix-token bridging appears after this list.
Experiments show 4% to 20% improvements on the main metric, CIDEr, over existing state-of-the-art methods.
arXiv Detail & Related papers (2024-05-11T16:22:00Z)
- Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation [34.45033554641476]
Existing automatic captioning methods for visual content face challenges such as lack of detail, hallucinated content, and poor instruction following.
We propose VisualFactChecker (VFC), a flexible training-free pipeline that generates high-fidelity and detailed captions for both 2D images and 3D objects.
VFC consists of three steps: 1) proposal, where image-to-text captioning models propose multiple initial captions; 2) verification, where a large language model (LLM) uses tools such as object detection and VQA models to fact-check the proposed captions. A hedged sketch of this propose-then-verify structure appears after this list.
arXiv Detail & Related papers (2024-04-30T17:55:27Z)
- View Selection for 3D Captioning via Diffusion Ranking [54.78058803763221]
The Cap3D method renders 3D objects into 2D views and captions them with pre-trained models.
Some rendered views of 3D objects are atypical, deviating from the training data of standard image captioning models and causing hallucinations.
We present DiffuRank, a method that leverages a pre-trained text-to-3D model to assess the alignment between 3D objects and their 2D rendered views.
arXiv Detail & Related papers (2024-04-11T17:58:11Z)
- Accurate and Fast Compressed Video Captioning [28.19362369787383]
Existing video captioning approaches typically first sample frames from a decoded video and then run a subsequent captioning process.
We study video captioning from a different perspective, in the compressed domain, which brings multiple advantages over the existing pipeline.
We propose a simple yet effective end-to-end transformer that learns to caption directly from the compressed video.
arXiv Detail & Related papers (2023-09-22T13:43:22Z)
- Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning [37.44886367452029]
3D dense captioning requires a model to translate its understanding of an input 3D scene into several captions associated with different object regions.
Existing methods adopt a sophisticated "detect-then-describe" pipeline, which builds explicit relation modules upon a 3D detector with numerous hand-crafted components.
We first propose Vote2Cap-DETR, a simple-yet-effective transformer framework that decouples the decoding process of caption generation and object localization through parallel decoding.
arXiv Detail & Related papers (2023-09-06T13:43:27Z)
- X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning [71.36623596807122]
3D dense captioning aims to describe individual objects by natural language in 3D scenes, where 3D scenes are usually represented as RGB-D scans or point clouds.
In this study, we investigate cross-modal knowledge transfer using a Transformer for 3D dense captioning, X-Trans2Cap, to effectively boost the performance of single-modal 3D captioning.
arXiv Detail & Related papers (2022-03-02T03:35:37Z)
- End-to-End Dense Video Captioning with Parallel Decoding [53.34238344647624]
We propose a simple yet effective framework for end-to-end dense video captioning with parallel decoding (PDVC).
PDVC precisely segments the video into a number of event pieces under the holistic understanding of the video content.
Experiments on ActivityNet Captions and YouCook2 show that PDVC produces high-quality captioning results.
arXiv Detail & Related papers (2021-08-17T17:39:15Z)
- Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation [90.74732705236336]
Language-queried video actor segmentation aims to predict the pixel mask of the actor performing the actions described by a natural language query in the target frames.
We propose a collaborative spatial-temporal encoder-decoder framework which contains a 3D temporal encoder over the video clip to recognize the queried actions, and a 2D spatial encoder over the target frame to accurately segment the queried actors.
arXiv Detail & Related papers (2021-05-14T13:27:53Z)
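As referenced in the Retrieval Enhanced Zero-Shot Video Captioning entry above, here is a minimal sketch of bridging a frozen video encoder and a frozen GPT-2 with learnable prefix tokens; the tokens and a small projection are the only trainable parameters. The PrefixBridge class, the 512-dimensional video feature, and the projection are assumptions for illustration and do not reproduce the paper's XCLIP/CLIP retrieval interface.

```python
# Sketch: learnable prefix tokens as the communication medium between a frozen
# video feature (e.g. from XCLIP) and a frozen GPT-2. Names and sizes are assumed.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class PrefixBridge(nn.Module):
    def __init__(self, video_dim=512, num_prefix=10):
        super().__init__()
        self.gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
        for p in self.gpt2.parameters():
            p.requires_grad = False                  # the language model stays frozen
        d = self.gpt2.config.n_embd
        self.prefix = nn.Parameter(torch.randn(num_prefix, d) * 0.02)
        self.project = nn.Linear(video_dim, num_prefix * d)

    def forward(self, video_feat, caption_ids):      # (B, video_dim), (B, T)
        B, d = video_feat.size(0), self.gpt2.config.n_embd
        # Inject video evidence into the learnable tokens.
        prefix = self.prefix.unsqueeze(0) + self.project(video_feat).view(B, -1, d)
        tok_emb = self.gpt2.transformer.wte(caption_ids)
        inputs = torch.cat([prefix, tok_emb], dim=1)
        # Prefix positions are masked out of the language-modeling loss.
        ignore = torch.full((B, prefix.size(1)), -100,
                            dtype=torch.long, device=caption_ids.device)
        labels = torch.cat([ignore, caption_ids], dim=1)
        return self.gpt2(inputs_embeds=inputs, labels=labels).loss
```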
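And for the Visual Fact Checker entry above, a hedged sketch of the propose-then-verify structure. The callables are hypothetical placeholders supplied by the caller; the concrete captioning models, the LLM tool use, and the elided final step are not reproduced here.

```python
# Hedged sketch of a propose-then-verify captioning pipeline.
# The captioner and fact-checker callables are placeholders, not real APIs.
from typing import Callable, List

def propose_then_verify(
    image_path: str,
    captioners: List[Callable[[str], str]],               # image path -> candidate caption
    fact_checker: Callable[[str, List[str]], List[str]],  # (image path, captions) -> kept captions
) -> List[str]:
    # 1) Proposal: several captioning models each propose an initial caption.
    proposals = [captioner(image_path) for captioner in captioners]
    # 2) Verification: an LLM equipped with tools (detection, VQA) keeps only
    #    the claims it can ground in the image.
    return fact_checker(image_path, proposals)

if __name__ == "__main__":
    # Toy stand-ins that only demonstrate the call shape.
    captioners = [lambda p: "a red chair", lambda p: "a red chair next to a desk"]
    checker = lambda p, caps: [c for c in caps if "chair" in c]
    print(propose_then_verify("scene.png", captioners, checker))
```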