UniT3D: A Unified Transformer for 3D Dense Captioning and Visual
Grounding
- URL: http://arxiv.org/abs/2212.00836v1
- Date: Thu, 1 Dec 2022 19:45:09 GMT
- Title: UniT3D: A Unified Transformer for 3D Dense Captioning and Visual
Grounding
- Authors: Dave Zhenyu Chen, Ronghang Hu, Xinlei Chen, Matthias Nießner,
Angel X. Chang
- Abstract summary: Performing 3D dense captioning and visual grounding requires a common and shared understanding of the underlying multimodal relationships.
We propose UniT3D, a transformer-based architecture for jointly solving 3D visual grounding and dense captioning.
- Score: 41.15622591021133
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Performing 3D dense captioning and visual grounding requires a common and
shared understanding of the underlying multimodal relationships. However,
despite previous attempts at connecting these two related tasks with highly
task-specific neural modules, how to explicitly model their shared nature and
learn them simultaneously remains understudied. In this work, we
propose UniT3D, a simple yet effective fully unified transformer-based
architecture for jointly solving 3D visual grounding and dense captioning.
UniT3D enables learning a strong multimodal representation across the two tasks
through a supervised joint pre-training scheme with bidirectional and
seq-to-seq objectives. With its generic architecture design, UniT3D allows the
pre-training scope to be expanded to a broader range of training sources, such
as data synthesized from 2D prior knowledge, to benefit 3D vision-language
tasks.
Extensive experiments and analysis demonstrate that UniT3D obtains significant
gains for 3D dense captioning and visual grounding.
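As a rough illustration of the joint scheme the abstract describes, the sketch below shares a single transformer encoder between a grounding head (scoring object proposals against the query) and a captioning head (seq-to-seq token prediction), and simply sums the two losses. This is a minimal sketch under assumed feature dimensions, vocabulary size, and head designs, not the paper's released implementation.

```python
# Minimal sketch (not the authors' code): one shared transformer for 3D visual
# grounding and dense captioning. Dimensions and head designs are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedGroundingCaptioningModel(nn.Module):
    def __init__(self, obj_dim=128, d_model=256, vocab_size=4000,
                 num_layers=4, num_heads=8):
        super().__init__()
        self.obj_proj = nn.Linear(obj_dim, d_model)        # 3D object proposal features
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # text token embeddings
        layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)  # shared multimodal encoder
        self.grounding_head = nn.Linear(d_model, 1)        # score per proposal (grounding)
        self.caption_head = nn.Linear(d_model, vocab_size) # token prediction (captioning)

    def forward(self, obj_feats, token_ids):
        # obj_feats: (B, N, obj_dim); token_ids: (B, L)
        obj = self.obj_proj(obj_feats)
        txt = self.tok_emb(token_ids)
        # Joint self-attention over concatenated object and text tokens
        # (causal masking for the seq-to-seq objective is omitted for brevity).
        fused = self.encoder(torch.cat([obj, txt], dim=1))
        n = obj.size(1)
        ground_logits = self.grounding_head(fused[:, :n]).squeeze(-1)  # (B, N)
        caption_logits = self.caption_head(fused[:, n:])               # (B, L, vocab)
        return ground_logits, caption_logits

# Joint training step: grounding as classification over proposals, captioning as
# next-token prediction with teacher forcing; the two losses are summed.
model = UnifiedGroundingCaptioningModel()
obj_feats = torch.randn(2, 16, 128)            # 16 object proposals per scene
tokens = torch.randint(0, 4000, (2, 12))       # tokenized descriptions
target_obj = torch.tensor([3, 7])              # index of the referred object
g_logits, c_logits = model(obj_feats, tokens[:, :-1])
loss = F.cross_entropy(g_logits, target_obj) + \
       F.cross_entropy(c_logits.reshape(-1, 4000), tokens[:, 1:].reshape(-1))
loss.backward()
```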
Related papers
- Multi-modal Relation Distillation for Unified 3D Representation Learning [30.942281325891226]
Multi-modal Relation Distillation (MRD) is a tri-modal pre-training framework designed to distill large pre-trained Vision-Language Models (VLMs) into 3D backbones.
MRD aims to capture both intra-modal relations within each modality and cross-modal relations between modalities, producing more discriminative 3D shape representations.
arXiv Detail & Related papers (2024-07-19T03:43:48Z)
- InstructLayout: Instruction-Driven 2D and 3D Layout Synthesis with Semantic Graph Prior [23.536285325566013]
Comprehending natural language instructions is a desirable property for both 2D and 3D layout synthesis systems.
Existing methods implicitly model object joint distributions and express object relations, hindering the controllability of generation.
We introduce InstructLayout, a novel generative framework that integrates a semantic graph prior and a layout decoder.
arXiv Detail & Related papers (2024-07-10T12:13:39Z)
- Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments consistently demonstrates our method's superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z)
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
- OV-Uni3DETR: Towards Unified Open-Vocabulary 3D Object Detection via Cycle-Modality Propagation [67.56268991234371]
OV-Uni3DETR achieves state-of-the-art performance across various scenarios, surpassing existing methods by more than 6% on average.
Code and pre-trained models will be released later.
arXiv Detail & Related papers (2024-03-28T17:05:04Z)
- Multi-task Learning with 3D-Aware Regularization [55.97507478913053]
We propose a structured 3D-aware regularizer which interfaces multiple tasks through the projection of features extracted from an image encoder to a shared 3D feature space.
We show that the proposed method is architecture agnostic and can be plugged into various prior multi-task backbones to improve their performance.
arXiv Detail & Related papers (2023-10-02T08:49:56Z)
- Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore.
We propose a novel 3D pre-training Vision-Language method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations (a generic sketch of this kind of contrastive objective appears after this list).
arXiv Detail & Related papers (2023-06-04T11:08:53Z)
- TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding [15.617150859765024]
We exploit the Transformer for its natural suitability to permutation-invariant 3D point cloud data.
We propose a TransRefer3D network to extract entity-and-relation aware multimodal context.
Our proposed model significantly outperforms existing approaches by up to 10.6%.
arXiv Detail & Related papers (2021-08-05T05:47:12Z)
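The Multi-CLIP entry above builds on contrastive vision-language pre-training for 3D scene representations. The following is a generic, minimal sketch of that kind of objective, a symmetric InfoNCE loss between paired scene and text embeddings; the encoder outputs, feature dimension, and temperature are illustrative assumptions, and this is not the code of any paper listed here.

```python
# Generic sketch of a CLIP-style contrastive alignment objective between paired
# 3D scene embeddings and text embeddings (illustrative; not any paper's code).
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(scene_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE: row i of scene_feats matches row i of text_feats."""
    scene = F.normalize(scene_feats, dim=-1)
    text = F.normalize(text_feats, dim=-1)
    logits = scene @ text.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(scene.size(0), device=scene.device)
    loss_s2t = F.cross_entropy(logits, targets)          # scene -> matching text
    loss_t2s = F.cross_entropy(logits.t(), targets)      # text  -> matching scene
    return 0.5 * (loss_s2t + loss_t2s)

# Example with random features standing in for encoder outputs.
scene = torch.randn(8, 256, requires_grad=True)
text = torch.randn(8, 256, requires_grad=True)
loss = contrastive_alignment_loss(scene, text)
loss.backward()
```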
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.