Explore and Tell: Embodied Visual Captioning in 3D Environments
- URL: http://arxiv.org/abs/2308.10447v1
- Date: Mon, 21 Aug 2023 03:46:04 GMT
- Title: Explore and Tell: Embodied Visual Captioning in 3D Environments
- Authors: Anwen Hu, Shizhe Chen, Liang Zhang, Qin Jin
- Abstract summary: In real-world scenarios, a single image may not offer a good viewpoint, hindering fine-grained scene understanding.
We propose a novel task called Embodied Captioning, which equips visual captioning models with navigation capabilities.
We propose a Cascade Embodied Captioning model (CaBOT), which comprises a navigator and a captioner, to tackle this task.
- Score: 83.00553567094998
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While current visual captioning models have achieved impressive performance,
they often assume that the image is well-captured and provides a complete view
of the scene. In real-world scenarios, however, a single image may not offer a
good viewpoint, hindering fine-grained scene understanding. To overcome this
limitation, we propose a novel task called Embodied Captioning, which equips
visual captioning models with navigation capabilities, enabling them to
actively explore the scene and reduce visual ambiguity from suboptimal
viewpoints. Specifically, starting at a random viewpoint, an agent must
navigate the environment to gather information from different viewpoints and
generate a comprehensive paragraph describing all objects in the scene. To
support this task, we build the ET-Cap dataset with Kubric simulator,
consisting of 10K 3D scenes with cluttered objects and three annotated
paragraphs per scene. We propose a Cascade Embodied Captioning model (CaBOT),
which comprises a navigator and a captioner, to tackle this task. The
navigator predicts which actions to take in the environment, while the
captioner generates a paragraph description based on the whole navigation
trajectory. Extensive experiments demonstrate that our model outperforms other
carefully designed baselines. Our dataset, code, and models are available at
https://aim3-ruc.github.io/ExploreAndTell.
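As a rough illustration of the cascade setup described above, the sketch below runs a single navigate-then-describe episode: a navigator proposes one action at a time, the visited views accumulate into a trajectory, and a captioner is conditioned on that whole trajectory. The action set, class names, and the random policy are illustrative placeholders, not the authors' CaBOT implementation or the ET-Cap action space.

```python
# Minimal sketch of a cascade "navigate then describe" pipeline.
# All names, the action set, and the random policy are placeholders.
from dataclasses import dataclass, field
from typing import List
import random

ACTIONS = ["move_forward", "turn_left", "turn_right", "look_up", "look_down", "stop"]

@dataclass
class Observation:
    step: int
    view_id: str  # stands in for a rendered RGB frame from the simulator

@dataclass
class Trajectory:
    observations: List[Observation] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)

class Navigator:
    """Predicts the next action from the history (random stub, capped at 10 steps)."""
    def act(self, history: Trajectory) -> str:
        if len(history.actions) >= 10:
            return "stop"
        return random.choice(ACTIONS)

class Captioner:
    """Generates a paragraph conditioned on the whole navigation trajectory (stub)."""
    def describe(self, history: Trajectory) -> str:
        return f"A paragraph describing the scene observed over {len(history.observations)} views."

def run_episode(navigator: Navigator, captioner: Captioner) -> str:
    history = Trajectory()
    step = 0
    while True:
        # In the real task, the simulator would render a frame at the current viewpoint here.
        history.observations.append(Observation(step=step, view_id=f"view_{step}"))
        action = navigator.act(history)
        if action == "stop":
            break
        history.actions.append(action)
        step += 1
    return captioner.describe(history)

if __name__ == "__main__":
    print(run_episode(Navigator(), Captioner()))
```

In this sketch the captioner only sees the trajectory after navigation has stopped, mirroring the cascade structure in which action prediction and paragraph generation are handled by separate modules.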
Related papers
- 3D Vision and Language Pretraining with Large-Scale Synthetic Data [28.45763758308814]
3D Vision-Language Pre-training aims to provide a pre-trained model that can bridge 3D scenes with natural language.
We construct SynVL3D, a comprehensive synthetic scene-text corpus with 10K indoor scenes and 1M descriptions at object, view, and room levels.
We propose a synthetic-to-real domain adaptation in the downstream fine-tuning process to address the domain shift.
arXiv Detail & Related papers (2024-07-08T16:26:52Z)
- View Selection for 3D Captioning via Diffusion Ranking [54.78058803763221]
The Cap3D method renders 3D objects into 2D views for captioning using pre-trained models.
Some rendered views of 3D objects are atypical, deviating from the training data of standard image captioning models and causing hallucinations.
We present DiffuRank, a method that leverages a pre-trained text-to-3D model to assess the alignment between 3D objects and their 2D rendered views.
arXiv Detail & Related papers (2024-04-11T17:58:11Z)
- Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
- HL Dataset: Visually-grounded Description of Scenes, Actions and Rationales [5.010418546872244]
We present a dataset extending 14,997 images from the COCO dataset, aligned with a new set of 134,973 human-annotated (high-level) captions.
We further extend this dataset with confidence scores collected from an independent set of readers, as well as a set of narrative captions generated synthetically.
arXiv Detail & Related papers (2023-02-23T17:30:18Z)
- DEVICE: DEpth and VIsual ConcEpts Aware Transformer for TextCaps [10.87327544629769]
We propose a DEpth and VIsual ConcEpts Aware Transformer (DEVICE) for TextCaps.
Our DEVICE is capable of generalizing scenes more comprehensively and boosting the accuracy of described visual entities.
arXiv Detail & Related papers (2023-02-03T04:31:13Z)
- DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-aware Scene Synthesis [90.32352050266104]
DisCoScene is a 3D-aware generative model for high-quality and controllable scene synthesis.
It disentangles the whole scene into object-centric generative fields by learning on only 2D images with global-local discrimination.
We demonstrate state-of-the-art performance on many scene datasets, including a challenging outdoor dataset.
arXiv Detail & Related papers (2022-12-22T18:59:59Z)
- OpenScene: 3D Scene Understanding with Open Vocabularies [73.1411930820683]
Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a model for a single task with supervision.
We propose OpenScene, an alternative approach where a model predicts dense features for 3D scene points that are co-embedded with text and image pixels in CLIP feature space.
This zero-shot approach enables task-agnostic training and open-vocabulary queries; a minimal query sketch is given after this list.
arXiv Detail & Related papers (2022-11-28T18:58:36Z)
- Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds [20.172702468478057]
Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding.
We propose a transformer-based encoder-decoder architecture, namely SpaCap3D, to transform objects into descriptions.
Our proposed SpaCap3D outperforms the baseline method Scan2Cap by 4.94% and 9.61% in CIDEr@0.5IoU on two benchmark datasets, respectively.
arXiv Detail & Related papers (2022-04-22T13:07:37Z)
- Recognizing Scenes from Novel Viewpoints [99.90914180489456]
Humans can perceive scenes in 3D from a handful of 2D views. For AI agents, the ability to recognize a scene from any viewpoint given only a few images enables them to efficiently interact with the scene and its objects.
We propose a model which takes as input a few RGB images of a new scene and recognizes the scene from novel viewpoints by segmenting it into semantic categories.
arXiv Detail & Related papers (2021-12-02T18:59:40Z)
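For a concrete sense of how an open-vocabulary query over co-embedded 3D features can be scored (as in the OpenScene entry above), here is a small cosine-similarity sketch. The `encode_text` stub and random per-point features are hypothetical stand-ins for a CLIP-style text encoder and real co-embedded point features; this is not the OpenScene implementation.

```python
# Sketch: rank 3D points by similarity to a text prompt in a shared embedding space.
# encode_text and the random features are placeholders, not a real CLIP encoder.
import numpy as np

def encode_text(prompt: str, dim: int = 512) -> np.ndarray:
    # Placeholder for a CLIP-style text encoder; returns a unit-norm embedding.
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    vec = rng.normal(size=dim)
    return vec / np.linalg.norm(vec)

def query_points(point_features: np.ndarray, prompt: str) -> np.ndarray:
    """Return per-point cosine similarity to the prompt (higher = more relevant)."""
    text = encode_text(prompt, dim=point_features.shape[1])
    feats = point_features / np.linalg.norm(point_features, axis=1, keepdims=True)
    return feats @ text  # shape: (num_points,)

if __name__ == "__main__":
    dummy_features = np.random.default_rng(0).normal(size=(1000, 512))
    scores = query_points(dummy_features, "a wooden chair")
    print("top-5 matching point indices:", np.argsort(-scores)[:5])
```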
This list is automatically generated from the titles and abstracts of the papers on this site.