Generating Visual Spatial Description via Holistic 3D Scene Understanding
- URL: http://arxiv.org/abs/2305.11768v2
- Date: Thu, 25 May 2023 04:20:46 GMT
- Title: Generating Visual Spatial Description via Holistic 3D Scene Understanding
- Authors: Yu Zhao, Hao Fei, Wei Ji, Jianguo Wei, Meishan Zhang, Min Zhang, Tat-Seng Chua
- Abstract summary: Visual spatial description (VSD) aims to generate texts that describe the spatial relations of the given objects within images.
With an external 3D scene extractor, we obtain the 3D objects and scene features for input images.
We construct a target object-centered 3D spatial scene graph (Go3D-S2G), such that we model the spatial semantics of target objects within the holistic 3D scenes.
- Score: 88.99773815159345
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Visual spatial description (VSD) aims to generate texts that describe the
spatial relations of the given objects within images. Existing VSD work merely
models the 2D geometrical vision features, thus inevitably falling prey to the
problem of skewed spatial understanding of target objects. In this work, we
investigate the incorporation of 3D scene features for VSD. With an external 3D
scene extractor, we obtain the 3D objects and scene features for input images,
based on which we construct a target object-centered 3D spatial scene graph
(Go3D-S2G), such that we model the spatial semantics of target objects within
the holistic 3D scenes. Besides, we propose a scene subgraph selecting
mechanism, sampling topologically-diverse subgraphs from Go3D-S2G, where the
diverse local structure features are navigated to yield spatially-diversified
text generation. Experimental results on two VSD datasets demonstrate that our
framework outperforms the baselines significantly, especially improving on the
cases with complex visual spatial relations. Meanwhile, our method can produce
more spatially-diversified generation. Code is available at
https://github.com/zhaoyucs/VSD.
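To make the abstract's pipeline concrete, the following is a minimal, illustrative Python sketch of its two core ideas: building a target object-centered 3D spatial scene graph from per-object 3D centroids, and sampling topologically different subgraphs around the target objects to encourage diversified descriptions. The data structures, relation labels, and sampling heuristic are assumptions made for illustration; this is not the authors' Go3D-S2G implementation (see the repository above for the actual code).
```python
# Illustrative sketch only: a toy target object-centered 3D spatial scene
# graph and a simple subgraph sampler. Object names, relation labels, and
# the sampling heuristic are assumptions, not the paper's Go3D-S2G.
import random
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Object3D:
    name: str
    centroid: tuple  # (x, y, z); the coordinate convention is an assumption


def spatial_relation(a, b):
    """Label the edge a->b by the dominant axis difference of the centroids."""
    dx = b.centroid[0] - a.centroid[0]
    dy = b.centroid[1] - a.centroid[1]
    dz = b.centroid[2] - a.centroid[2]
    candidates = [("right of" if dx > 0 else "left of", abs(dx)),
                  ("above" if dy > 0 else "below", abs(dy)),
                  ("behind" if dz > 0 else "in front of", abs(dz))]
    return max(candidates, key=lambda t: t[1])[0]


def build_scene_graph(objects, targets):
    """Connect every object pair with a coarse spatial relation and mark the
    target objects -- a stand-in for a target object-centered scene graph."""
    edges = {(a.name, b.name): spatial_relation(a, b)
             for a, b in combinations(objects, 2)}
    return {"nodes": {o.name: o for o in objects},
            "edges": edges,
            "targets": set(targets)}


def sample_subgraph(graph, extra_nodes=1, rng=random):
    """Keep the targets plus a random subset of other nodes, so repeated calls
    expose different local structures (the 'subgraph selecting' intuition)."""
    keep = set(graph["targets"])
    others = sorted(set(graph["nodes"]) - keep)
    keep |= set(rng.sample(others, k=min(extra_nodes, len(others))))
    sub_edges = {(u, v): r for (u, v), r in graph["edges"].items()
                 if u in keep and v in keep}
    return keep, sub_edges


if __name__ == "__main__":
    objs = [Object3D("cat", (0.0, 0.0, 2.0)),
            Object3D("sofa", (1.5, -0.2, 2.1)),
            Object3D("lamp", (1.4, 1.0, 3.0))]
    g = build_scene_graph(objs, targets=["cat", "sofa"])
    for _ in range(2):
        nodes, edges = sample_subgraph(g, extra_nodes=1)
        print(sorted(nodes), {f"{u} -> {v}": r for (u, v), r in edges.items()})
```
Running the sketch twice prints two different target-centered subgraphs over the same toy scene, which conveys the intuition behind spatially-diversified generation.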
Related papers
- Dynamic Scene Understanding through Object-Centric Voxelization and Neural Rendering [57.895846642868904]
We present a 3D generative model named DynaVol-S for dynamic scenes that enables object-centric learning.
The voxelization step infers per-object occupancy probabilities at individual spatial locations.
Our approach integrates 2D semantic features to create 3D semantic grids, representing the scene through multiple disentangled voxel grids.
arXiv Detail & Related papers (2024-07-30T15:33:58Z)
- Weakly-Supervised 3D Scene Graph Generation via Visual-Linguistic Assisted Pseudo-labeling [9.440800948514449]
We propose a weakly-supervised 3D scene graph generation method via Visual-Linguistic Assisted Pseudo-labeling.
Our 3D-VLAP exploits the superior ability of current large-scale visual-linguistic models to align the semantics between texts and 2D images.
We design an edge self-attention based graph neural network to generate scene graphs of 3D point cloud scenes.
arXiv Detail & Related papers (2024-04-03T07:30:09Z)
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
- Volumetric Environment Representation for Vision-Language Navigation [66.04379819772764]
Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions.
We introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells.
VER predicts 3D occupancy, 3D room layout, and 3D bounding boxes jointly.
arXiv Detail & Related papers (2024-03-21T06:14:46Z)
- SceneWiz3D: Towards Text-guided 3D Scene Composition [134.71933134180782]
Existing approaches either leverage large text-to-image models to optimize a 3D representation or train 3D generators on object-centric datasets.
We introduce SceneWiz3D, a novel approach to synthesize high-fidelity 3D scenes from text.
arXiv Detail & Related papers (2023-12-13T18:59:30Z)
- 3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding [58.924180772480504]
3D visual grounding aims to localize the target object in a 3D point cloud by a free-form language description.
We propose a relation-aware one-stage framework, named 3D Relative Position-aware Network (3DRP-Net).
arXiv Detail & Related papers (2023-07-25T09:33:25Z)
- Graph-DETR3D: Rethinking Overlapping Regions for Multi-View 3D Object Detection [17.526914782562528]
We propose Graph-DETR3D to automatically aggregate multi-view imagery information through graph structure learning (GSL).
Our best model achieves 49.5 NDS on the nuScenes test leaderboard, achieving new state-of-the-art in comparison with various published image-view 3D object detectors.
arXiv Detail & Related papers (2022-04-25T12:10:34Z)
- Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds [20.172702468478057]
Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding.
We propose a transformer-based encoder-decoder architecture, namely SpaCap3D, to transform objects into descriptions.
Our proposed SpaCap3D outperforms the baseline method Scan2Cap by 4.94% and 9.61% in CIDEr@0.5IoU on two benchmark datasets, respectively.
arXiv Detail & Related papers (2022-04-22T13:07:37Z)
- Disentangling 3D Prototypical Networks For Few-Shot Concept Learning [29.02523358573336]
We present neural architectures that disentangle RGB-D images into objects' shapes and styles and a map of the background scene.
Our networks incorporate architectural biases that reflect the image formation process, 3D geometry of the world scene, and shape-style interplay.
arXiv Detail & Related papers (2020-11-06T14:08:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.