CVSformer: Cross-View Synthesis Transformer for Semantic Scene
Completion
- URL: http://arxiv.org/abs/2307.07938v1
- Date: Sun, 16 Jul 2023 04:08:03 GMT
- Title: CVSformer: Cross-View Synthesis Transformer for Semantic Scene
Completion
- Authors: Haotian Dong (1), Enhui Ma (1), Lubo Wang (1), Miaohui Wang (2),
Wuyuan Xie (2), Qing Guo (3), Ping Li (4), Lingyu Liang (5), Kairui Yang (6),
Di Lin (1) ((1) Tianjin University, (2) Shenzhen University, (3) A*STAR, (4)
The Hong Kong Polytechnic University, (5) South China University of
Technology, (6) Alibaba Damo Academy)
- Abstract summary: We propose Cross-View Synthesis Transformer (CVSformer), which consists of Multi-View Feature Synthesis and Cross-View Transformer for learning cross-view object relationships.
We use the enhanced features to predict the geometric occupancies and semantic labels of all voxels.
We evaluate CVSformer on public datasets, where CVSformer yields state-of-the-art results.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semantic scene completion (SSC) requires an accurate understanding of the
geometric and semantic relationships between the objects in a 3D scene in order to
reason about occluded objects. Popular SSC methods voxelize the 3D objects, allowing a
deep 3D convolutional network (3D CNN) to learn object relationships from complex
scenes. However, current networks lack controllable kernels for modeling object
relationships across multiple views, where appropriate views provide relevant
information that suggests the existence of occluded objects. In this paper, we propose
the Cross-View Synthesis Transformer (CVSformer), which consists of Multi-View Feature
Synthesis and a Cross-View Transformer for learning cross-view object relationships.
In multi-view feature synthesis, we use a set of differently rotated 3D convolutional
kernels to compute multi-view features for each voxel. In the cross-view transformer,
we employ cross-view fusion to comprehensively learn cross-view relationships, which
provide useful information for enhancing the features of individual views. We use the
enhanced features to predict the geometric occupancies and semantic labels of all
voxels. We evaluate CVSformer on public datasets, where it yields state-of-the-art
results.
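
The abstract describes two components: multi-view feature synthesis with differently rotated 3D convolutional kernels, and a cross-view transformer that fuses the per-view features. Below is a minimal, hypothetical PyTorch sketch of that idea; the class names, the choice of four views, and the trick of rotating the feature volume instead of the kernels themselves are assumptions made for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of the two CVSformer components named in the abstract.
# Assumptions: 4 views, rotation of the voxel volume (not the kernels),
# and plain multi-head cross-attention as the cross-view fusion step.
import torch
import torch.nn as nn


class MultiViewFeatureSynthesis(nn.Module):
    """Approximate 'rotated kernels' by rotating the voxel volume,
    convolving with a shared 3D kernel, and rotating the result back."""

    def __init__(self, channels: int, num_views: int = 4):
        super().__init__()
        self.num_views = num_views
        self.conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> list:
        # x: (B, C, D, H, W) voxel features
        views = []
        for k in range(self.num_views):
            rotated = torch.rot90(x, k, dims=(3, 4))   # rotate in the horizontal plane
            feat = self.conv(rotated)                  # shared kernel ~ one rotated kernel
            views.append(torch.rot90(feat, -k, dims=(3, 4)))
        return views                                   # one feature volume per view


class CrossViewFusion(nn.Module):
    """Fuse each view with the remaining views via cross-attention over voxel tokens."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, views: list) -> list:
        B, C, D, H, W = views[0].shape
        tokens = [v.flatten(2).transpose(1, 2) for v in views]  # (B, DHW, C) per view
        enhanced = []
        for i, q in enumerate(tokens):
            kv = torch.cat([t for j, t in enumerate(tokens) if j != i], dim=1)
            out, _ = self.attn(q, kv, kv)              # query one view, attend to the rest
            out = self.norm(q + out)
            enhanced.append(out.transpose(1, 2).reshape(B, C, D, H, W))
        return enhanced


if __name__ == "__main__":
    x = torch.randn(1, 16, 4, 8, 8)                    # toy voxel feature grid
    views = MultiViewFeatureSynthesis(16)(x)
    fused = CrossViewFusion(16)(views)
    print([f.shape for f in fused])                    # enhanced per-view features
```

The paper's actual kernel rotation and cross-view transformer are more elaborate; this sketch only illustrates how per-view voxel features could be produced and then fused with cross-attention before predicting occupancies and semantic labels.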
Related papers
- Dynamic Scene Understanding through Object-Centric Voxelization and Neural Rendering [57.895846642868904]
We present a 3D generative model named DynaVol-S for dynamic scenes that enables object-centric learning.
Object-centric voxelization infers per-object occupancy probabilities at individual spatial locations.
Our approach integrates 2D semantic features to create 3D semantic grids, representing the scene through multiple disentangled voxel grids.
arXiv Detail & Related papers (2024-07-30T15:33:58Z) - Variational Inference for Scalable 3D Object-centric Learning [19.445804699433353]
We tackle the task of scalable unsupervised object-centric representation learning on 3D scenes.
Existing approaches to object-centric representation learning show limitations in generalizing to larger scenes.
We propose to learn view-invariant 3D object representations in localized object coordinate systems.
arXiv Detail & Related papers (2023-09-25T10:23:40Z) - Self-supervised Learning by View Synthesis [62.27092994474443]
We present view-synthesis autoencoders (VSA), a self-supervised learning framework designed for vision transformers.
In each iteration, the input to VSA is one view (or multiple views) of a 3D object and the output is a synthesized image in another target pose.
arXiv Detail & Related papers (2023-04-22T06:12:13Z) - Viewpoint Equivariance for Multi-View 3D Object Detection [35.4090127133834]
State-of-the-art methods focus on reasoning and decoding object bounding boxes from multi-view camera input.
We introduce VEDet, a novel 3D object detection framework that exploits 3D multi-view geometry.
arXiv Detail & Related papers (2023-03-25T19:56:41Z) - Object Scene Representation Transformer [56.40544849442227]
We introduce Object Scene Representation Transformer (OSRT), a 3D-centric model in which individual object representations naturally emerge through novel view synthesis.
OSRT scales to significantly more complex scenes with larger diversity of objects and backgrounds than existing methods.
It is multiple orders of magnitude faster at compositional rendering thanks to its light field parametrization and the novel Slot Mixer decoder.
arXiv Detail & Related papers (2022-06-14T15:40:47Z) - VoRTX: Volumetric 3D Reconstruction With Transformers for Voxelwise View
Selection and Fusion [68.68537312256144]
VoRTX is an end-to-end volumetric 3D reconstruction network using transformers for wide-baseline, multi-view feature fusion.
We train our model on ScanNet and show that it produces better reconstructions than state-of-the-art methods.
arXiv Detail & Related papers (2021-12-01T02:18:11Z) - Self-Supervised Multi-View Learning via Auto-Encoding 3D Transformations [61.870882736758624]
We propose a novel self-supervised paradigm to learn Multi-View Transformation Equivariant Representations (MV-TER).
Specifically, we perform a 3D transformation on a 3D object, and obtain multiple views before and after the transformation via projection.
Then, we self-train a representation to capture the intrinsic 3D object representation by decoding 3D transformation parameters from the fused feature representations of multiple views before and after the transformation.
arXiv Detail & Related papers (2021-03-01T06:24:17Z) - Stable View Synthesis [100.86844680362196]
We present Stable View Synthesis (SVS).
Given a set of source images depicting a scene from freely distributed viewpoints, SVS synthesizes new views of the scene.
SVS outperforms state-of-the-art view synthesis methods both quantitatively and qualitatively on three diverse real-world datasets.
arXiv Detail & Related papers (2020-11-14T07:24:43Z)