PQ-Transformer: Jointly Parsing 3D Objects and Layouts from Point Clouds
- URL: http://arxiv.org/abs/2109.05566v1
- Date: Sun, 12 Sep 2021 17:31:59 GMT
- Title: PQ-Transformer: Jointly Parsing 3D Objects and Layouts from Point Clouds
- Authors: Xiaoxue Chen, Hao Zhao, Guyue Zhou, Ya-Qin Zhang
- Abstract summary: 3D scene understanding from point clouds plays a vital role for various robotic applications.
Current state-of-the-art methods use separate neural networks for different tasks like object detection or room layout estimation.
We propose the first transformer architecture that predicts 3D objects and layouts simultaneously.
- Score: 4.381579507834533
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 3D scene understanding from point clouds plays a vital role for various
robotic applications. Unfortunately, current state-of-the-art methods use
separate neural networks for different tasks like object detection or room
layout estimation. Such a scheme has two limitations: 1) Storing and running
several networks for different tasks is expensive for typical robotic
platforms. 2) The intrinsic structure among the separate outputs is ignored and
potentially violated. To this end, we propose the first transformer
architecture that predicts 3D objects and layouts simultaneously, using point
cloud inputs. Unlike existing methods that either estimate layout keypoints or
edges, we directly parameterize the room layout as a set of quads. As such, the
proposed architecture is termed P(oint)Q(uad)-Transformer. Along with the
novel quad representation, we propose a tailored physical constraint loss
function that discourages object-layout interference. The quantitative and
qualitative evaluations on the public benchmark ScanNet show that the proposed
PQ-Transformer succeeds in jointly parsing 3D objects and layouts, running at a
quasi-real-time rate (8.91 FPS) without efficiency-oriented optimization.
Moreover, the new physical constraint loss improves strong baselines, and the
room-layout F1-score rises significantly from 37.9% to 57.9%.
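The abstract's two main technical ideas, representing the room layout as a set of quads and penalizing object-layout interference with a physical constraint loss, can be illustrated with a short sketch. The NumPy code below is not the authors' implementation: the quad fields (center, normal, in-plane size), the hinge-style penalty, and all function names are assumptions made only to show the general idea.

```python
# Illustrative sketch (not the PQ-Transformer code): a minimal quad
# parameterization for room layout and a hinge-style penalty that grows when
# object box corners cross a layout quad. Only the quad representation and the
# "physical constraint" intent come from the abstract; everything else here is
# an assumption for illustration.
import numpy as np

def make_quad(center, normal, size):
    """A layout quad as center (3,), unit normal (3,), and in-plane size (2,)."""
    normal = np.asarray(normal, float)
    normal = normal / np.linalg.norm(normal)
    return {"center": np.asarray(center, float),
            "normal": normal,
            "size": np.asarray(size, float)}

def signed_distance_to_quad(points, quad):
    """Signed distance of points (N, 3) to the quad's supporting plane."""
    return (points - quad["center"]) @ quad["normal"]

def interference_penalty(object_corners, quads, margin=0.0):
    """Hinge penalty on object corners (M, 8, 3) that lie behind a layout quad,
    assuming each quad normal points into the room."""
    penalty = 0.0
    for quad in quads:
        d = signed_distance_to_quad(object_corners.reshape(-1, 3), quad)
        # Corners with d < -margin are on the wrong side of the wall; penalize
        # them linearly, leave corners inside the room untouched.
        penalty += np.clip(-(d + margin), 0.0, None).mean()
    return penalty / max(len(quads), 1)

# Toy usage: one wall quad and one object box poking 10 cm through it.
wall = make_quad(center=[0, 0, 1.5], normal=[1, 0, 0], size=[4.0, 3.0])
box = np.array([[[x, y, z] for x in (-0.1, 0.9)
                            for y in (-0.5, 0.5)
                            for z in (0.0, 1.0)]])  # (1, 8, 3) corners
print(interference_penalty(box, [wall]))  # > 0 because some corners have x < 0
```

In the toy usage, a few corners of the box sit behind the wall plane, so the penalty is positive; a term of this kind, added to the usual detection losses, would push predicted boxes back inside the estimated layout, which is the effect the physical constraint loss is described as having.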
Related papers
- CabiNet: Scaling Neural Collision Detection for Object Rearrangement with Procedural Scene Generation [54.68738348071891]
We first generate over 650K cluttered scenes - orders of magnitude more than prior work - in diverse everyday environments.
We render synthetic partial point clouds from this data and use it to train our CabiNet model architecture.
CabiNet is a collision model that accepts object and scene point clouds, captured from a single-view depth observation.
arXiv Detail & Related papers (2023-04-18T21:09:55Z) - Hierarchical Point Attention for Indoor 3D Object Detection [111.04397308495618]
This work proposes two novel attention operations as generic hierarchical designs for point-based transformer detectors.
First, we propose Multi-Scale Attention (MS-A) that builds multi-scale tokens from a single-scale input feature to enable more fine-grained feature learning.
Second, we propose Size-Adaptive Local Attention (Local-A) with adaptive attention regions for localized feature aggregation within bounding box proposals.
arXiv Detail & Related papers (2023-01-06T18:52:12Z) - Exploiting More Information in Sparse Point Cloud for 3D Single Object Tracking [9.693724357115762]
3D single object tracking is a key task in 3D computer vision.
The sparsity of point clouds makes it difficult to compute the similarity and locate the object.
We propose a sparse-to-dense and transformer-based framework for 3D single object tracking.
arXiv Detail & Related papers (2022-10-02T13:38:30Z) - SEFormer: Structure Embedding Transformer for 3D Object Detection [22.88983416605276]
Structure-Embedding transFormer (SEFormer) not only preserves local structure, as a traditional Transformer does, but also has the ability to encode it.
SEFormer achieves 79.02% mAP, which is 1.2% higher than prior work.
arXiv Detail & Related papers (2022-09-05T03:38:12Z) - CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
In this work, we adopt transformers and incorporate them into a hierarchical framework for shape classification as well as part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art mean accuracy on shape classification and yields results on par with previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z) - Neural 3D Scene Reconstruction with the Manhattan-world Assumption [58.90559966227361]
This paper addresses the challenge of reconstructing 3D indoor scenes from multi-view images.
Planar constraints can be conveniently integrated into the recent implicit neural representation-based reconstruction methods.
The proposed method outperforms previous methods by a large margin on 3D reconstruction quality.
arXiv Detail & Related papers (2022-05-05T17:59:55Z) - Embracing Single Stride 3D Object Detector with Sparse Transformer [63.179720817019096]
In LiDAR-based 3D object detection for autonomous driving, the ratio of the object size to input scene size is significantly smaller compared to 2D detection cases.
Many 3D detectors directly follow the common practice of 2D detectors, which downsample the feature maps even after quantizing the point clouds.
We propose Single-stride Sparse Transformer (SST) to maintain the original resolution from the beginning to the end of the network.
arXiv Detail & Related papers (2021-12-13T02:12:02Z) - Dynamic Convolution for 3D Point Cloud Instance Segmentation [146.7971476424351]
We propose an approach to instance segmentation from 3D point clouds based on dynamic convolution.
We gather homogeneous points that share the same semantic category and whose votes for the geometric centroid lie close together.
The proposed approach is proposal-free, and instead exploits a convolution process that adapts to the spatial and semantic characteristics of each instance.
arXiv Detail & Related papers (2021-07-18T09:05:16Z) - HyperFlow: Representing 3D Objects as Surfaces [19.980044265074298]
We present a novel generative model that leverages hypernetworks to create continuous 3D object representations in the form of lightweight surfaces (meshes) directly from point clouds.
We obtain continuous mesh-based object representations that yield better qualitative results than competing approaches.
arXiv Detail & Related papers (2020-06-15T19:18:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.