6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-based
Instance Representation Learning
- URL: http://arxiv.org/abs/2110.04792v1
- Date: Sun, 10 Oct 2021 13:34:16 GMT
- Title: 6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-based
Instance Representation Learning
- Authors: Lu Zou and Zhangjin Huang
- Abstract summary: 6D-ViT is a transformer-based instance representation learning network.
It is suitable for highly accurate category-level object pose estimation on RGB-D images.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents 6D-ViT, a transformer-based instance representation
learning network, which is suitable for highly accurate category-level object
pose estimation on RGB-D images. Specifically, a novel two-stream
encoder-decoder framework is dedicated to exploring complex and powerful
instance representations from RGB images, point clouds and categorical shape
priors. For this purpose, the whole framework consists of two main branches,
named Pixelformer and Pointformer. The Pixelformer contains a pyramid
transformer encoder with an all-MLP decoder to extract pixelwise appearance
representations from RGB images, while the Pointformer relies on a cascaded
transformer encoder and an all-MLP decoder to acquire the pointwise geometric
characteristics from point clouds. Then, dense instance representations (i.e.,
correspondence matrix, deformation field) are obtained from a multi-source
aggregation network with shape priors, appearance and geometric information as
input. Finally, the instance 6D pose is computed by leveraging the
correspondence among dense representations, shape priors, and the instance
point clouds. Extensive experiments on both synthetic and real-world datasets
demonstrate that the proposed 3D instance representation learning framework
achieves state-of-the-art performance, significantly outperforming existing
methods.
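The final step described above, computing the instance 6D pose from the correspondence among dense representations, shape priors, and the instance point cloud, is the kind of point-set alignment commonly solved with the Umeyama similarity algorithm. The sketch below is a minimal NumPy illustration under that assumption, not the paper's exact procedure; the function name is hypothetical:

```python
import numpy as np

def umeyama_pose(src, dst):
    """Recover a similarity transform (scale s, rotation R, translation t)
    such that dst ~= s * R @ src_i + t for corresponded point sets,
    via SVD of the cross-covariance (Umeyama, 1991)."""
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    # Cross-covariance between centered target and source points.
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)
    # Reflection guard: keep R a proper rotation (det(R) = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t
```

For noise-free correspondences this recovers the transform exactly; with predicted (noisy) correspondences it gives the least-squares optimal similarity alignment.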
Related papers
- Boosting Cross-Domain Point Classification via Distilling Relational Priors from 2D Transformers [59.0181939916084]
Traditional 3D networks mainly focus on local geometric details and ignore the topological structure between local geometries.
We propose a novel Relational Priors Distillation (RPD) method to extract relational priors from transformers well-trained on massive images.
Experiments on the PointDA-10 and Sim-to-Real datasets verify that the proposed method consistently achieves state-of-the-art unsupervised domain adaptation (UDA) performance for point cloud classification.
arXiv Detail & Related papers (2024-07-26T06:29:09Z)
- TransPose: 6D Object Pose Estimation with Geometry-Aware Transformer [16.674933679692728]
TransPose is a novel 6D pose framework that exploits a Transformer with a geometry-aware module to learn better point cloud feature representations.
TransPose achieves competitive results on three benchmark datasets.
arXiv Detail & Related papers (2023-10-25T01:24:12Z)
- PSFormer: Point Transformer for 3D Salient Object Detection [8.621996554264275]
PSFormer is an encoder-decoder network that takes full advantage of transformers to model contextual information.
In the encoder, we develop a Point Context Transformer (PCT) module to capture region contextual features at the point level.
In the decoder, we develop a Scene Context Transformer (SCT) module to learn context representations at the scene level.
arXiv Detail & Related papers (2022-10-28T06:34:28Z)
- Bridged Transformer for Vision and Point Cloud 3D Object Detection [92.86856146086316]
Bridged Transformer (BrT) is an end-to-end architecture for 3D object detection.
BrT learns to identify 3D and 2D object bounding boxes from both points and image patches.
We experimentally show that BrT surpasses state-of-the-art methods on SUN RGB-D and ScanNetV2 datasets.
arXiv Detail & Related papers (2022-10-04T05:44:22Z)
- PointMCD: Boosting Deep Point Cloud Encoders via Multi-view Cross-modal Distillation for 3D Shape Recognition [55.38462937452363]
We propose a unified multi-view cross-modal distillation architecture, including a pretrained deep image encoder as the teacher and a deep point encoder as the student.
By pair-wise aligning multi-view visual and geometric descriptors, we can obtain more powerful deep point encoders without exhaustive and complicated network modifications.
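The pair-wise alignment of teacher (image) and student (point) descriptors can be sketched as a simple per-view cosine objective. This is a hedged illustration of the general idea, not the paper's exact loss; names and shapes are hypothetical:

```python
import numpy as np

def alignment_loss(img_desc, pt_desc):
    """Mean (1 - cosine similarity) over matched view pairs.
    img_desc: (V, D) teacher descriptors, one per rendered view.
    pt_desc:  (V, D) student descriptors from the point encoder."""
    a = img_desc / np.linalg.norm(img_desc, axis=1, keepdims=True)
    b = pt_desc / np.linalg.norm(pt_desc, axis=1, keepdims=True)
    cos = (a * b).sum(axis=1)        # per-pair cosine similarity
    return float((1.0 - cos).mean()) # 0 when descriptors align perfectly
```

Minimizing this pulls the student's geometric descriptors toward the frozen teacher's visual ones, which is the essence of cross-modal distillation.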
arXiv Detail & Related papers (2022-07-07T07:23:20Z)
- FFB6D: A Full Flow Bidirectional Fusion Network for 6D Pose Estimation [54.666329929930455]
We present FFB6D, a full flow bidirectional fusion network designed for 6D pose estimation from a single RGB-D image.
We learn to combine appearance and geometry information for representation learning as well as output representation selection.
Our method outperforms the state-of-the-art by large margins on several benchmarks.
arXiv Detail & Related papers (2021-03-03T08:07:29Z)
- Learning Geometry-Disentangled Representation for Complementary Understanding of 3D Object Point Cloud [50.56461318879761]
We propose Geometry-Disentangled Attention Network (GDANet) for 3D point cloud processing.
GDANet disentangles point clouds into the contour and flat parts of 3D objects, denoted by sharp and gentle variation components, respectively.
Experiments on 3D object classification and segmentation benchmarks demonstrate that GDANet achieves state-of-the-art results with fewer parameters.
arXiv Detail & Related papers (2020-12-20T13:35:00Z)
- Shape Prior Deformation for Categorical 6D Object Pose and Size Estimation [62.618227434286]
We present a novel learning approach to recover the 6D poses and sizes of unseen object instances from an RGB-D image.
We propose a deep network to reconstruct the 3D object model by explicitly modeling the deformation from a pre-learned categorical shape prior.
arXiv Detail & Related papers (2020-07-16T16:45:05Z)
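The shape-prior formulation summarized in this last entry, which 6D-ViT's dense representations (correspondence matrix, deformation field) also build on, can be illustrated with a minimal sketch: a categorical prior is deformed per vertex, and a soft correspondence matrix maps each observed point to canonical coordinates. All names and shapes below are hypothetical:

```python
import numpy as np

def reconstruct_and_correspond(prior, deformation, assignment):
    """prior:       (M, 3) categorical mean shape in canonical space
       deformation: (M, 3) predicted per-vertex offsets
       assignment:  (N, M) row-stochastic soft correspondence matrix
    Returns the deformed instance model and, for each of the N observed
    points, its canonical-space coordinate."""
    instance_model = prior + deformation   # deform the prior to the instance
    canonical = assignment @ instance_model  # soft-assign points to canonical coords
    return instance_model, canonical
```

The observed points paired with these canonical coordinates are exactly the corresponded sets a similarity alignment (e.g. Umeyama) needs to recover pose and size.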
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.