RayTran: 3D pose estimation and shape reconstruction of multiple objects from videos with ray-traced transformers
- URL: http://arxiv.org/abs/2203.13296v1
- Date: Thu, 24 Mar 2022 18:49:12 GMT
- Title: RayTran: 3D pose estimation and shape reconstruction of multiple objects from videos with ray-traced transformers
- Authors: Michał J. Tyszkiewicz, Kevis-Kokitsi Maninis, Stefan Popov, Vittorio Ferrari
- Abstract summary: We propose a transformer-based neural network architecture for multi-object 3D reconstruction from RGB videos.
We exploit knowledge about the image formation process to significantly sparsify the attention weight matrix.
- Compared to previous methods, our architecture is single-stage and end-to-end trainable.
- Score: 41.499325832227626
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a transformer-based neural network architecture for multi-object
3D reconstruction from RGB videos. It relies on two alternative ways to
represent its knowledge: as a global 3D grid of features and an array of
view-specific 2D grids. We progressively exchange information between the two
with a dedicated bidirectional attention mechanism. We exploit knowledge about
the image formation process to significantly sparsify the attention weight
matrix, making our architecture feasible on current hardware, both in terms of
memory and computation. We attach a DETR-style head on top of the 3D feature
grid in order to detect the objects in the scene and to predict their 3D pose
and 3D shape. Compared to previous methods, our architecture is single-stage,
end-to-end trainable, and it can reason holistically about a scene from
multiple video frames without needing a brittle tracking step. We evaluate our
method on the challenging Scan2CAD dataset, where we outperform (1) recent
state-of-the-art methods for 3D object pose estimation from RGB videos; and (2)
a strong alternative method combining Multi-view Stereo with RGB-D CAD
alignment. We plan to release our source code.
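To make the sparsification concrete, here is a minimal sketch of one 2D-to-3D ray-traced attention pass, assuming a pinhole camera model, a single head, and one sampled pixel per voxel/view pair; the real architecture exchanges information bidirectionally across several layers and uses learned query/key/value projections, and all names below are illustrative rather than the authors' API.

```python
import torch

def project(points, K, w2c):
    """Project world-space points to pixel coordinates (pinhole model)."""
    # points: (V, 3); K: (3, 3) intrinsics; w2c: (4, 4) world-to-camera.
    h = torch.cat([points, torch.ones(points.shape[0], 1)], dim=1)  # (V, 4)
    cam = (w2c @ h.T).T[:, :3]                                      # (V, 3)
    pix = (K @ cam.T).T                                             # (V, 3)
    return pix[:, :2] / pix[:, 2:].clamp(min=1e-6), cam[:, 2]       # uv, depth

def ray_traced_attention(vox_q, view_feats, centers, Ks, w2cs):
    """vox_q: (V, d) voxel queries; view_feats: (N, H, W, d) per-view 2D
    grids; centers: (V, 3) voxel centers. Each voxel attends only to the N
    pixels its center projects to (one per view) instead of all N*H*W
    pixels, so the attention matrix is sparsified by image-formation
    geometry."""
    N, H, W, d = view_feats.shape
    V = vox_q.shape[0]
    keys = torch.zeros(V, N, d)
    mask = torch.zeros(V, N, dtype=torch.bool)
    for n in range(N):
        uv, z = project(centers, Ks[n], w2cs[n])
        u, v = uv[:, 0].round().long(), uv[:, 1].round().long()
        vis = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        keys[vis, n] = view_feats[n, v[vis], u[vis]]
        mask[:, n] = vis
    logits = (vox_q.unsqueeze(1) * keys).sum(-1) / d ** 0.5         # (V, N)
    attn = torch.softmax(logits.masked_fill(~mask, float("-inf")), dim=1)
    attn = torch.nan_to_num(attn)  # voxels seen by no camera get zero update
    return vox_q + (attn.unsqueeze(-1) * keys).sum(1)               # residual
```

The 3D-to-2D direction transposes the same pattern: each pixel attends only to the voxels along its viewing ray, rather than to the full 3D grid.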
Related papers
- General Geometry-aware Weakly Supervised 3D Object Detection [62.26729317523975]
A unified framework is developed for learning 3D object detectors from RGB images and associated 2D boxes.
Experiments on KITTI and SUN-RGBD datasets demonstrate that our method yields surprisingly high-quality 3D bounding boxes with only 2D annotation.
arXiv Detail & Related papers (2024-07-18T17:52:08Z)
- Unsupervised Learning of Category-Level 3D Pose from Object-Centric Videos [15.532504015622159]
Category-level 3D pose estimation is a fundamentally important problem in computer vision and robotics.
We tackle the problem of learning to estimate the category-level 3D pose only from casually taken object-centric videos.
arXiv Detail & Related papers (2024-07-05T09:43:05Z)
- Viewpoint Equivariance for Multi-View 3D Object Detection [35.4090127133834]
State-of-the-art methods focus on reasoning and decoding object bounding boxes from multi-view camera input.
We introduce VEDet, a novel 3D object detection framework that exploits 3D multi-view geometry.
arXiv Detail & Related papers (2023-03-25T19:56:41Z)
- Learning 3D Scene Priors with 2D Supervision [37.79852635415233]
We propose a new method to learn 3D scene priors of layout and shape without requiring any 3D ground truth.
Our method represents a 3D scene as a latent vector, from which we progressively decode a sequence of objects characterized by their class categories.
Experiments on 3D-FRONT and ScanNet show that our method outperforms the state of the art in single-view reconstruction.
arXiv Detail & Related papers (2022-11-25T15:03:32Z)
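The entry above states only that a scene latent is progressively decoded into a sequence of objects; the toy sketch below shows one plausible wiring, with a hypothetical GRU cell emitting class logits plus a stop flag at each step (the actual method also predicts object pose and shape, omitted here for brevity).

```python
import torch
import torch.nn as nn

class SceneDecoder(nn.Module):
    """Toy autoregressive decoder: a scene latent is unrolled into a
    sequence of objects, each step emitting class logits plus a "stop"
    class that marks the end of the scene."""
    def __init__(self, latent_dim=256, num_classes=9, max_objects=16):
        super().__init__()
        self.max_objects = max_objects
        self.rnn = nn.GRUCell(latent_dim, latent_dim)
        self.cls_head = nn.Linear(latent_dim, num_classes + 1)  # +1 = stop
        self.start = nn.Parameter(torch.zeros(latent_dim))

    def forward(self, scene_latent):                  # (B, latent_dim)
        h, x = scene_latent, self.start.expand_as(scene_latent)
        steps = []
        for _ in range(self.max_objects):
            h = self.rnn(x, h)                        # advance the sequence
            steps.append(self.cls_head(h))            # (B, num_classes + 1)
            x = h                                     # feed the state back in
        return torch.stack(steps, dim=1)              # (B, max_objects, C + 1)
```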
- Bridged Transformer for Vision and Point Cloud 3D Object Detection [92.86856146086316]
Bridged Transformer (BrT) is an end-to-end architecture for 3D object detection.
BrT learns to identify 3D and 2D object bounding boxes from both points and image patches.
We experimentally show that BrT surpasses state-of-the-art methods on SUN RGB-D and ScanNetV2 datasets.
arXiv Detail & Related papers (2022-10-04T05:44:22Z)
- Weakly Supervised Learning of Multi-Object 3D Scene Decompositions Using Deep Shape Priors [69.02332607843569]
PriSMONet is a novel approach for learning multi-object 3D scene decomposition and representations from single images.
A recurrent encoder regresses a latent representation of 3D shape, pose and texture of each object from an input RGB image.
We evaluate the accuracy of our model in inferring 3D scene layout, demonstrate its generative capabilities, assess its generalization to real images, and point out benefits of the learned representation.
arXiv Detail & Related papers (2020-10-08T14:49:23Z)
- Tracking Emerges by Looking Around Static Scenes, with Neural 3D Mapping [23.456046776979903]
We propose to leverage multi-view data of static points in arbitrary scenes (static or dynamic) to learn a neural 3D mapping module.
The neural 3D mapper consumes RGB-D data as input, and produces a 3D voxel grid of deep features as output.
We show that our unsupervised 3D object trackers outperform prior unsupervised 2D and 2.5D trackers, and approach the accuracy of supervised trackers.
arXiv Detail & Related papers (2020-08-04T02:59:23Z)
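As a rough illustration of the neural 3D mapper's input/output contract (RGB-D in, voxel grid of deep features out), the sketch below lifts per-pixel features into 3D with the depth map and average-pools them into voxels. It assumes a camera-space grid, omits extrinsics and the learned 2D backbone, and every name is invented for the example.

```python
import torch

def unproject_to_voxels(feat2d, depth, K, grid_min, voxel_size, dims):
    """feat2d: (H, W, d) per-pixel deep features; depth: (H, W) depth map;
    K: (3, 3) intrinsics; grid_min: (3,) grid origin; dims: (X, Y, Z).
    Lifts each pixel to its 3D point and average-pools features per voxel."""
    H, W, d = feat2d.shape
    X, Y, Z = dims
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    z = depth.reshape(-1)
    # Back-project pixels to camera space: p = z * K^-1 [u, v, 1]^T
    pix = torch.stack([u.reshape(-1).float(), v.reshape(-1).float(),
                       torch.ones(H * W)], dim=0)                 # (3, HW)
    pts = (torch.linalg.inv(K) @ pix).T * z.unsqueeze(1)          # (HW, 3)
    idx = ((pts - grid_min) / voxel_size).long()                  # voxel ids
    ok = (z > 0) & (idx >= 0).all(1) & (idx < torch.tensor([X, Y, Z])).all(1)
    flat = (idx[ok, 0] * Y + idx[ok, 1]) * Z + idx[ok, 2]
    grid = torch.zeros(X * Y * Z, d)
    count = torch.zeros(X * Y * Z, 1)
    grid.index_add_(0, flat, feat2d.reshape(-1, d)[ok])
    count.index_add_(0, flat, torch.ones(int(ok.sum()), 1))
    return (grid / count.clamp(min=1)).reshape(X, Y, Z, d)
```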
- Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
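The entry above fuses views into a latent that is disentangled from camera viewpoints, then re-conditions it on camera projection operators. The toy sketch below illustrates that idea under a strong simplification (not the paper's actual architecture): the latent is a stack of 3-vectors on which known camera rotations act linearly, so rotating each view's code into a shared canonical frame and averaging yields a viewpoint-free pose code, and rotating back re-entangles it for per-view 2D decoding.

```python
import torch
import torch.nn as nn

class DisentangledFusion(nn.Module):
    """Toy camera-disentangled fusion with rotation-equivariant latents."""
    def __init__(self, feat_dim=128, num_joints=17):
        super().__init__()
        self.decode = nn.Linear(feat_dim * 3, num_joints * 2)  # 2D keypoints

    def fuse(self, latents, rots):
        # latents: (N, F, 3) per-view codes; rots: (N, 3, 3) camera rotations.
        # Rotate every view's code into the canonical frame, then pool.
        canonical = torch.einsum("nij,nfj->nfi", rots.transpose(1, 2), latents)
        return canonical.mean(dim=0)                     # (F, 3), view-free

    def forward(self, latents, rots):
        z = self.fuse(latents, rots)
        per_view = torch.einsum("nij,fj->nfi", rots, z)  # re-entangle per view
        return self.decode(per_view.flatten(1))          # (N, num_joints * 2)
```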
- DSGN: Deep Stereo Geometry Network for 3D Object Detection [79.16397166985706]
There is a large performance gap between image-based and LiDAR-based 3D object detectors.
Our method, called Deep Stereo Geometry Network (DSGN), significantly reduces this gap.
For the first time, we provide a simple and effective one-stage stereo-based 3D detection pipeline.
arXiv Detail & Related papers (2020-01-10T11:44:37Z)
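DSGN's summary above is terse, so as a hedged sketch of the standard construction such stereo pipelines start from (not DSGN's exact code, which further lifts the volume into world space), the function below sweeps candidate disparities over rectified stereo features, pairing left features with shifted right features so a 3D CNN can reason about geometry downstream.

```python
import torch

def stereo_feature_volume(feat_l, feat_r, max_disp):
    """feat_l, feat_r: (C, H, W) rectified stereo feature maps.
    Returns a (2C, max_disp, H, W) disparity-swept feature volume."""
    C, H, W = feat_l.shape
    volume = feat_l.new_zeros(2 * C, max_disp, H, W)
    for disp in range(max_disp):
        volume[:C, disp] = feat_l                 # reference (left) features
        if disp == 0:
            volume[C:, disp] = feat_r
        else:
            # Right features shifted by the candidate disparity.
            volume[C:, disp, :, disp:] = feat_r[:, :, :-disp]
    return volume
```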