SIMstack: A Generative Shape and Instance Model for Unordered Object
Stacks
- URL: http://arxiv.org/abs/2103.16442v1
- Date: Tue, 30 Mar 2021 15:42:43 GMT
- Authors: Zoe Landgraf, Raluca Scona, Tristan Laidlow, Stephen James, Stefan
Leutenegger, Andrew J. Davison
- Abstract summary: We propose a depth-conditioned Variational Auto-Encoder (VAE) trained on a dataset of objects stacked under physics simulation.
We formulate instance segmentation as a centre voting task which allows for class-agnostic detection and doesn't require setting the maximum number of objects in the scene.
Our method has practical applications in providing robots some of the ability humans have to make rapid intuitive inferences of partially observed scenes.
- Score: 38.042876641457255
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: By estimating 3D shape and instances from a single view, we can capture
information about an environment quickly, without the need for comprehensive
scanning and multi-view fusion. Solving this task for composite scenes (such as
object stacks) is challenging: occluded areas are not only ambiguous in shape
but also in instance segmentation; multiple decompositions could be valid. We
observe that physics constrains decomposition as well as shape in occluded
regions and hypothesise that a latent space learned from scenes built under
physics simulation can serve as a prior to better predict shape and instances
in occluded regions. To this end we propose SIMstack, a depth-conditioned
Variational Auto-Encoder (VAE), trained on a dataset of objects stacked under
physics simulation. We formulate instance segmentation as a centre voting task
which allows for class-agnostic detection and doesn't require setting the
maximum number of objects in the scene. At test time, our model can generate 3D
shape and instance segmentation from a single depth view, probabilistically
sampling proposals for the occluded region from the learned latent space. Our
method has practical applications in providing robots some of the ability
humans have to make rapid intuitive inferences of partially observed scenes. We
demonstrate an application for precise (non-disruptive) object grasping of
unknown objects from a single depth view.
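The centre-voting formulation described in the abstract can be sketched in a few lines. The following is an illustrative reconstruction, not the authors' code: each point (e.g. a voxel of the predicted shape) regresses an offset to its instance centre, and the resulting votes are clustered greedily, so no maximum object count is ever fixed in advance. The function name and radius parameter are hypothetical.

```python
import numpy as np

def cluster_centre_votes(points, offsets, radius=0.5):
    """Greedy clustering of centre votes (illustrative sketch).

    Each point votes for its instance centre (point + predicted offset);
    votes within `radius` of an existing cluster centre are merged.
    The number of instances is not fixed in advance.
    """
    votes = points + offsets
    centres = []
    labels = np.full(len(votes), -1)
    for i, v in enumerate(votes):
        for k, c in enumerate(centres):
            if np.linalg.norm(v - c) < radius:
                labels[i] = k
                break
        else:
            # No existing centre is close enough: open a new instance.
            centres.append(v)
            labels[i] = len(centres) - 1
    return np.array(centres), labels
```

Because clusters are created on demand, the same routine handles a stack of two objects or twenty without any change, which is what makes the detection class-agnostic.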
Related papers
- Learning Embeddings with Centroid Triplet Loss for Object Identification in Robotic Grasping [14.958823096408175]
Foundation models are a strong trend in deep learning and computer vision.
Here, we focus on training such an object identification model.
The key to training such a model is the centroid triplet loss (CTL), which aggregates image features around their class centroids.
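As a rough illustration of the idea (a simplified sketch, not the paper's implementation), a centroid triplet loss pulls each feature towards the centroid of its own identity and pushes it away from the nearest other centroid by at least a margin:

```python
import numpy as np

def centroid_triplet_loss(features, labels, margin=0.2):
    """Illustrative centroid triplet loss (CTL) sketch.

    Features of each identity are averaged into a centroid; each sample's
    distance to its own centroid should be smaller than its distance to
    the nearest other centroid by at least `margin` (hinge formulation).
    """
    classes = np.unique(labels)
    centroids = {c: features[labels == c].mean(axis=0) for c in classes}
    loss = 0.0
    for f, y in zip(features, labels):
        d_pos = np.linalg.norm(f - centroids[y])
        d_neg = min(np.linalg.norm(f - centroids[c])
                    for c in classes if c != y)
        loss += max(0.0, d_pos - d_neg + margin)
    return loss / len(features)
```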
arXiv Detail & Related papers (2024-04-09T13:01:26Z)
- Robust Shape Fitting for 3D Scene Abstraction [33.84212609361491]
In particular, we can describe man-made environments using volumetric primitives such as cuboids or cylinders.
We propose a robust estimator for primitive fitting, which meaningfully abstracts complex real-world environments using cuboids.
Results on the NYU Depth v2 dataset demonstrate that the proposed algorithm successfully abstracts cluttered real-world 3D scene layouts.
arXiv Detail & Related papers (2024-03-15T16:37:43Z)
- Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast Contrastive Fusion [110.84357383258818]
We propose a novel approach to lift 2D segments to 3D and fuse them by means of a neural field representation.
The core of our approach is a slow-fast clustering objective function, which is scalable and well-suited for scenes with a large number of objects.
Our approach outperforms the state-of-the-art on challenging scenes from the ScanNet, Hypersim, and Replica datasets.
arXiv Detail & Related papers (2023-06-07T17:57:45Z)
- 3D-IntPhys: Towards More Generalized 3D-grounded Visual Intuitive Physics under Challenging Scenes [68.66237114509264]
We present a framework capable of learning 3D-grounded visual intuitive physics models from videos of complex scenes with fluids.
We show our model can make long-horizon future predictions by learning from raw images and significantly outperforms models that do not employ an explicit 3D representation space.
arXiv Detail & Related papers (2023-04-22T19:28:49Z)
- Category-level Shape Estimation for Densely Cluttered Objects [94.64287790278887]
We propose a category-level shape estimation method for densely cluttered objects.
Our framework partitions each object in the clutter via multi-view visual information fusion.
Experiments in the simulated environment and real world show that our method achieves high shape estimation accuracy.
arXiv Detail & Related papers (2023-02-23T13:00:17Z)
- ALSO: Automotive Lidar Self-supervision by Occupancy estimation [70.70557577874155]
We propose a new self-supervised method for pre-training the backbone of deep perception models operating on point clouds.
The core idea is to train the model on a pretext task which is the reconstruction of the surface on which the 3D points are sampled.
The intuition is that if the network is able to reconstruct the scene surface, given only sparse input points, then it probably also captures some fragments of semantic information.
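This pretext task can be sketched roughly as follows. The sampling scheme here (free-space queries along each lidar ray, an occupied query at the return) is a common simplification for illustration and not necessarily the paper's exact procedure; the function name and ratios are hypothetical.

```python
import numpy as np

def occupancy_pretext_pairs(points, sensor_origin, n_free=4):
    """Generate (query, label) pairs for an occupancy pretext task (sketch).

    For each lidar return, queries sampled before the hit along the ray
    are labelled free (0.0) and the return itself occupied (1.0), so a
    network can be pre-trained to reconstruct the scanned surface from
    sparse input points, without any manual annotation.
    """
    queries, labels = [], []
    for p in points:
        ray = p - sensor_origin
        for t in np.linspace(0.2, 0.9, n_free):
            queries.append(sensor_origin + t * ray)  # free-space sample
            labels.append(0.0)
        queries.append(p)  # the return lies on the surface
        labels.append(1.0)
    return np.array(queries), np.array(labels)
```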
arXiv Detail & Related papers (2022-12-12T13:10:19Z)
- Generative Category-Level Shape and Pose Estimation with Semantic Primitives [27.692997522812615]
We propose a novel framework for category-level object shape and pose estimation from a single RGB-D image.
To handle the intra-category variation, we adopt a semantic primitive representation that encodes diverse shapes into a unified latent space.
We show that the proposed method achieves state-of-the-art pose estimation performance and better generalization on real-world data.
arXiv Detail & Related papers (2022-10-03T17:51:54Z)
- ShAPO: Implicit Representations for Multi-Object Shape, Appearance, and Pose Optimization [40.36229450208817]
We present ShAPO, a method for joint multi-object detection, 3D textured reconstruction, 6D object pose and size estimation.
Key to ShAPO is a single-shot pipeline to regress shape, appearance and pose latent codes along with the masks of each object instance.
Our method significantly outperforms all baselines on the NOCS dataset, with an 8% absolute improvement in mAP for 6D pose estimation.
arXiv Detail & Related papers (2022-07-27T17:59:31Z)
- Cuboids Revisited: Learning Robust 3D Shape Fitting to Single RGB Images [44.223070672713455]
In particular, man-made environments commonly consist of volumetric primitives such as cuboids or cylinders.
Previous approaches directly estimate shape parameters from a 2D or 3D input, and are only able to reproduce simple objects.
We propose a robust estimator for primitive fitting, which can meaningfully abstract real-world environments using cuboids.
arXiv Detail & Related papers (2021-05-05T13:36:00Z)
- From Points to Multi-Object 3D Reconstruction [71.17445805257196]
We propose a method to detect and reconstruct multiple 3D objects from a single RGB image.
A keypoint detector localizes objects as center points and directly predicts all object properties, including 9-DoF bounding boxes and 3D shapes.
The presented approach performs lightweight reconstruction in a single stage; it is real-time capable, fully differentiable, and end-to-end trainable.
arXiv Detail & Related papers (2020-12-21T18:52:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.