Self-supervised Wide Baseline Visual Servoing via 3D Equivariance
- URL: http://arxiv.org/abs/2209.05432v1
- Date: Mon, 12 Sep 2022 17:38:26 GMT
- Title: Self-supervised Wide Baseline Visual Servoing via 3D Equivariance
- Authors: Jinwook Huh, Jungseok Hong, Suveer Garg, Hyun Soo Park, and Volkan
Isler
- Abstract summary: This paper presents a novel self-supervised visual servoing method for wide baseline images.
Existing approaches that regress absolute camera pose with respect to an object require 3D ground truth data of the object.
It yields a more than 35% reduction in average distance error and a success rate above 90% with a 3 cm error tolerance.
- Score: 35.93323183558956
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the challenging input settings for visual servoing is when the initial
and goal camera views are far apart. Such settings are difficult because the
wide baseline can cause drastic changes in object appearance and introduce
occlusions. This paper presents a novel self-supervised visual servoing method
for wide baseline images which does not require 3D ground truth supervision.
Existing approaches that regress absolute camera pose with respect to an object
require 3D ground truth data of the object in the form of 3D bounding boxes or
meshes. We learn a coherent visual representation by leveraging a geometric
property called 3D equivariance: the representation transforms in a
predictable way as a function of the 3D transformation. To ensure that the
feature space is faithful to the underlying geodesic space, a geodesic-preserving
constraint is applied in conjunction with the equivariance. We
design a Siamese network that can effectively enforce these two geometric
properties without requiring 3D supervision. With the learned model, the
relative transformation can be inferred simply by following the gradient in the
learned space, and this estimate is used as feedback for closed-loop visual
servoing. Our method is evaluated on objects from the YCB dataset and meaningfully
outperforms state-of-the-art approaches that use 3D supervision on a visual
servoing (object alignment) task: it yields a more than 35% reduction in average
distance error and a success rate above 90% with a 3 cm error tolerance.
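The abstract combines three technical ingredients: a 3D-equivariance objective, a geodesic-preserving constraint (both enforced through a Siamese encoder), and gradient-based inference of the relative pose at test time. The PyTorch sketch below is a loose, hypothetical illustration of how such pieces could fit together, assuming the representation is a set of 3D feature points and that relative camera poses between training views are available (e.g., from robot or camera motion). The encoder, the exact loss forms, and every name here are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code) of the two geometric objectives and
# the gradient-following pose inference described in the abstract.
import torch
import torch.nn as nn


class FeatureEncoder(nn.Module):
    """Siamese branch (shared weights): maps an image to K 3D feature points."""

    def __init__(self, k_points: int = 64):
        super().__init__()
        self.k_points = k_points
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 3 * k_points),
        )

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        return self.backbone(img).view(-1, self.k_points, 3)  # (B, K, 3)


def equivariance_loss(f1, f2, R12, t12):
    """3D equivariance: features of view 1, moved by the relative pose
    (R12, t12), should match features of view 2: f2 ~ R12 f1 + t12."""
    f1_moved = torch.einsum("bij,bkj->bki", R12, f1) + t12[:, None, :]
    return (f1_moved - f2).norm(dim=-1).mean()


def geodesic_preserving_loss(f1, f2, R12):
    """Feature-space distance should track the geodesic (rotation-angle)
    distance between the two camera poses -- one possible instantiation."""
    trace = torch.diagonal(R12, dim1=-2, dim2=-1).sum(-1)
    angle = torch.acos(((trace - 1.0) / 2.0).clamp(-1.0, 1.0))  # in [0, pi]
    feat_dist = (f1 - f2).norm(dim=-1).mean(dim=-1)
    return (feat_dist - angle).abs().mean()


def training_step(encoder, img1, img2, R12, t12, lam=0.1):
    """Siamese pass over two views of the same object; (R12, t12) is the relative
    camera motion, so no 3D ground truth of the object itself is needed."""
    f1, f2 = encoder(img1), encoder(img2)
    return equivariance_loss(f1, f2, R12, t12) + lam * geodesic_preserving_loss(f1, f2, R12)


def axis_angle_to_matrix(omega: torch.Tensor) -> torch.Tensor:
    """Rodrigues' formula: axis-angle vector -> 3x3 rotation matrix (differentiable)."""
    theta = omega.norm().clamp(min=1e-8)
    k = omega / theta
    zero = torch.zeros((), dtype=omega.dtype)
    K = torch.stack([
        torch.stack([zero, -k[2], k[1]]),
        torch.stack([k[2], zero, -k[0]]),
        torch.stack([-k[1], k[0], zero]),
    ])
    return torch.eye(3) + torch.sin(theta) * K + (1.0 - torch.cos(theta)) * (K @ K)


def servoing_step(encoder, img_current, img_goal, steps=50, lr=1e-2):
    """Infer the relative pose by gradient descent in the learned feature space;
    the resulting (omega, v) can serve as a closed-loop velocity command."""
    f_cur = encoder(img_current).detach()
    f_goal = encoder(img_goal).detach()
    omega = torch.full((3,), 1e-3, requires_grad=True)  # small init avoids Rodrigues singularity
    v = torch.zeros(3, requires_grad=True)
    opt = torch.optim.Adam([omega, v], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        R = axis_angle_to_matrix(omega)
        loss = (torch.einsum("ij,bkj->bki", R, f_cur) + v - f_goal).pow(2).mean()
        loss.backward()
        opt.step()
    return omega.detach(), v.detach()
```

In this reading, the equivariance term ties the feature points to the camera motion while the geodesic term keeps feature distances commensurate with rotation angles, so the test-time gradient descent over (omega, v) has a well-shaped landscape to follow; the paper's actual architecture and losses may differ.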
Related papers
- Inverse Neural Rendering for Explainable Multi-Object Tracking [35.072142773300655]
We recast 3D multi-object tracking from RGB cameras as an Inverse Rendering (IR) problem.
We optimize an image loss over generative latent spaces that inherently disentangle shape and appearance properties.
We validate the generalization and scaling capabilities of our method by learning the generative prior exclusively from synthetic data.
arXiv Detail & Related papers (2024-04-18T17:37:53Z)
- Sculpt3D: Multi-View Consistent Text-to-3D Generation with Sparse 3D Prior [57.986512832738704]
We present a new framework Sculpt3D that equips the current pipeline with explicit injection of 3D priors from retrieved reference objects without re-training the 2D diffusion model.
Specifically, we demonstrate that high-quality and diverse 3D geometry can be guaranteed by keypoints supervision through a sparse ray sampling approach.
These two decoupled designs effectively harness 3D information from reference objects to generate 3D objects while preserving the generation quality of the 2D diffusion model.
arXiv Detail & Related papers (2024-03-14T07:39:59Z)
- Text2Control3D: Controllable 3D Avatar Generation in Neural Radiance Fields using Geometry-Guided Text-to-Image Diffusion Model [39.64952340472541]
We propose a text-to-3D avatar generation method with controllable facial expression.
Our main strategy is to construct the 3D avatar in Neural Radiance Fields (NeRF) optimized with a set of controlled viewpoint-aware images.
We demonstrate the empirical results and discuss the effectiveness of our method.
arXiv Detail & Related papers (2023-09-07T08:14:46Z)
- BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects [89.2314092102403]
We present a near real-time method for 6-DoF tracking of an unknown object from a monocular RGBD video sequence.
Our method works for arbitrary rigid objects, even when visual texture is largely absent.
arXiv Detail & Related papers (2023-03-24T17:13:49Z)
- Explicit3D: Graph Network with Spatial Inference for Single Image 3D Object Detection [35.85544715234846]
We propose a dynamic sparse graph pipeline named Explicit3D based on object geometry and semantics features.
Our experimental results on the SUN RGB-D dataset demonstrate that Explicit3D achieves a better performance balance than the state of the art.
arXiv Detail & Related papers (2023-02-13T16:19:54Z)
- MvDeCor: Multi-view Dense Correspondence Learning for Fine-grained 3D Segmentation [91.6658845016214]
We propose to utilize self-supervised techniques in the 2D domain for fine-grained 3D shape segmentation tasks.
We render a 3D shape from multiple views, and set up a dense correspondence learning task within the contrastive learning framework.
As a result, the learned 2D representations are view-invariant and geometrically consistent.
arXiv Detail & Related papers (2022-08-18T00:48:15Z)
- Homography Loss for Monocular 3D Object Detection [54.04870007473932]
A differentiable loss function, termed the Homography Loss, is proposed; it exploits both 2D and 3D information.
Our method outperforms the other state-of-the-art methods by a large margin on the KITTI 3D dataset.
arXiv Detail & Related papers (2022-04-02T03:48:03Z)
- DProST: 6-DoF Object Pose Estimation Using Space Carving and Dynamic Projective Spatial Transformer [20.291172201922084]
Most deep learning-based pose estimation methods require CAD data to use 3D intermediate representations or project 2D appearance.
We propose a new pose estimation system consisting of a space carving module that reconstructs a reference 3D feature to replace the CAD data.
Also, we overcome the self-occlusion problem by a new Bidirectional Z-buffering (BiZ-buffer) method, which extracts both the front view and the self-occluded back view of the object.
arXiv Detail & Related papers (2021-12-16T10:39:09Z)
- Neural Articulated Radiance Field [90.91714894044253]
We present Neural Articulated Radiance Field (NARF), a novel deformable 3D representation for articulated objects learned from images.
Experiments show that the proposed method is efficient and can generalize well to novel poses.
arXiv Detail & Related papers (2021-04-07T13:23:14Z)
- 3D Object Recognition By Corresponding and Quantizing Neural 3D Scene Representations [29.61554189447989]
We propose a system that learns to detect objects and infer their 3D poses in RGB-D images.
Many existing systems can identify objects and infer 3D poses, but they heavily rely on human labels and 3D annotations.
arXiv Detail & Related papers (2020-10-30T13:56:09Z)
- Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.