KVN: Keypoints Voting Network with Differentiable RANSAC for Stereo Pose
Estimation
- URL: http://arxiv.org/abs/2307.11543v3
- Date: Mon, 4 Mar 2024 10:49:10 GMT
- Title: KVN: Keypoints Voting Network with Differentiable RANSAC for Stereo Pose
Estimation
- Authors: Ivano Donadi and Alberto Pretto
- Abstract summary: We introduce a differentiable RANSAC layer into a well-known monocular pose estimation network.
We show that the differentiable RANSAC layer plays a significant role in the accuracy of the proposed method.
- Score: 1.1603243575080535
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Object pose estimation is a fundamental computer vision task exploited in
several robotics and augmented reality applications. Many established
approaches rely on predicting 2D-3D keypoint correspondences using RANSAC
(Random sample consensus) and estimating the object pose using the PnP
(Perspective-n-Point) algorithm. Since RANSAC is non-differentiable,
correspondences cannot be directly learned in an end-to-end fashion. In this
paper, we address the stereo image-based object pose estimation problem by i)
introducing a differentiable RANSAC layer into a well-known monocular pose
estimation network; ii) exploiting an uncertainty-driven multi-view PnP solver
which can fuse information from multiple views. We evaluate our approach on a
challenging public stereo object pose estimation dataset and a custom-built
dataset we call Transparent Tableware Dataset (TTD), yielding state-of-the-art
results against other recent approaches. Furthermore, in our ablation study, we
show that the differentiable RANSAC layer plays a significant role in the
accuracy of the proposed method. We release with this paper the code of our
method and the TTD dataset.
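The two components above can be pictured concretely. Below is a minimal, illustrative PyTorch sketch of a differentiable RANSAC for 2D keypoint localization, assuming PVNet-style per-pixel direction votes and DSAC-style soft hypothesis scoring: minimal sets of votes are intersected into keypoint hypotheses, scored with a soft (sigmoid) inlier count instead of a hard threshold, and averaged under a softmax distribution so gradients reach the voting network. All function names, thresholds, and the scoring function are assumptions for illustration, not the authors' released implementation.

```python
import torch

def intersect_votes(pix, dirs):
    # pix, dirs: (H, 2, 2) -- H minimal sets of two pixels and their
    # unit direction votes; returns the (H, 2) line intersections.
    p1, p2 = pix[:, 0], pix[:, 1]
    d1, d2 = dirs[:, 0], dirs[:, 1]
    # Solve p1 + t * d1 = p2 + s * d2 for t via 2D cross products.
    denom = d1[:, 0] * d2[:, 1] - d1[:, 1] * d2[:, 0]
    diff = p2 - p1
    t = (diff[:, 0] * d2[:, 1] - diff[:, 1] * d2[:, 0]) / (denom + 1e-8)
    return p1 + t[:, None] * d1

def soft_inlier_scores(hyp, pix, dirs, tau=0.99, beta=50.0):
    # Soft inlier count: a sigmoid on the agreement (cosine) between each
    # pixel's vote and the direction toward the hypothesis; unlike the
    # hard threshold of classic RANSAC, this keeps gradients alive.
    rel = hyp[:, None, :] - pix[None, :, :]              # (H, N, 2)
    rel = rel / (rel.norm(dim=-1, keepdim=True) + 1e-8)
    cos = (rel * dirs[None, :, :]).sum(-1)               # (H, N)
    return torch.sigmoid(beta * (cos - tau)).sum(-1)     # (H,)

def differentiable_ransac(pix, dirs, n_hyp=128):
    # pix: (N, 2) pixel coordinates; dirs: (N, 2) unit votes predicted
    # by the network. Returns the expected keypoint location.
    idx = torch.randint(0, pix.shape[0], (n_hyp, 2))
    hyp = intersect_votes(pix[idx], dirs[idx])           # (n_hyp, 2)
    w = torch.softmax(soft_inlier_scores(hyp, pix, dirs), dim=0)
    return (w[:, None] * hyp).sum(0)                     # (2,), differentiable
```

Because every step is differentiable, a keypoint-regression loss backpropagates into the voting network. The spread of the hypothesis distribution can also serve as a per-keypoint uncertainty estimate, which is the kind of quantity an uncertainty-driven multi-view PnP solver can use to weight correspondences from each view.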
Related papers
- CVAM-Pose: Conditional Variational Autoencoder for Multi-Object Monocular Pose Estimation [3.5379836919221566]
Estimating rigid objects' poses is one of the fundamental problems in computer vision.
This paper presents a novel approach, CVAM-Pose, for multi-object monocular pose estimation.
arXiv Detail & Related papers (2024-10-11T17:26:27Z)
- Divide and Conquer: Improving Multi-Camera 3D Perception with 2D Semantic-Depth Priors and Input-Dependent Queries [30.17281824826716]
Existing techniques often neglect the synergistic effects of semantic and depth cues, leading to classification and position estimation errors.
We propose an input-aware Transformer framework that leverages Semantics and Depth as priors.
Our approach involves the use of an S-D Encoder that explicitly models semantic and depth priors, thereby disentangling the learning process of object categorization and position estimation.
arXiv Detail & Related papers (2024-08-13T13:51:34Z)
- Category-level Object Detection, Pose Estimation and Reconstruction from Stereo Images [15.921719523588996]
Existing monocular and RGB-D methods suffer from scale ambiguity due to missing or erroneous depth measurements.
We present CODERS, a one-stage approach for Category-level Object Detection, pose Estimation and Reconstruction from Stereo images.
Our dataset, code, and demos will be available on our project page.
arXiv Detail & Related papers (2024-07-09T15:59:03Z)
- Towards Unified 3D Object Detection via Algorithm and Data Unification [70.27631528933482]
We build the first unified multi-modal 3D object detection benchmark MM-Omni3D and extend the aforementioned monocular detector to its multi-modal version.
We name the designed monocular and multi-modal detectors as UniMODE and MM-UniMODE, respectively.
arXiv Detail & Related papers (2024-02-28T18:59:31Z)
- Geometric-aware Pretraining for Vision-centric 3D Object Detection [77.7979088689944]
We propose a novel geometric-aware pretraining framework called GAPretrain.
GAPretrain serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors.
We achieve 46.2 mAP and 55.5 NDS on the nuScenes val set using the BEVFormer method, with a gain of 2.7 and 2.1 points, respectively.
arXiv Detail & Related papers (2023-04-06T14:33:05Z)
- CPPF++: Uncertainty-Aware Sim2Real Object Pose Estimation by Vote Aggregation [67.12857074801731]
We introduce a novel method, CPPF++, designed for sim-to-real pose estimation.
To address the challenge posed by vote collision, we propose a novel approach that involves modeling the voting uncertainty.
We incorporate several innovative modules, including noisy pair filtering, online alignment optimization, and a feature ensemble.
arXiv Detail & Related papers (2022-11-24T03:27:00Z)
- Simultaneous Multiple Object Detection and Pose Estimation using 3D Model Infusion with Monocular Vision [21.710141497071373]
Multiple object detection and pose estimation are vital computer vision tasks.
We propose simultaneous neural modeling of both using monocular vision and 3D model infusion.
Our Simultaneous Multiple Object detection and Pose Estimation network (SMOPE-Net) is an end-to-end trainable multitasking network.
arXiv Detail & Related papers (2022-11-21T05:18:56Z)
- 3DMODT: Attention-Guided Affinities for Joint Detection & Tracking in 3D Point Clouds [95.54285993019843]
We propose a method for joint detection and tracking of multiple objects in 3D point clouds.
Our model exploits temporal information employing multiple frames to detect objects and track them in a single network.
arXiv Detail & Related papers (2022-11-01T20:59:38Z)
- Semantic keypoint-based pose estimation from single RGB frames [64.80395521735463]
We present an approach to estimating the continuous 6-DoF pose of an object from a single RGB image.
The approach combines semantic keypoints predicted by a convolutional network (convnet) with a deformable shape model.
We show that our approach can accurately recover the 6-DoF object pose for both instance- and class-based scenarios. (A minimal example of the keypoint-then-PnP pipeline this family of methods builds on appears after this list.)
arXiv Detail & Related papers (2022-04-12T15:03:51Z)
- Self-supervised Learning of 3D Object Understanding by Data Association and Landmark Estimation for Image Sequence [15.815583594196488]
3D object understanding from a 2D image is a challenging task that infers an additional dimension from reduced-dimensional information.
It is challenging to obtain a large amount of 3D data since 3D annotation is expensive and time-consuming.
We propose a strategy to exploit multiple observations of the object in the image sequence in order to surpass the performance achievable from a single observation.
arXiv Detail & Related papers (2021-04-14T18:59:08Z)
- Object-Centric Multi-View Aggregation [86.94544275235454]
We present an approach for aggregating a sparse set of views of an object in order to compute a semi-implicit 3D representation in the form of a volumetric feature grid.
Key to our approach is an object-centric canonical 3D coordinate system into which views can be lifted, without explicit camera pose estimation.
We show that computing a symmetry-aware mapping from pixels to the canonical coordinate system allows us to better propagate information to unseen regions.
arXiv Detail & Related papers (2020-07-20T17:38:31Z)
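For contrast with the differentiable layer sketched earlier, the classic non-differentiable pipeline referenced in the abstract (and underlying several of the keypoint-based entries above) can be reproduced in a few lines with OpenCV: predict 2D keypoints, pair them with the corresponding 3D model keypoints, and let RANSAC plus PnP recover the pose. The keypoints and camera intrinsics below are random placeholders; in practice the 2D points come from a keypoint network.

```python
import numpy as np
import cv2

# Placeholder data: 8 object-frame 3D keypoints and their 2D detections.
obj_pts = np.random.rand(8, 3).astype(np.float32)                  # (N, 3)
img_pts = (np.random.rand(8, 2) * [640, 480]).astype(np.float32)   # (N, 2)
K = np.array([[600., 0., 320.],                                    # intrinsics
              [0., 600., 240.],
              [0., 0., 1.]], dtype=np.float32)

# RANSAC rejects outlier correspondences; PnP solves for the 6-DoF pose.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    obj_pts, img_pts, K, None,
    iterationsCount=100, reprojectionError=3.0)
if ok:
    R, _ = cv2.Rodrigues(rvec)   # axis-angle -> 3x3 rotation matrix
    print("R:\n", R, "\nt:", tvec.ravel(), "\ninliers:", len(inliers))
```

Because RANSAC's hard inlier test is non-differentiable, no gradient flows from the pose back to the keypoint predictor here, which is exactly the limitation the differentiable RANSAC layer above is meant to remove.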