Related papers: TransPoser: Transformer as an Optimizer for Joint Object Shape and Pose Estimation

TransPoser: Transformer as an Optimizer for Joint Object Shape and Pose Estimation

URL: http://arxiv.org/abs/2303.13477v1
Date: Thu, 23 Mar 2023 17:46:54 GMT
Title: TransPoser: Transformer as an Optimizer for Joint Object Shape and Pose Estimation
Authors: Yuta Yoshitake, Mai Nishimura, Shohei Nobuhara, Ko Nishino
Abstract summary: We propose a novel method for joint estimation of shape and pose of rigid objects from their sequentially observed RGB-D images. We introduce Deep Directional Distance Function (DeepDDF), a neural network that directly outputs the depth image of an object given the camera viewpoint and viewing direction. We formulate the joint estimation itself as a Transformer which we refer to as TransPoser.
Score: 25.395619346823715
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We propose a novel method for joint estimation of shape and pose of rigid objects from their sequentially observed RGB-D images. In sharp contrast to past approaches that rely on complex non-linear optimization, we propose to formulate it as a neural optimization that learns to efficiently estimate the shape and pose. We introduce Deep Directional Distance Function (DeepDDF), a neural network that directly outputs the depth image of an object given the camera viewpoint and viewing direction, for efficient error computation in 2D image space. We formulate the joint estimation itself as a Transformer which we refer to as TransPoser. We fully leverage the tokenization and multi-head attention to sequentially process the growing set of observations and to efficiently update the shape and pose with a learned momentum, respectively. Experimental results on synthetic and real data show that DeepDDF achieves high accuracy as a category-level object shape representation and TransPoser achieves state-of-the-art accuracy efficiently for joint shape and pose estimation.

Related papers

HORT: Monocular Hand-held Objects Reconstruction with Transformers [61.36376511119355]
Reconstructing hand-held objects in 3D from monocular images is a significant challenge in computer vision. We propose a transformer-based model to efficiently reconstruct dense 3D point clouds of hand-held objects. Our method achieves state-of-the-art accuracy with much faster inference speed, while generalizing well to in-the-wild images.
arXiv Detail & Related papers (2025-03-27T09:45:09Z)
EDM: Equirectangular Projection-Oriented Dense Kernelized Feature Matching [49.15373858190824]
We introduce the first learning-based dense matching algorithm, termed Equirectangular Projection-Oriented Kernelized Feature Matching (EDM) Our proposed method achieves notable performance enhancements, with improvements of +26.72 and +42.62 in AUC@5deg on the Matterport3D and Stanford2D3D datasets.
arXiv Detail & Related papers (2025-02-28T03:37:01Z)
RDPN6D: Residual-based Dense Point-wise Network for 6Dof Object Pose Estimation Based on RGB-D Images [13.051302134031808]
We introduce a novel method for calculating the 6DoF pose of an object using a single RGB-D image. Unlike existing methods that either directly predict objects' poses or rely on sparse keypoints for pose recovery, our approach addresses this challenging task using dense correspondence.
arXiv Detail & Related papers (2024-05-14T10:10:45Z)
Toward Accurate Camera-based 3D Object Detection via Cascade Depth Estimation and Calibration [20.82054596017465]
Recent camera-based 3D object detection is limited by the precision of transforming from image to 3D feature spaces. This paper aims to address such a fundamental problem of camera-based 3D object detection: How to effectively learn depth information for accurate feature lifting and object localization.
arXiv Detail & Related papers (2024-02-07T14:21:26Z)
Diff-DOPE: Differentiable Deep Object Pose Estimation [29.703385848843414]
We introduce Diff-DOPE, a 6-DoF pose refiner that takes as input an image, a 3D textured model of an object, and an initial pose of the object. The method uses differentiable rendering to update the object pose to minimize the visual error between the image and the projection of the model. We show that this simple, yet effective, idea is able to achieve state-of-the-art results on pose estimation datasets.
arXiv Detail & Related papers (2023-09-30T18:52:57Z)
FrozenRecon: Pose-free 3D Scene Reconstruction with Frozen Depth Models [67.96827539201071]
We propose a novel test-time optimization approach for 3D scene reconstruction. Our method achieves state-of-the-art cross-dataset reconstruction on five zero-shot testing datasets.
arXiv Detail & Related papers (2023-08-10T17:55:02Z)
RNNPose: Recurrent 6-DoF Object Pose Refinement with Robust Correspondence Field Estimation and Pose Optimization [46.144194562841435]
We propose a framework based on a recurrent neural network (RNN) for object pose refinement. The problem is formulated as a non-linear least squares problem based on the estimated correspondence field. The correspondence field estimation and pose refinement are conducted alternatively in each iteration to recover accurate object poses.
arXiv Detail & Related papers (2022-03-24T06:24:55Z)
Aug3D-RPN: Improving Monocular 3D Object Detection by Synthetic Images with Virtual Depth [64.29043589521308]
We propose a rendering module to augment the training data by synthesizing images with virtual-depths. The rendering module takes as input the RGB image and its corresponding sparse depth image, outputs a variety of photo-realistic synthetic images. Besides, we introduce an auxiliary module to improve the detection model by jointly optimizing it through a depth estimation task.
arXiv Detail & Related papers (2021-07-28T11:00:47Z)
GDRNPP: A Geometry-guided and Fully Learning-based Object Pose Estimator [51.89441403642665]
6D pose estimation of rigid objects is a long-standing and challenging task in computer vision. Recently, the emergence of deep learning reveals the potential of Convolutional Neural Networks (CNNs) to predict reliable 6D poses. This paper introduces a fully learning-based object pose estimator.
arXiv Detail & Related papers (2021-02-24T09:11:31Z)
Spatial Attention Improves Iterative 6D Object Pose Estimation [52.365075652976735]
We propose a new method for 6D pose estimation refinement from RGB images. Our main insight is that after the initial pose estimate, it is important to pay attention to distinct spatial features of the object. We experimentally show that this approach learns to attend to salient spatial features and learns to ignore occluded parts of the object, leading to better pose estimation across datasets.
arXiv Detail & Related papers (2021-01-05T17:18:52Z)
Robust Consistent Video Depth Estimation [65.53308117778361]
We present an algorithm for estimating consistent dense depth maps and camera poses from a monocular video. Our algorithm combines two complementary techniques: (1) flexible deformation-splines for low-frequency large-scale alignment and (2) geometry-aware depth filtering for high-frequency alignment of fine depth details. In contrast to prior approaches, our method does not require camera poses as input and achieves robust reconstruction for challenging hand-held cell phone captures containing a significant amount of noise, shake, motion blur, and rolling shutter deformations.
arXiv Detail & Related papers (2020-12-10T18:59:48Z)
Category Level Object Pose Estimation via Neural Analysis-by-Synthesis [64.14028598360741]
In this paper we combine a gradient-based fitting procedure with a parametric neural image synthesis module. The image synthesis network is designed to efficiently span the pose configuration space. We experimentally show that the method can recover orientation of objects with high accuracy from 2D images alone.
arXiv Detail & Related papers (2020-08-18T20:30:47Z)
PrimA6D: Rotational Primitive Reconstruction for Enhanced and Robust 6D Pose Estimation [11.873744190924599]
We introduce a rotational primitive prediction based 6D object pose estimation using a single image as an input. We leverage a Variational AutoEncoder (VAE) to learn this underlying primitive and its associated keypoints. When evaluated over public datasets, our method yields a notable improvement over LINEMOD, Occlusion LINEMOD, and the Y-induced dataset.
arXiv Detail & Related papers (2020-06-14T03:55:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.