Video based Object 6D Pose Estimation using Transformers
- URL: http://arxiv.org/abs/2210.13540v1
- Date: Mon, 24 Oct 2022 18:45:53 GMT
- Title: Video based Object 6D Pose Estimation using Transformers
- Authors: Apoorva Beedu, Huda Alamri, Irfan Essa
- Abstract summary: VideoPose is an end-to-end attention based modelling architecture that attends to previous frames in order to estimate 6D Object Poses in videos.
Our architecture is able to capture and reason from long-range dependencies efficiently, thus iteratively refining over video sequences.
Our approach is on par with the state-of-the-art Transformer methods, and performs significantly better relative to CNN based approaches.
- Score: 6.951360830202521
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a Transformer based 6D Object Pose Estimation framework
VideoPose, comprising an end-to-end attention based modelling architecture,
that attends to previous frames in order to estimate accurate 6D Object Poses
in videos. Our approach leverages the temporal information from a video
sequence for pose refinement, along with being computationally efficient and
robust. Compared to existing methods, our architecture is able to capture and
reason from long-range dependencies efficiently, thus iteratively refining over
video sequences. Experimental evaluation on the YCB-Video dataset shows that
our approach is on par with the state-of-the-art Transformer methods, and
performs significantly better relative to CNN based approaches. Further, with a
speed of 33 fps, it is also more efficient and therefore applicable to a
variety of applications that require real-time object pose estimation. Training
code and pretrained models are available at
https://github.com/ApoorvaBeedu/VideoPose
Related papers
- KRONC: Keypoint-based Robust Camera Optimization for 3D Car Reconstruction [58.04846444985808]
This paper introduces KRONC, a novel approach aimed at inferring view poses by leveraging prior knowledge about the object to reconstruct and its representation through semantic keypoints.
With a focus on vehicle scenes, KRONC is able to estimate the position of the views as a solution to a light optimization problem targeting the convergence of keypoints' back-projections to a singular point.
arXiv Detail & Related papers (2024-09-09T08:08:05Z) - FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects [55.77542145604758]
FoundationPose is a unified foundation model for 6D object pose estimation and tracking.
Our approach can be instantly applied at test-time to a novel object without fine-tuning.
arXiv Detail & Related papers (2023-12-13T18:28:09Z) - PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point
Tracking [90.29143475328506]
We introduce PointOdyssey, a large-scale synthetic dataset, and data generation framework.
Our goal is to advance the state-of-the-art by placing emphasis on long videos with naturalistic motion.
We animate deformable characters using real-world motion capture data, we build 3D scenes to match the motion capture environments, and we render camera viewpoints using trajectories mined via structure-from-motion on real videos.
arXiv Detail & Related papers (2023-07-27T17:58:11Z) - YOLOPose V2: Understanding and Improving Transformer-based 6D Pose
Estimation [36.067414358144816]
YOLOPose is a Transformer-based multi-object 6D pose estimation method.
We employ a learnable orientation estimation module to predict the orientation from the keypoints.
Our method is suitable for real-time applications and achieves results comparable to state-of-the-art methods.
arXiv Detail & Related papers (2023-07-21T12:53:54Z) - MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare [84.80956484848505]
MegaPose is a method to estimate the 6D pose of novel objects, that is, objects unseen during training.
We present a 6D pose refiner based on a render&compare strategy which can be applied to novel objects.
Second, we introduce a novel approach for coarse pose estimation which leverages a network trained to classify whether the pose error between a synthetic rendering and an observed image of the same object can be corrected by the refiner.
arXiv Detail & Related papers (2022-12-13T19:30:03Z) - Patch-based Object-centric Transformers for Efficient Video Generation [71.55412580325743]
We present Patch-based Object-centric Video Transformer (POVT), a novel region-based video generation architecture.
We build upon prior work in video prediction via an autoregressive transformer over the discrete latent space of compressed videos.
Due to better compressibility of object-centric representations, we can improve training efficiency by allowing the model to only access object information for longer horizon temporal information.
arXiv Detail & Related papers (2022-06-08T16:29:59Z) - FvOR: Robust Joint Shape and Pose Optimization for Few-view Object
Reconstruction [37.81077373162092]
Reconstructing an accurate 3D object model from a few image observations remains a challenging problem in computer vision.
We present FvOR, a learning-based object reconstruction method that predicts accurate 3D models given a few images with noisy input poses.
arXiv Detail & Related papers (2022-05-16T15:39:27Z) - VideoPose: Estimating 6D object pose from videos [14.210010379733017]
We introduce a simple yet effective algorithm that uses convolutional neural networks to directly estimate object poses from videos.
Our proposed network takes a pre-trained 2D object detector as input, and aggregates visual features through a recurrent neural network to make predictions at each frame.
Experimental evaluation on the YCB-Video dataset show that our approach is on par with the state-of-the-art algorithms.
arXiv Detail & Related papers (2021-11-20T20:57:45Z) - T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression [40.90172673391803]
T6D-Direct is a real-time single-stage direct method with a transformer-based architecture built on DETR to perform 6D multi-object pose direct estimation.
Our method achieves the fastest inference time, and the pose estimation accuracy is comparable to state-of-the-art methods.
arXiv Detail & Related papers (2021-09-22T18:13:33Z) - Self-Attentive 3D Human Pose and Shape Estimation from Videos [82.63503361008607]
We present a video-based learning algorithm for 3D human pose and shape estimation.
We exploit temporal information in videos and propose a self-attention module.
We evaluate our method on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets.
arXiv Detail & Related papers (2021-03-26T00:02:19Z) - PrimA6D: Rotational Primitive Reconstruction for Enhanced and Robust 6D
Pose Estimation [11.873744190924599]
We introduce a rotational primitive prediction based 6D object pose estimation using a single image as an input.
We leverage a Variational AutoEncoder (VAE) to learn this underlying primitive and its associated keypoints.
When evaluated over public datasets, our method yields a notable improvement over LINEMOD, Occlusion LINEMOD, and the Y-induced dataset.
arXiv Detail & Related papers (2020-06-14T03:55:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.