VideoPose: Estimating 6D object pose from videos
- URL: http://arxiv.org/abs/2111.10677v1
- Date: Sat, 20 Nov 2021 20:57:45 GMT
- Title: VideoPose: Estimating 6D object pose from videos
- Authors: Apoorva Beedu, Zhile Ren, Varun Agrawal, Irfan Essa
- Abstract summary: We introduce a simple yet effective algorithm that uses convolutional neural networks to directly estimate object poses from videos.
Our proposed network takes detections from a pre-trained 2D object detector as input, and aggregates visual features through a recurrent neural network to make predictions at each frame.
Experimental evaluation on the YCB-Video dataset shows that our approach is on par with state-of-the-art algorithms.
- Score: 14.210010379733017
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a simple yet effective algorithm that uses convolutional neural
networks to directly estimate object poses from videos. Our approach leverages
the temporal information from a video sequence, and is computationally
efficient and robust to support robotic and AR domains. Our proposed network
takes detections from a pre-trained 2D object detector as input, and aggregates visual features
through a recurrent neural network to make predictions at each frame.
Experimental evaluation on the YCB-Video dataset shows that our approach is on
par with the state-of-the-art algorithms. Further, with a speed of 30 fps, it
is also more efficient than the state-of-the-art, and therefore applicable to a
variety of applications that require real-time object pose estimation.
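A minimal PyTorch sketch of the pipeline the abstract describes — per-frame features (standing in for features pooled from the 2D detector's output) aggregated by a recurrent network into per-frame pose predictions. The backbone, layer sizes, and names such as `VideoPoseSketch` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VideoPoseSketch(nn.Module):
    """Illustrative only: per-frame CNN features (a stand-in for features
    pooled from a pre-trained 2D detector's crops) are aggregated by a GRU,
    and two heads regress translation and rotation at every frame."""

    def __init__(self, feat_dim=512, hidden_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(               # toy feature extractor
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.trans_head = nn.Linear(hidden_dim, 3)   # x, y, z
        self.rot_head = nn.Linear(hidden_dim, 4)     # unit quaternion

    def forward(self, clips):                        # clips: (B, T, 3, H, W)
        B, T = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1)).view(B, T, -1)
        h, _ = self.rnn(feats)                       # temporal aggregation
        quat = nn.functional.normalize(self.rot_head(h), dim=-1)
        return self.trans_head(h), quat              # a pose at every frame

trans, rot = VideoPoseSketch()(torch.randn(2, 8, 3, 64, 64))
```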
Related papers
- Semi-supervised 3D Video Information Retrieval with Deep Neural Network and Bi-directional Dynamic-time Warping Algorithm [14.39527406033429]
The proposed algorithm is designed to handle large video datasets and retrieve the videos most related to a given query video clip.
We split both the candidate and the query videos into sequences of clips and convert each clip to a representation vector using an autoencoder-backed deep neural network.
We then calculate a similarity measure between the sequences of embedding vectors using a bi-directional dynamic time-warping method.
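To make the retrieval step concrete, here is a NumPy sketch of dynamic time warping over clip-embedding sequences and the resulting ranking; the summary does not specify the bi-directional variant, so plain DTW stands in for it, and `dtw_distance` / `retrieve` are hypothetical names.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW distance between two sequences of clip embeddings,
    a of shape (n, d) and b of shape (m, d), with Euclidean step cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def retrieve(query, candidates):
    """Rank candidate videos (each an array of clip embeddings) by DTW
    distance to the query video's clips; lower distance = more related."""
    return sorted(range(len(candidates)),
                  key=lambda k: dtw_distance(query, candidates[k]))
```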
arXiv Detail & Related papers (2023-09-03T03:10:18Z)
- Uncertainty Aware Active Learning for Reconfiguration of Pre-trained Deep Object-Detection Networks for New Target Domains [0.0]
Object detection is one of the most important and fundamental aspects of computer vision tasks.
To obtain training data for object detection models efficiently, many datasets opt to collect their unannotated data in video format.
Annotating every frame from a video is costly and inefficient since many frames contain very similar information for the model to learn from.
In this paper, we propose a novel active learning algorithm for object detection models to tackle this problem.
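The summary leaves the acquisition function unspecified; this sketch shows generic uncertainty sampling over per-frame detector confidences, with `select_frames_for_annotation` and its scoring rule being assumptions rather than the paper's method.

```python
import numpy as np

def select_frames_for_annotation(frame_confidences, budget):
    """Generic uncertainty sampling: frame_confidences[i] holds the
    detection confidences for unlabeled frame i; frames whose least
    certain detection is closest to 0.5 are queried for annotation first."""
    uncertainty = [
        max(1.0 - abs(2.0 * c - 1.0) for c in confs) if confs else 0.0
        for confs in frame_confidences
    ]
    return np.argsort(uncertainty)[::-1][:budget]  # most uncertain first

# Hypothetical usage: three frames, budget for one annotation.
picked = select_frames_for_annotation([[0.97], [0.55, 0.9], [0.8]], budget=1)
```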
arXiv Detail & Related papers (2023-03-22T17:14:10Z)
- Dyna-DepthFormer: Multi-frame Transformer for Self-Supervised Depth Estimation in Dynamic Scenes [19.810725397641406]
We propose a novel Dyna-Depthformer framework, which predicts scene depth and 3D motion field jointly.
Our contributions are two-fold. First, we leverage multi-view correlation through a series of self- and cross-attention layers to obtain enhanced depth feature representations.
Second, we propose a warping-based Motion Network to estimate the motion field of dynamic objects without using semantic priors.
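A simplified PyTorch sketch of the multi-view attention idea: reference-frame features cross-attend to source-frame features, then a self-attention step refines the result. Layer counts and dimensions are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class MultiViewDepthAttention(nn.Module):
    """Illustrative only: cross-attention from reference-frame features to
    source-frame features, followed by a self-attention refinement step."""

    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, ref, src):        # (B, N, dim) flattened feature maps
        x = self.norm1(ref + self.cross(ref, src, src)[0])
        return self.norm2(x + self.self_attn(x, x, x)[0])

out = MultiViewDepthAttention()(torch.randn(1, 196, 128), torch.randn(1, 196, 128))
```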
arXiv Detail & Related papers (2023-01-14T09:43:23Z)
- Deep Learning Computer Vision Algorithms for Real-time UAVs On-board Camera Image Processing [77.34726150561087]
This paper describes how advanced deep learning based computer vision algorithms are applied to enable real-time on-board sensor processing for small UAVs.
All algorithms have been developed using state-of-the-art image processing methods based on deep neural networks.
arXiv Detail & Related papers (2022-11-02T11:10:42Z)
- Video based Object 6D Pose Estimation using Transformers [6.951360830202521]
VideoPose is an end-to-end, attention-based architecture that attends to previous frames in order to estimate 6D object poses in videos.
Our architecture captures and reasons over long-range dependencies efficiently, iteratively refining its estimates over video sequences.
Our approach is on par with state-of-the-art Transformer methods, and performs significantly better than CNN-based approaches.
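One way to realize "attends to previous frames" is a causal attention mask, sketched below in PyTorch; the head layout and sizes are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TemporalPoseAttention(nn.Module):
    """Illustrative only: a causal mask restricts each frame's query to the
    current and earlier frames, so pose estimates can be refined as more
    of the video is observed."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pose_head = nn.Linear(dim, 7)       # translation + quaternion

    def forward(self, frame_feats):              # (B, T, dim)
        T = frame_feats.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        ctx, _ = self.attn(frame_feats, frame_feats, frame_feats,
                           attn_mask=causal)     # True entries are blocked
        return self.pose_head(ctx)               # (B, T, 7) per-frame poses

poses = TemporalPoseAttention()(torch.randn(2, 16, 256))
```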
arXiv Detail & Related papers (2022-10-24T18:45:53Z)
- Differentiable Frequency-based Disentanglement for Aerial Video Action Recognition [56.91538445510214]
We present a learning algorithm for human activity recognition in videos.
Our approach is designed for UAV videos, which are mainly acquired from obliquely placed dynamic cameras.
We conduct extensive experiments on the UAV Human dataset and the NEC Drone dataset.
arXiv Detail & Related papers (2022-09-15T22:16:52Z)
- FvOR: Robust Joint Shape and Pose Optimization for Few-view Object Reconstruction [37.81077373162092]
Reconstructing an accurate 3D object model from a few image observations remains a challenging problem in computer vision.
We present FvOR, a learning-based object reconstruction method that predicts accurate 3D models given a few images with noisy input poses.
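In the spirit of joint shape-and-pose optimization, a generic PyTorch sketch: shape parameters and the noisy per-view poses are both treated as learnable and refined against a shared consistency objective. `joint_refine` and the placeholder `loss_fn` are assumptions; FvOR's actual learned update steps are not reproduced here.

```python
import torch

def joint_refine(shape, poses, loss_fn, steps=50, lr=1e-2):
    """Generic refinement: both the shape parameters and the noisy
    per-view poses receive gradient updates from a shared
    reprojection/consistency loss supplied by the caller."""
    shape = shape.clone().requires_grad_(True)
    poses = poses.clone().requires_grad_(True)
    opt = torch.optim.Adam([shape, poses], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(shape, poses).backward()
        opt.step()
    return shape.detach(), poses.detach()

# Hypothetical usage with a toy quadratic objective.
s, p = joint_refine(torch.randn(32), torch.randn(4, 6),
                    lambda s, p: (s ** 2).sum() + (p ** 2).sum())
```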
arXiv Detail & Related papers (2022-05-16T15:39:27Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
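A minimal PyTorch sketch of a co-attention fusion between low-level and high-level features: an affinity matrix yields attention in both directions. Channel sizes and the single projection are illustrative, not the paper's exact module.

```python
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    """Illustrative only: an affinity matrix between the two feature maps
    lets low-level features attend to high-level ones and vice versa."""

    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, low, high):                  # (B, N, dim), (B, M, dim)
        affinity = self.proj(low) @ high.transpose(1, 2)          # (B, N, M)
        low2high = affinity.softmax(dim=-1) @ high                # (B, N, dim)
        high2low = affinity.softmax(dim=1).transpose(1, 2) @ low  # (B, M, dim)
        return low + low2high, high + high2low

low_f, high_f = CoAttentionFusion()(torch.randn(1, 100, 64), torch.randn(1, 100, 64))
```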
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- Self-Attentive 3D Human Pose and Shape Estimation from Videos [82.63503361008607]
We present a video-based learning algorithm for 3D human pose and shape estimation.
We exploit temporal information in videos and propose a self-attention module.
We evaluate our method on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets.
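One common form such a self-attention module takes is attention-weighted temporal pooling, sketched below; the scoring network and sizes are assumptions rather than the paper's design.

```python
import torch
import torch.nn as nn

class TemporalAttentionPooling(nn.Module):
    """Illustrative only: learned scores weight each frame's features
    before pooling, so informative frames dominate the final pose and
    shape estimate."""

    def __init__(self, dim=2048):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 256), nn.Tanh(),
                                   nn.Linear(256, 1))

    def forward(self, feats):                 # (B, T, dim) frame features
        w = self.score(feats).softmax(dim=1)  # per-frame attention weights
        return (w * feats).sum(dim=1)         # (B, dim) pooled feature

pooled = TemporalAttentionPooling()(torch.randn(2, 16, 2048))
```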
arXiv Detail & Related papers (2021-03-26T00:02:19Z)
- Fast Motion Understanding with Spatiotemporal Neural Networks and Dynamic Vision Sensors [99.94079901071163]
This paper presents a Dynamic Vision Sensor (DVS) based system for reasoning about high speed motion.
We consider the case of a robot at rest reacting to a small, fast-approaching object moving at speeds higher than 15 m/s.
We highlight the results of our system on a toy dart moving at 23.4 m/s: a 24.73° error in $\theta$, an 18.4 mm average discretized-radius prediction error, and a 25.03% median time-to-collision prediction error.
arXiv Detail & Related papers (2020-11-18T17:55:07Z)
- A Real-time Action Representation with Temporal Encoding and Deep Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while maintaining high processing speed.
Our method achieves clear improvements over state-of-the-art real-time methods on the UCF101 action recognition benchmark: 5.4% higher accuracy and 2x faster inference, with a model smaller than 5 MB.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.