Visual Descriptor Learning from Monocular Video
- URL: http://arxiv.org/abs/2004.07007v1
- Date: Wed, 15 Apr 2020 11:19:57 GMT
- Title: Visual Descriptor Learning from Monocular Video
- Authors: Umashankar Deekshith, Nishit Gajjar, Max Schwarz, Sven Behnke
- Abstract summary: We propose a novel way to estimate dense correspondence on an RGB image by training a fully convolutional network.
Our method learns from RGB videos using contrastive loss, where relative labeling is estimated from optical flow.
Not only does the method perform well on test data with the same background, it also generalizes to situations with a new background.
- Score: 25.082587246288995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Correspondence estimation is one of the most widely researched and
yet only partially solved areas of computer vision, with many applications in
tracking, mapping, and the recognition of objects and environments. In this paper, we propose a
novel way to estimate dense correspondence on an RGB image where visual
descriptors are learned from video examples by training a fully convolutional
network. Most deep learning methods solve this either by training the network
on a large set of expensive labeled data or by generating labels through strong
3D generative models applied to RGB-D videos. Our method instead learns from RGB
videos using a contrastive loss, where relative labeling is estimated from optical flow. We
demonstrate the functionality in a quantitative analysis on rendered videos,
where ground truth information is available. Not only does the method perform
well on test data with the same background, it also generalizes to situations
with a new background. The descriptors learned are unique and the
representations determined by the network are global. We further show the
applicability of the method to real-world videos.
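As a rough illustration of the training signal described in the abstract, the sketch below pairs pixels across two frames via optical flow and applies a margin-based contrastive loss. It is a minimal PyTorch sketch under assumed conventions (descriptor maps of shape (C, H, W); flow channel 0 is the x displacement, channel 1 the y displacement); all function and variable names are illustrative, not taken from the paper.

```python
# Minimal sketch of a pixelwise contrastive loss where positive pairs
# come from optical flow (illustrative; not the authors' code).
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(desc_a, desc_b, flow, margin=0.5, n_pairs=256):
    """desc_a, desc_b: (C, H, W) descriptor maps for two video frames.
    flow: (2, H, W) optical flow mapping frame-A pixels into frame B."""
    C, H, W = desc_a.shape
    # Sample random source pixels in frame A.
    ys = torch.randint(0, H, (n_pairs,))
    xs = torch.randint(0, W, (n_pairs,))
    # The flow gives each pixel's match in frame B (the positives).
    xt = (xs + flow[0, ys, xs]).round().long().clamp(0, W - 1)
    yt = (ys + flow[1, ys, xs]).round().long().clamp(0, H - 1)
    d_a = desc_a[:, ys, xs].t()        # (n_pairs, C) anchor descriptors
    d_pos = desc_b[:, yt, xt].t()      # matched descriptors
    # Random pixels in frame B serve as (likely) non-matches.
    yn = torch.randint(0, H, (n_pairs,))
    xn = torch.randint(0, W, (n_pairs,))
    d_neg = desc_b[:, yn, xn].t()
    pos = F.pairwise_distance(d_a, d_pos)
    neg = F.pairwise_distance(d_a, d_neg)
    # Pull matches together; push non-matches beyond the margin.
    return pos.pow(2).mean() + F.relu(margin - neg).pow(2).mean()
```

In training, desc_a and desc_b would be the fully convolutional network's outputs for two frames of the same clip.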
Related papers
- ViDaS Video Depth-aware Saliency Network [40.08270905030302]
We introduce ViDaS, a two-stream, fully convolutional Video, Depth-Aware Saliency network.
It addresses the problem of attention modeling "in-the-wild" via saliency prediction in videos.
The network consists of two visual streams: one for the RGB frames and one for the depth frames.
It is trained end-to-end and is evaluated in a variety of different databases with eye-tracking data.
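As a loose sketch of the described layout, the block below wires an RGB stream and a depth stream into a single fully convolutional saliency head. Layer widths, fusion by channel concatenation, and all names are placeholder assumptions, not the published ViDaS architecture.

```python
# Illustrative two-stream fully convolutional layout (RGB + depth
# streams fused into one saliency map); sizes are placeholders.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class TwoStreamSaliency(nn.Module):
    def __init__(self):
        super().__init__()
        self.rgb_stream = nn.Sequential(conv_block(3, 32), conv_block(32, 64))
        self.depth_stream = nn.Sequential(conv_block(1, 32), conv_block(32, 64))
        # Fuse by channel concatenation, then predict a 1-channel map.
        self.head = nn.Sequential(conv_block(128, 64), nn.Conv2d(64, 1, 1))

    def forward(self, rgb, depth):
        f = torch.cat([self.rgb_stream(rgb), self.depth_stream(depth)], dim=1)
        return torch.sigmoid(self.head(f))   # per-pixel saliency in [0, 1]
```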
arXiv Detail & Related papers (2023-05-19T15:04:49Z) - Self-Supervised Learning of Object Segmentation from Unlabeled RGB-D Videos [11.40098981859033]
This work proposes a self-supervised learning system for segmenting rigid objects in RGB images.
The proposed pipeline is trained on unlabeled RGB-D videos of static objects, which can be captured with a camera carried by a mobile robot.
arXiv Detail & Related papers (2023-04-09T23:13:39Z) - ALSO: Automotive Lidar Self-supervision by Occupancy estimation [70.70557577874155]
We propose a new self-supervised method for pre-training the backbone of deep perception models operating on point clouds.
The core idea is to train the model on a pretext task which is the reconstruction of the surface on which the 3D points are sampled.
The intuition is that if the network is able to reconstruct the scene surface, given only sparse input points, then it probably also captures some fragments of semantic information.
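The pretext task could be instantiated roughly as below: a small head, attached to a point-cloud backbone, classifies sampled query points as occupied or free and is trained with binary cross-entropy. This is a hedged sketch of occupancy estimation in general; the head structure, feature pooling, and names are assumptions, not the ALSO implementation.

```python
# Hedged sketch of an occupancy-style pretext objective: a small MLP
# head classifies 3D query points given pooled backbone features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OccupancyHead(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, scene_feat, query_xyz):
        # scene_feat: (B, feat_dim) pooled point-cloud backbone features,
        # query_xyz: (B, N, 3) query points sampled around the surface.
        B, N, _ = query_xyz.shape
        f = scene_feat.unsqueeze(1).expand(B, N, -1)
        return self.mlp(torch.cat([f, query_xyz], dim=-1)).squeeze(-1)

def pretext_loss(logits, occ_labels):
    # occ_labels: 1 if the query point lies inside/behind the surface.
    return F.binary_cross_entropy_with_logits(logits, occ_labels.float())
```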
arXiv Detail & Related papers (2022-12-12T13:10:19Z) - Pixel-level Correspondence for Self-Supervised Learning from Video [56.24439897867531]
Pixel-level Correspondence (PiCo) is a method for dense contrastive learning from video.
We validate PiCo on standard benchmarks, outperforming self-supervised baselines on multiple dense prediction tasks.
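Assuming PiCo's dense contrastive objective follows the common InfoNCE form over matched pixel descriptors (an assumption; the paper may differ in detail), a minimal version looks like this:

```python
# Dense InfoNCE-style objective over pixel descriptors (illustrative
# formulation; variable names are not from the paper).
import torch
import torch.nn.functional as F

def dense_info_nce(anchor, positive, negatives, temperature=0.07):
    """anchor, positive: (N, C) matched pixel descriptors.
    negatives: (M, C) descriptors of non-matching pixels."""
    anchor = F.normalize(anchor, dim=1)
    positive = F.normalize(positive, dim=1)
    negatives = F.normalize(negatives, dim=1)
    l_pos = (anchor * positive).sum(dim=1, keepdim=True)   # (N, 1)
    l_neg = anchor @ negatives.t()                          # (N, M)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    # The positive sits at index 0 of each row.
    labels = torch.zeros(len(anchor), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)
```

Compared with a margin loss, InfoNCE normalizes each positive pair against a whole bank of negatives at once.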
arXiv Detail & Related papers (2022-07-08T12:50:13Z) - Semantic keypoint-based pose estimation from single RGB frames [64.80395521735463]
We present an approach to estimating the continuous 6-DoF pose of an object from a single RGB image.
The approach combines semantic keypoints predicted by a convolutional network (convnet) with a deformable shape model.
We show that our approach can accurately recover the 6-DoF object pose for both instance- and class-based scenarios.
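A generic way to turn predicted 2D keypoints into a 6-DoF pose is PnP against the model's 3D keypoints; the paper additionally fits a deformable shape model, which the plain OpenCV sketch below does not attempt.

```python
# Generic keypoints-to-pose recovery via PnP (standard OpenCV usage,
# not the paper's deformable-shape formulation).
import numpy as np
import cv2

def pose_from_keypoints(model_pts_3d, keypoints_2d, K):
    """model_pts_3d: (N, 3) keypoint positions on the object model.
    keypoints_2d: (N, 2) convnet-predicted image locations.
    K: (3, 3) camera intrinsics."""
    ok, rvec, tvec = cv2.solvePnP(
        model_pts_3d.astype(np.float32),
        keypoints_2d.astype(np.float32),
        K.astype(np.float32),
        distCoeffs=None,
        flags=cv2.SOLVEPNP_EPNP,
    )
    R, _ = cv2.Rodrigues(rvec)   # rotation vector -> 3x3 matrix
    return ok, R, tvec
```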
arXiv Detail & Related papers (2022-04-12T15:03:51Z) - Cloud based Scalable Object Recognition from Video Streams using Orientation Fusion and Convolutional Neural Networks [11.44782606621054]
Convolutional neural networks (CNNs) have been widely used to perform intelligent visual object recognition.
CNNs still suffer from severe accuracy degradation, particularly on illumination-variant datasets.
We propose a new CNN method based on orientation fusion for visual object recognition.
arXiv Detail & Related papers (2021-06-19T07:15:15Z) - Few-Shot Learning for Video Object Detection in a Transfer-Learning Scheme [70.45901040613015]
We study the new problem of few-shot learning for video object detection.
We employ a transfer-learning framework to effectively train the video object detector on a large number of base-class objects and a few video clips of novel-class objects.
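One common recipe consistent with this summary (though not necessarily the paper's exact scheme) is to freeze the backbone trained on base classes and fine-tune only the detection head on the few novel-class clips; `detector` below is a placeholder module assumed to return a scalar training loss.

```python
# Assumed transfer recipe: freeze base-class backbone, tune the head.
import torch

def finetune_on_novel(detector, novel_loader, lr=1e-4, steps=500):
    for p in detector.backbone.parameters():
        p.requires_grad = False           # keep base-class features fixed
    opt = torch.optim.SGD(detector.head.parameters(), lr=lr, momentum=0.9)
    detector.train()
    it = iter(novel_loader)
    for _ in range(steps):
        try:
            frames, targets = next(it)
        except StopIteration:             # cycle over the few clips
            it = iter(novel_loader)
            frames, targets = next(it)
        loss = detector(frames, targets)  # assumed scalar training loss
        opt.zero_grad()
        loss.backward()
        opt.step()
```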
arXiv Detail & Related papers (2021-03-26T20:37:55Z) - LCD -- Line Clustering and Description for Place Recognition [29.053923938306323]
We introduce a novel learning-based approach to place recognition, using RGB-D cameras and line clusters as visual and geometric features.
We present a neural network architecture based on the attention mechanism for frame-wise line clustering.
A similar neural network describes these clusters with a compact embedding of 128 floating-point numbers.
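A rough sketch of such a descriptor head, assuming attention pooling over per-line features followed by an L2-normalized 128-dimensional projection (dimensions and structure are guesses, not the published LCD network):

```python
# Attention-pooled cluster descriptor compressed to 128 floats
# (illustrative assumptions throughout).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClusterDescriptor(nn.Module):
    def __init__(self, line_feat_dim=32, embed_dim=128):
        super().__init__()
        self.attn = nn.MultiheadAttention(line_feat_dim, num_heads=4,
                                          batch_first=True)
        self.proj = nn.Linear(line_feat_dim, embed_dim)

    def forward(self, line_feats):
        # line_feats: (B, L, line_feat_dim), one row per line in a cluster.
        attended, _ = self.attn(line_feats, line_feats, line_feats)
        pooled = attended.mean(dim=1)                 # aggregate the cluster
        return F.normalize(self.proj(pooled), dim=1)  # unit-norm 128-d
```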
arXiv Detail & Related papers (2020-10-21T09:52:47Z) - Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics [74.6968179473212]
This paper proposes a novel pretext task to address the self-supervised learning problem.
We compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion.
A neural network is built and trained to yield the statistical summaries given the video frames as inputs.
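One plausible reading of these targets, sketched with OpenCV's Farneback flow: locate the block with the largest average motion and report its position and mean direction. Block size, flow parameters, and the crude angular mean are all illustrative choices, not the paper's exact recipe.

```python
# Illustrative computation of two pretext targets: location and
# dominant direction of the largest motion between two frames.
import numpy as np
import cv2

def motion_statistics(prev_gray, next_gray, block=16):
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    H, W = mag.shape
    # Find the block with the largest average motion magnitude.
    best, best_yx = -1.0, (0, 0)
    for y in range(0, H - block + 1, block):
        for x in range(0, W - block + 1, block):
            m = mag[y:y+block, x:x+block].mean()
            if m > best:
                best, best_yx = m, (y, x)
    y, x = best_yx
    # Crude arithmetic mean of angles; a circular mean would handle
    # wrap-around at 2*pi more gracefully.
    dominant_dir = ang[y:y+block, x:x+block].mean()   # radians
    return best_yx, dominant_dir
```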
arXiv Detail & Related papers (2020-08-31T08:31:56Z) - Single Image Depth Estimation Trained via Depth from Defocus Cues [105.67073923825842]
Estimating depth from a single RGB image is a fundamental task in computer vision.
In this work, we rely on depth-from-defocus cues instead of on different views.
We present results that are on par with supervised methods on KITTI and Make3D datasets and outperform unsupervised learning approaches.
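Depth-from-defocus rests on the standard thin-lens relation between scene depth and blur size; as a reference, this is textbook optics, not code from the paper:

```python
# Thin-lens circle-of-confusion model underlying defocus cues.
def circle_of_confusion(depth, focus_dist, focal_len, f_number):
    """All lengths in metres; returns blur diameter on the sensor.
    Valid for focus_dist > focal_len."""
    aperture = focal_len / f_number
    # Classic thin-lens CoC: c = A * |d - s| / d * f / (s - f),
    # with d the scene depth and s the focus distance.
    return (aperture * abs(depth - focus_dist) / depth
            * focal_len / (focus_dist - focal_len))
```

Points away from the focus distance produce larger blur diameters, which is the monocular cue such methods exploit.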
arXiv Detail & Related papers (2020-01-14T20:22:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.