S3K: Self-Supervised Semantic Keypoints for Robotic Manipulation via Multi-View Consistency
- URL: http://arxiv.org/abs/2009.14711v2
- Date: Tue, 13 Oct 2020 10:42:41 GMT
- Title: S3K: Self-Supervised Semantic Keypoints for Robotic Manipulation via Multi-View Consistency
- Authors: Mel Vecerik, Jean-Baptiste Regli, Oleg Sushkov, David Barker, Rugile
Pevceviciute, Thomas Rothörl, Christopher Schuster, Raia Hadsell, Lourdes
Agapito, Jonathan Scholz
- Abstract summary: We advocate semantic 3D keypoints as a visual representation, and present a semi-supervised training objective.
Unlike local texture-based approaches, our model integrates contextual information from a large area.
We demonstrate that this ability to locate semantic keypoints enables high-level scripting of human-understandable behaviours.
- Score: 11.357804868755155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A robot's ability to act is fundamentally constrained by what it can
perceive. Many existing approaches to visual representation learning utilize
general-purpose training criteria, e.g. image reconstruction, smoothness in
latent space, or usefulness for control, or else make use of large datasets
annotated with specific features (bounding boxes, segmentations, etc.).
However, both approaches often struggle to capture the fine detail required for
precision tasks on specific objects, e.g. grasping and mating a plug and
socket. We argue that these difficulties arise from a lack of geometric
structure in these models. In this work we advocate semantic 3D keypoints as a
visual representation, and present a semi-supervised training objective that
can allow instance or category-level keypoints to be trained to 1-5
millimeter-accuracy with minimal supervision. Furthermore, unlike local
texture-based approaches, our model integrates contextual information from a
large area and is therefore robust to occlusion, noise, and lack of discernible
texture. We demonstrate that this ability to locate semantic keypoints enables
high-level scripting of human-understandable behaviours. Finally, we show that
these keypoints provide a good way to define reward functions for reinforcement
learning and are a good representation for training agents.
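As a rough illustration of the multi-view consistency idea behind S3K (a minimal sketch under our own assumptions, not the authors' released code), the snippet below back-projects each camera view's predicted 2D keypoint and depth into world coordinates and penalises disagreement between the per-view 3D estimates; the small helper at the end shows how two localized keypoints could define a shaped reward for reinforcement learning. The function names, the camera convention (T_world_cam maps camera coordinates to world coordinates), and the reward scale are illustrative assumptions.

```python
# Sketch of a multi-view consistency objective for semantic keypoints.
# Assumes each view predicts a pixel location (u, v) and a depth for the
# same semantic keypoint, and that camera intrinsics/extrinsics are known.
import numpy as np

def backproject(uv, depth, K, T_world_cam):
    """Lift pixel (u, v) at the given depth into world coordinates."""
    u, v = uv
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    p_cam = np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth, 1.0])
    return (T_world_cam @ p_cam)[:3]

def multiview_consistency_loss(preds, cameras):
    """preds: list of (uv, depth) per view; cameras: list of (K, T_world_cam).
    The loss is the mean squared deviation of the per-view 3D estimates from
    their mean, which vanishes only when all views agree on one 3D point."""
    points = np.stack([backproject(uv, d, K, T)
                       for (uv, d), (K, T) in zip(preds, cameras)])
    center = points.mean(axis=0)
    return float(np.mean(np.sum((points - center) ** 2, axis=1)))

def keypoint_reward(kp_a, kp_b, scale=100.0):
    """Illustrative shaped reward that grows as two semantic keypoints
    (e.g. plug tip and socket opening) are brought together; `scale` is an
    arbitrary factor converting metres into a convenient reward range."""
    return -scale * float(np.linalg.norm(np.asarray(kp_a) - np.asarray(kp_b)))
```

In practice the consistency term would be combined with a small amount of labelled supervision and differentiated through the keypoint predictor, but the geometric core is simply the agreement of back-projected points sketched above.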
Related papers
- Keypoint Abstraction using Large Models for Object-Relative Imitation Learning [78.92043196054071]
Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics.
Keypoint-based representations have proven effective as a succinct way to capture essential object features.
We propose KALM, a framework that leverages large pre-trained vision-language models to automatically generate task-relevant and cross-instance consistent keypoints.
arXiv Detail & Related papers (2024-10-30T17:37:31Z)
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
- Generalized Label-Efficient 3D Scene Parsing via Hierarchical Feature Aligned Pre-Training and Region-Aware Fine-tuning [55.517000360348725]
This work presents a framework for 3D scene understanding when labeled scenes are scarce.
To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy.
Experiments with both indoor and outdoor scenes demonstrated the effectiveness of our approach in both data-efficient learning and open-world few-shot learning.
arXiv Detail & Related papers (2023-12-01T15:47:04Z)
- Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning [0.0]
We propose a novel pre-training framework by adopting both self-supervised and supervised visual pretext tasks in a multi-task manner.
Results show that our pre-trained models can deliver results on par with or better than state-of-the-art (SOTA) methods on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z)
- 3D Human Keypoints Estimation From Point Clouds in the Wild Without Human Labels [78.69095161350059]
GC-KPL is an approach for learning 3D human joint locations from point clouds without human labels.
We show that by training on a large training set without any human-annotated keypoints, we are able to achieve reasonable performance compared to the fully supervised approach.
arXiv Detail & Related papers (2023-06-07T19:46:30Z)
- ALSO: Automotive Lidar Self-supervision by Occupancy estimation [70.70557577874155]
We propose a new self-supervised method for pre-training the backbone of deep perception models operating on point clouds.
The core idea is to train the model on a pretext task which is the reconstruction of the surface on which the 3D points are sampled.
The intuition is that if the network is able to reconstruct the scene surface, given only sparse input points, then it probably also captures some fragments of semantic information.
arXiv Detail & Related papers (2022-12-12T13:10:19Z)
- Sim2Real Object-Centric Keypoint Detection and Description [40.58367357980036]
Keypoint detection and description play a central role in computer vision.
We propose an object-centric formulation, which additionally requires identifying which object each interest point belongs to.
We develop a sim2real contrastive learning mechanism that can generalize the model trained in simulation to real-world applications.
arXiv Detail & Related papers (2022-02-01T15:00:20Z)
- Few-Shot Keypoint Detection as Task Adaptation via Latent Embeddings [17.04471874483516]
Existing approaches either compute dense keypoint embeddings in a single forward pass, or allocate their full capacity to a sparse set of points.
In this paper we explore a middle ground based on the observation that the number of relevant points at a given time is typically relatively small.
Our main contribution is a novel architecture, inspired by few-shot task adaptation, which allows a sparse-style network to condition on a keypoint embedding.
arXiv Detail & Related papers (2021-12-09T13:25:42Z)
- Unsupervised Learning of Visual 3D Keypoints for Control [104.92063943162896]
Learning sensorimotor control policies from high-dimensional images crucially relies on the quality of the underlying visual representations.
We propose a framework to learn such a 3D geometric structure directly from images in an end-to-end unsupervised manner.
These discovered 3D keypoints tend to meaningfully capture robot joints as well as object movements in a consistent manner across both time and 3D space.
arXiv Detail & Related papers (2021-06-14T17:59:59Z)
- Multi-Modal Learning of Keypoint Predictive Models for Visual Object Manipulation [6.853826783413853]
Humans have impressive generalization capabilities when manipulating objects in novel environments, an ability often attributed to an internal body schema.
How to learn such body schemas for robots remains an open problem.
We develop a self-supervised approach that extends a robot's kinematic model from visual latent representations when the robot grasps an object.
arXiv Detail & Related papers (2020-11-08T01:04:59Z)
- CoKe: Localized Contrastive Learning for Robust Keypoint Detection [24.167397429511915]
We show that keypoint kernels can be chosen to optimize three types of distances in the feature space.
We formulate this optimization process within a single framework that includes supervised contrastive learning.
CoKe achieves state-of-the-art results compared to approaches that jointly represent all keypoints holistically.
arXiv Detail & Related papers (2020-09-29T16:00:43Z)
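As a loose illustration of the localized contrastive idea mentioned for CoKe (a minimal sketch under our own assumptions, not the CoKe formulation, which distinguishes three types of feature-space distances), the snippet below pulls each keypoint's feature toward its own prototype vector and pushes it away from the prototypes of the other keypoints via a per-keypoint softmax cross-entropy. The prototype representation and temperature value are illustrative assumptions.

```python
# Sketch of a localized contrastive loss over keypoint features.
# Assumes one L2-normalised feature vector per keypoint and one learned,
# L2-normalised prototype per keypoint class.
import numpy as np

def localized_contrastive_loss(features, prototypes, temperature=0.07):
    """features: (K, D) keypoint features; prototypes: (K, D) prototypes.
    Returns the mean cross-entropy of assigning keypoint k to prototype k,
    so each keypoint is attracted to its own prototype and repelled from
    the others."""
    logits = features @ prototypes.T / temperature        # (K, K) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```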
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.