Action Image Representation: Learning Scalable Deep Grasping Policies with Zero Real World Data
- URL: http://arxiv.org/abs/2005.06594v1
- Date: Wed, 13 May 2020 21:40:21 GMT
- Title: Action Image Representation: Learning Scalable Deep Grasping Policies with Zero Real World Data
- Authors: Mohi Khansari, Daniel Kappler, Jianlan Luo, Jeff Bingham, Mrinal Kalakrishnan
- Abstract summary: Action Image represents a grasp proposal as an image and uses a deep convolutional network to infer grasp quality.
We show that this representation works on a variety of inputs, including color images (RGB), depth images (D), and combined color-depth (RGB-D).
- Score: 12.554739620645917
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces Action Image, a new grasp proposal representation that
allows learning an end-to-end deep-grasping policy. Our model achieves $84\%$
grasp success on $172$ real world objects while being trained only in
simulation on $48$ objects with just naive domain randomization. Similar to
computer vision problems, such as object detection, Action Image builds on the
idea that object features are invariant to translation in image space.
Therefore, grasp quality is invariant when evaluating the object-gripper
relationship; a successful grasp for an object depends on its local context,
but is independent of the surrounding environment. Action Image represents a
grasp proposal as an image and uses a deep convolutional network to infer grasp
quality. We show that by using an Action Image representation, trained networks
are able to extract local, salient features of grasping tasks that generalize
across different objects and environments. We show that this representation
works on a variety of inputs, including color images (RGB), depth images (D),
and combined color-depth (RGB-D). Our experimental results demonstrate that
networks utilizing an Action Image representation exhibit strong domain
transfer between training on simulated data and inference on real-world sensor
streams. Finally, our experiments show that a network trained with Action Image
improves grasp success ($84\%$ vs. $53\%$) over a baseline model with the same
structure, but using actions encoded as vectors.
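For intuition, here is a minimal sketch of the idea the abstract describes: rasterize a grasp proposal (e.g., projected fingertip locations) into its own image channel, stack it with the camera image, and score the stack with a convolutional network. The helper names and the tiny network below are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

def render_action_image(grasp_px, height=64, width=64, radius=2):
    """Rasterize grasp keypoints (pixel coords) into a single-channel image."""
    canvas = torch.zeros(1, height, width)
    for u, v in grasp_px:  # e.g., projected fingertip positions
        u0, u1 = max(0, u - radius), min(width, u + radius + 1)
        v0, v1 = max(0, v - radius), min(height, v + radius + 1)
        canvas[0, v0:v1, u0:u1] = 1.0
    return canvas

class GraspQualityNet(nn.Module):
    """Scores an (image, action-image) stack; an illustrative CNN, not the paper's."""
    def __init__(self, in_channels=4):  # RGB + 1 action channel
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, 1)

    def forward(self, rgb, action_img):
        x = torch.cat([rgb, action_img], dim=1)  # stack along channels
        return torch.sigmoid(self.head(self.features(x).flatten(1)))

rgb = torch.rand(1, 3, 64, 64)
action = render_action_image([(20, 30), (40, 30)]).unsqueeze(0)
quality = GraspQualityNet()(rgb, action)  # grasp-quality score in [0, 1]
```

Because the proposal lives in image space alongside the observation, convolutional translation invariance applies to the object-gripper pair directly, which is the property the abstract credits for the sim-to-real transfer.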
Related papers
- Natural Language Can Help Bridge the Sim2Real Gap [9.458180590551715]
Sim2Real is a promising paradigm for overcoming data scarcity in the real-world target domain.
We propose using natural language descriptions of images as a unifying signal across domains.
We demonstrate that training the image encoder to predict the language description serves as a useful, data-efficient pretraining step.
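A minimal sketch of that pretraining step, assuming a small convolutional encoder and precomputed text embeddings standing in for a frozen language model; none of these choices are claimed to match the paper's setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, images):
        return F.normalize(self.proj(self.backbone(images)), dim=-1)

def description_loss(image_emb, text_emb):
    """Pull image embeddings toward embeddings of their language descriptions."""
    return 1.0 - F.cosine_similarity(image_emb, F.normalize(text_emb, dim=-1)).mean()

encoder = ImageEncoder()
images = torch.rand(8, 3, 64, 64)
text_emb = torch.rand(8, 256)   # stand-in for a frozen text encoder's outputs
loss = description_loss(encoder(images), text_emb)
loss.backward()
```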
arXiv Detail & Related papers (2024-05-16T12:02:02Z)
- CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition [73.51329037954866]
We propose a robust global representation method with cross-image correlation awareness for visual place recognition.
Our method uses the attention mechanism to correlate multiple images within a batch.
Our method outperforms state-of-the-art methods by a large margin with significantly less training time.
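A sketch of the batch-level attention idea, using a standard multi-head attention layer to let each image's descriptor attend to the others in the batch; the dimensions and residual layout are assumptions.

```python
import torch
import torch.nn as nn

class CrossImageCorrelation(nn.Module):
    """Self-attention over per-image descriptors, treating the batch as a
    sequence so each descriptor can attend to the other images in the batch."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, descriptors):           # (batch, dim) per-image descriptors
        x = descriptors.unsqueeze(0)          # (1, batch, dim): batch acts as sequence
        attended, _ = self.attn(x, x, x)
        return self.norm(x + attended).squeeze(0)

desc = torch.rand(16, 256)                    # e.g., from any image backbone
refined = CrossImageCorrelation()(desc)       # batch-aware place descriptors
```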
arXiv Detail & Related papers (2024-02-29T15:05:11Z)
- Seeing the Unseen: Visual Common Sense for Semantic Placement [71.76026880991245]
Given an image and the name of an object, a vision system is asked to predict semantically meaningful regions (masks or bounding boxes) in the image where that object could be placed or is likely to be placed by humans.
We call this task Semantic Placement (SP) and believe that such common-sense visual understanding is critical for assistive robots (tidying a house) and AR devices (automatically rendering an object in the user's space).
arXiv Detail & Related papers (2024-01-15T15:28:30Z)
- IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks [124.90137528319273]
In this paper, we present IMProv, a generative model that is able to in-context learn visual tasks from multimodal prompts.
We train a masked generative transformer on a new dataset of figures from computer vision papers and their associated captions.
During inference time, we prompt the model with text and/or image task example(s) and have the model inpaint the corresponding output.
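A sketch of how such an inpainting-style prompt can be laid out, assuming a 2x2 grid (example input/output on top, query plus masked cell below) and a stub standing in for the generative model; IMProv's actual prompt format may differ.

```python
import numpy as np

def make_visual_prompt(example_in, example_out, query_in, cell=64):
    """Lay out a 2x2 grid: (example input, example output) on top,
    (query input, masked cell) below; the model inpaints the masked cell."""
    canvas = np.zeros((2 * cell, 2 * cell, 3), dtype=np.float32)
    mask = np.zeros((2 * cell, 2 * cell), dtype=bool)
    canvas[:cell, :cell] = example_in
    canvas[:cell, cell:] = example_out
    canvas[cell:, :cell] = query_in
    mask[cell:, cell:] = True        # bottom-right is what the model must fill in
    return canvas, mask

def inpaint(canvas, mask):
    """Stub for a masked generative model; a real system would generate here."""
    return canvas

ex_in, ex_out, q = (np.random.rand(64, 64, 3) for _ in range(3))
canvas, mask = make_visual_prompt(ex_in, ex_out, q)
result = inpaint(canvas, mask)       # the inpainted region is the task output
```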
arXiv Detail & Related papers (2023-12-04T09:48:29Z)
- Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects.
In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL).
A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z)
- Context-driven Visual Object Recognition based on Knowledge Graphs [0.8701566919381223]
We propose an approach that enhances deep learning methods by using external contextual knowledge encoded in a knowledge graph.
We conduct a series of experiments to investigate the impact of different contextual views on the learned object representations for the same image dataset.
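A sketch of the general pattern, concatenating CNN image features with a contextual knowledge-graph embedding before classification; the embedding source and dimensions are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class KGContextClassifier(nn.Module):
    """Concatenate CNN image features with a contextual knowledge-graph
    embedding before classification."""
    def __init__(self, num_classes=10, kg_dim=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(32 + kg_dim, num_classes)

    def forward(self, images, kg_context):
        feats = self.cnn(images)
        return self.classifier(torch.cat([feats, kg_context], dim=-1))

images = torch.rand(4, 3, 64, 64)
kg_context = torch.rand(4, 64)   # e.g., graph embedding of the scene's context node
logits = KGContextClassifier()(images, kg_context)
```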
arXiv Detail & Related papers (2022-10-20T13:09:00Z)
- Semantic decoupled representation learning for remote sensing image change detection [17.548248093344576]
We propose a semantic decoupled representation learning method for remote sensing (RS) image change detection (CD).
We disentangle representations of different semantic regions by leveraging the semantic mask.
We additionally force the model to distinguish different semantic representations, which benefits the recognition of objects of interest in the downstream CD task.
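A sketch of mask-based disentangling: average-pool the feature map separately inside each semantic region, giving one representation per class that a contrastive term could then push apart; the shapes and pooling choice are assumptions.

```python
import torch
import torch.nn.functional as F

def region_representations(features, semantic_mask, num_classes):
    """Average-pool feature maps separately inside each semantic region,
    yielding one decoupled representation per semantic class."""
    b, c, h, w = features.shape
    mask = F.interpolate(semantic_mask.float().unsqueeze(1), size=(h, w), mode="nearest")
    mask = mask.squeeze(1).long()
    onehot = F.one_hot(mask, num_classes).permute(0, 3, 1, 2).float()  # (b, k, h, w)
    area = onehot.sum(dim=(2, 3)).clamp(min=1.0)                       # pixels per region
    pooled = torch.einsum("bchw,bkhw->bkc", features, onehot)          # sum per region
    return pooled / area.unsqueeze(-1)                                 # (b, k, c)

features = torch.rand(2, 32, 16, 16)
masks = torch.randint(0, 4, (2, 64, 64))
regions = region_representations(features, masks, num_classes=4)
```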
arXiv Detail & Related papers (2022-01-15T07:35:26Z)
- Self-Supervised Learning of Domain Invariant Features for Depth Estimation [35.74969527929284]
We tackle the problem of unsupervised synthetic-to-realistic domain adaptation for single image depth estimation.
An essential building block of single image depth estimation is an encoder-decoder task network that takes RGB images as input and produces depth maps as output.
We propose a novel training strategy to force the task network to learn domain invariant representations in a self-supervised manner.
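A minimal sketch of such an encoder-decoder task network (RGB in, dense positive depth out); the domain-invariance training itself is omitted, and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class DepthNet(nn.Module):
    """Minimal encoder-decoder: RGB image in, dense depth map out."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Softplus(),  # depth > 0
        )

    def forward(self, rgb):
        return self.decoder(self.encoder(rgb))

depth = DepthNet()(torch.rand(1, 3, 64, 64))   # (1, 1, 64, 64) depth map
```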
arXiv Detail & Related papers (2021-06-04T16:45:48Z)
- Self-Supervised Representation Learning from Flow Equivariance [97.13056332559526]
We present a new self-supervised learning representation framework that can be directly deployed on a video stream of complex scenes.
Our representations, learned from high-resolution raw video, can be readily used for downstream tasks on static images.
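A sketch of a flow-equivariance constraint: warp the features of frame t by the optical flow and ask them to match the features of frame t+1; the warping direction and the MSE loss are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def warp_features(features, flow):
    """Warp a feature map by optical flow (in pixels) via bilinear sampling."""
    b, c, h, w = features.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().expand(b, h, w, 2) + flow.permute(0, 2, 3, 1)
    grid = 2.0 * grid / torch.tensor([w - 1, h - 1]).float() - 1.0  # normalize to [-1, 1]
    return F.grid_sample(features, grid, align_corners=True)

def equivariance_loss(feat_t, feat_t1, flow):
    """Features of frame t, warped by the flow, should match features of frame t+1."""
    return F.mse_loss(warp_features(feat_t, flow), feat_t1)

f_t, f_t1 = torch.rand(2, 16, 32, 32), torch.rand(2, 16, 32, 32)
flow = torch.zeros(2, 2, 32, 32)     # (dx, dy) per pixel; zero flow = identity warp
loss = equivariance_loss(f_t, f_t1, flow)
```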
arXiv Detail & Related papers (2021-01-16T23:44:09Z)
- Domain Adaptation with Morphologic Segmentation [8.0698976170854]
We present a novel domain adaptation framework that uses morphologic segmentation to translate images from arbitrary input domains (real and synthetic) into a uniform output domain.
Our goal is to establish a preprocessing step that unifies data from multiple sources into a common representation.
We showcase the effectiveness of our approach by qualitatively and quantitatively evaluating our method on four data sets of simulated and real data of urban scenes.
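A rough stand-in for such a preprocessing step, using edge detection plus morphological closing to strip domain-specific appearance; the paper's morphologic segmentation is more involved, so treat this as illustrative only.

```python
import cv2
import numpy as np

def to_uniform_domain(image_bgr):
    """Map an image from any source domain (real or synthetic) into a shared,
    appearance-stripped representation; illustrative, not the paper's pipeline."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)                           # drop texture/color
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    closed = cv2.morphologyEx(edges, cv2.MORPH_CLOSE, kernel)  # join nearby contours
    return closed                                              # common output domain

img = (np.random.rand(128, 128, 3) * 255).astype(np.uint8)    # stand-in for a frame
uniform = to_uniform_domain(img)
```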
arXiv Detail & Related papers (2020-06-16T17:06:02Z)
- Instance-aware Image Colorization [51.12040118366072]
In this paper, we propose a method for achieving instance-aware colorization.
Our network architecture leverages an off-the-shelf object detector to obtain cropped object images.
We use a similar network to extract the full-image features and apply a fusion module to predict the final colors.
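A sketch of the fusion idea: resize per-instance features and blend them back into the full-image feature map inside each detector box; the 50/50 blend is a placeholder for the paper's learned fusion weights.

```python
import torch
import torch.nn as nn

def paste_instance_features(full_feats, inst_feats, boxes):
    """Blend per-instance features into the full-image feature map inside
    each detector box; a simplified stand-in for the fusion module."""
    fused = full_feats.clone()
    for feats, (x0, y0, x1, y1) in zip(inst_feats, boxes):
        resized = nn.functional.interpolate(
            feats.unsqueeze(0), size=(y1 - y0, x1 - x0),
            mode="bilinear", align_corners=False)
        fused[:, :, y0:y1, x0:x1] = 0.5 * fused[:, :, y0:y1, x0:x1] + 0.5 * resized
    return fused

full = torch.rand(1, 32, 64, 64)          # features of the whole image
inst = [torch.rand(32, 16, 16)]           # features of one detected crop
boxes = [(8, 8, 40, 40)]                  # (x0, y0, x1, y1) in feature-map coords
fused = paste_instance_features(full, inst, boxes)
```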
arXiv Detail & Related papers (2020-05-21T17:59:23Z)