KITE: Keypoint-Conditioned Policies for Semantic Manipulation
- URL: http://arxiv.org/abs/2306.16605v4
- Date: Wed, 11 Oct 2023 18:09:59 GMT
- Title: KITE: Keypoint-Conditioned Policies for Semantic Manipulation
- Authors: Priya Sundaresan, Suneel Belkhale, Dorsa Sadigh, Jeannette Bohg
- Abstract summary: Keypoints + Instructions to Execution (KITE) is a two-step framework for semantic manipulation.
It first grounds an input instruction in a visual scene through 2D image keypoints.
KITE then executes a learned keypoint-conditioned skill to carry out the instruction.
- Score: 40.63568980167196
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While natural language offers a convenient shared interface for humans and
robots, enabling robots to interpret and follow language commands remains a
longstanding challenge in manipulation. A crucial step to realizing a
performant instruction-following robot is achieving semantic manipulation,
where a robot interprets language at different specificities, from high-level
instructions like "Pick up the stuffed animal" to more detailed inputs like
"Grab the left ear of the elephant." To tackle this, we propose Keypoints +
Instructions to Execution (KITE), a two-step framework for semantic
manipulation which attends to both scene semantics (distinguishing between
different objects in a visual scene) and object semantics (precisely localizing
different parts within an object instance). KITE first grounds an input
instruction in a visual scene through 2D image keypoints, providing a highly
accurate object-centric bias for downstream action inference. Provided an RGB-D
scene observation, KITE then executes a learned keypoint-conditioned skill to
carry out the instruction. The combined precision of keypoints and
parameterized skills enables fine-grained manipulation with generalization to
scene and object variations. Empirically, we demonstrate KITE in 3 real-world
environments: long-horizon 6-DoF tabletop manipulation, semantic grasping, and
a high-precision coffee-making task. In these settings, KITE achieves a 75%,
70%, and 71% overall success rate for instruction-following, respectively. KITE
outperforms frameworks that opt for pre-trained visual language models over
keypoint-based grounding, or omit skills in favor of end-to-end visuomotor
control, all while being trained from fewer or comparable amounts of
demonstrations. Supplementary material, datasets, code, and videos can be found
on our website: http://tinyurl.com/kite-site.
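The abstract describes a two-step pipeline: ground the instruction to a 2D image keypoint, deproject it to 3D using the depth channel, then execute a keypoint-conditioned skill. A minimal sketch of that control flow is below; all names (`ground_instruction`, `keypoint_to_3d`, `execute_skill`) and the stand-in heatmap are illustrative assumptions, not the paper's actual API or model.

```python
import numpy as np

def ground_instruction(rgb, instruction):
    """Step 1 (sketch): map (image, instruction) to a 2D keypoint.
    A real grounding model would predict this heatmap from the RGB
    image and the language input; here a stand-in peak is used."""
    h, w = rgb.shape[:2]
    heatmap = np.zeros((h, w), dtype=np.float32)
    heatmap[h // 3, w // 2] = 1.0  # placeholder for a learned prediction
    v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(u), int(v)  # pixel coordinates (column, row)

def keypoint_to_3d(u, v, depth, K):
    """Deproject the 2D keypoint into a 3D point using the depth map
    and the pinhole intrinsics matrix K."""
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.array([x, y, z])

def execute_skill(skill_name, point_3d):
    """Step 2 (stub): a keypoint-conditioned skill would consume the
    3D point as its goal parameter and produce a motion plan."""
    return {"skill": skill_name, "target": point_3d}

# Example: one inference step on a synthetic 480x640 RGB-D observation.
rgb = np.zeros((480, 640, 3), dtype=np.uint8)
depth = np.full((480, 640), 0.5, dtype=np.float32)  # meters
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
u, v = ground_instruction(rgb, "Grab the left ear of the elephant")
action = execute_skill("grasp", keypoint_to_3d(u, v, depth, K))
```

The separation matters: the keypoint gives a precise object-centric target, while the skill handles the 6-DoF motion, which is why the framework generalizes across scene and object variations.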
Related papers
- NaturalVLM: Leveraging Fine-grained Natural Language for Affordance-Guided Visual Manipulation [21.02437461550044]
Many real-world tasks demand intricate multi-step reasoning.
We introduce a benchmark, NrVLM, comprising 15 distinct manipulation tasks.
We propose a novel learning framework that completes the manipulation task step-by-step according to the fine-grained instructions.
arXiv Detail & Related papers (2024-03-13T09:12:16Z)
- Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability to localize active objects by learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z)
- Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with the environments in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy-learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z)
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [140.48218261864153]
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control.
Our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training.
arXiv Detail & Related papers (2023-07-28T21:18:02Z)
- Leveraging Large (Visual) Language Models for Robot 3D Scene Understanding [25.860680905256174]
We investigate the use of pre-trained language models to impart common sense for scene understanding.
We find that the best approaches in both categories yield roughly 70% room classification accuracy.
arXiv Detail & Related papers (2022-09-12T21:36:58Z)
- Learning 6-DoF Object Poses to Grasp Category-level Objects by Language Instructions [74.63313641583602]
This paper studies the task of grasping arbitrary objects from known categories by following free-form language instructions.
We bring these disciplines together on this open challenge, which is essential to human-robot interaction.
We propose a language-guided 6-DoF category-level object localization model to achieve robotic grasping by comprehending human intention.
arXiv Detail & Related papers (2022-05-09T04:25:14Z)
- Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation [80.29069988090912]
We study the problem of learning a range of vision-based manipulation tasks from a large offline dataset of robot interaction.
We propose to leverage offline robot datasets with crowd-sourced natural language labels.
We find that our approach outperforms both goal-image specifications and language conditioned imitation techniques by more than 25%.
arXiv Detail & Related papers (2021-09-02T17:42:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.