Perceive, Represent, Generate: Translating Multimodal Information to
Robotic Motion Trajectories
- URL: http://arxiv.org/abs/2204.03051v1
- Date: Wed, 6 Apr 2022 19:31:18 GMT
- Title: Perceive, Represent, Generate: Translating Multimodal Information to
Robotic Motion Trajectories
- Authors: Fábio Vital, Miguel Vasco, Alberto Sardinha, and Francisco Melo
- Abstract summary: Perceive-Represent-Generate (PRG) is a framework that maps perceptual information of different modalities to an adequate sequence of movements to be executed by a robot.
We evaluate our pipeline in the context of a novel robotic handwriting task, where the robot receives as input a word through different perceptual modalities (e.g., image, sound) and generates the corresponding motion trajectory to write it.
- Score: 1.0499611180329804
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Perceive-Represent-Generate (PRG), a novel three-stage framework
that maps perceptual information of different modalities (e.g., visual or
sound), corresponding to a sequence of instructions, to an adequate sequence of
movements to be executed by a robot. In the first stage, we perceive and
pre-process the given inputs, isolating individual commands from the complete
instruction provided by a human user. In the second stage we encode the
individual commands into a multimodal latent space, employing a deep generative
model. Finally, in the third stage we convert the multimodal latent values into
individual trajectories and combine them into a single dynamic movement
primitive, allowing its execution in a robotic platform. We evaluate our
pipeline in the context of a novel robotic handwriting task, where the robot
receives as input a word through different perceptual modalities (e.g., image,
sound), and generates the corresponding motion trajectory to write it, creating
coherent and readable handwritten words.
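The abstract describes a three-stage flow: segment the instruction into individual commands, encode each command into a multimodal latent code, and decode and stitch the results into a single motion. The snippet below is a minimal sketch of that flow under stated assumptions; the function names (perceive, represent, generate), the placeholder segmentation, the random stand-in for the learned latent code, and the sinusoidal stand-in strokes are all illustrative and are not the authors' API or model.

```python
# Minimal sketch of the three-stage PRG flow described in the abstract.
# All function names, shapes, and stand-in computations are assumptions.
import numpy as np

def perceive(instruction):
    """Stage 1 (assumed): split a full instruction (e.g. a spoken or written
    word) into individual commands such as single characters."""
    return list(instruction)  # placeholder segmentation

def represent(command):
    """Stage 2 (assumed): encode one command into a shared multimodal latent
    vector; the paper uses a deep generative model for this step."""
    rng = np.random.default_rng(abs(hash(command)) % (2**32))
    return rng.standard_normal(16)  # stand-in for the learned latent code

def generate(latent, offset_x):
    """Stage 3 (assumed): decode a latent code into a short 2D pen trajectory
    and shift it so consecutive characters do not overlap."""
    t = np.linspace(0.0, 1.0, 50)
    stroke = np.stack([t + offset_x, np.sin(2 * np.pi * t) * 0.1], axis=1)
    return stroke  # stand-in for the decoded per-character trajectory

def prg_pipeline(instruction, char_width=1.2):
    """Chain the stages and concatenate per-character trajectories into one
    motion, standing in for the single dynamic movement primitive."""
    strokes = []
    for i, command in enumerate(perceive(instruction)):
        latent = represent(command)
        strokes.append(generate(latent, offset_x=i * char_width))
    return np.concatenate(strokes, axis=0)  # (N, 2) combined trajectory

trajectory = prg_pipeline("cat")
print(trajectory.shape)  # e.g. (150, 2)
```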
Related papers
- Context-Aware Command Understanding for Tabletop Scenarios [1.7082212774297747]
This paper presents a novel hybrid algorithm designed to interpret natural human commands in tabletop scenarios.
By integrating multiple sources of information, including speech, gestures, and scene context, the system extracts actionable instructions for a robot.
We discuss the strengths and limitations of the system, with particular focus on how it handles multimodal command interpretation.
arXiv Detail & Related papers (2024-10-08T20:46:39Z) - Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip Transformer [62.29951737214263]
Existing algorithms directly generate the full sequence which is expensive and prone to errors.
We propose KeyMotion, which generates plausible human motion sequences corresponding to input text.
We use a Variational Autoencoder (VAE) with Kullback-Leibler regularization to project the motion into a latent space.
For the reverse diffusion, we propose a novel Parallel Skip Transformer that performs cross-modal attention between the design latents and text condition.
arXiv Detail & Related papers (2024-05-24T11:12:37Z) - RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis [102.1876259853457]
We propose a tree-structured multimodal code generation framework for generalized robotic behavior synthesis, termed RoboCodeX.
RoboCodeX decomposes high-level human instructions into multiple object-centric manipulation units consisting of physical preferences such as affordance and safety constraints.
To further enhance the capability to map conceptual and perceptual understanding into control commands, a specialized multimodal reasoning dataset is collected for pre-training and an iterative self-updating methodology is introduced for supervised fine-tuning.
arXiv Detail & Related papers (2024-02-25T15:31:43Z) - RoboScript: Code Generation for Free-Form Manipulation Tasks across Real
and Simulation [77.41969287400977]
This paper presents RobotScript, a platform for a deployable robot manipulation pipeline powered by code generation.
We also present a benchmark for code generation for robot manipulation tasks described in free-form natural language.
We demonstrate the adaptability of our code generation framework across multiple robot embodiments, including the Franka and UR5 robot arms.
arXiv Detail & Related papers (2024-02-22T15:12:00Z) - Co-Speech Gesture Synthesis using Discrete Gesture Token Learning [1.1694169299062596]
Synthesizing realistic co-speech gestures is an important and yet unsolved problem for creating believable motions.
One challenge in learning the co-speech gesture model is that there may be multiple viable gesture motions for the same speech utterance.
We propose a two-stage model to address this uncertainty in gesture synthesis by modeling the gesture segments as discrete latent codes.
arXiv Detail & Related papers (2023-03-04T01:42:09Z) - VIMA: General Robot Manipulation with Multimodal Prompts [82.01214865117637]
We show that a wide spectrum of robot manipulation tasks can be expressed with multimodal prompts.
We develop a new simulation benchmark that consists of thousands of procedurally-generated tabletop tasks.
We design a transformer-based robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively.
arXiv Detail & Related papers (2022-10-06T17:50:11Z) - Signs of Language: Embodied Sign Language Fingerspelling Acquisition
from Demonstrations for Human-Robot Interaction [1.0166477175169308]
We propose an approach for learning dexterous motor imitation from video examples without additional information.
We first build a URDF model of a robotic hand with a single actuator for each joint.
We then leverage pre-trained deep vision models to extract the 3D pose of the hand from RGB videos.
arXiv Detail & Related papers (2022-09-12T10:42:26Z) - Instruction-driven history-aware policies for robotic manipulations [82.25511767738224]
We propose a unified transformer-based approach that takes into account multiple inputs.
In particular, our transformer architecture integrates (i) natural language instructions and (ii) multi-view scene observations.
We evaluate our method on the challenging RLBench benchmark and on a real-world robot.
arXiv Detail & Related papers (2022-09-11T16:28:25Z) - TEMOS: Generating diverse human motions from textual descriptions [53.85978336198444]
We address the problem of generating diverse 3D human motions from textual descriptions.
We propose TEMOS, a text-conditioned generative model leveraging variational autoencoder (VAE) training with human motion data.
We show that the TEMOS framework can produce both skeleton-based animations, as in prior work, as well as more expressive SMPL body motions.
arXiv Detail & Related papers (2022-04-25T14:53:06Z)
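Several of the listed papers (KeyMotion, TEMOS) and the PRG encoding stage rely on projecting motion or commands into a latent space with a KL-regularized variational autoencoder. The snippet below is a minimal, generic sketch of that objective, not any of the listed papers' actual architectures; the toy linear encoder/decoder, dimensions, and beta weight are assumptions for illustration.

```python
# Minimal sketch of a KL-regularized VAE objective for motion features.
# The toy encoder/decoder, shapes, and beta weight are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionVAE(nn.Module):
    def __init__(self, motion_dim=64, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(motion_dim, 2 * latent_dim)  # outputs mean and log-variance
        self.dec = nn.Linear(latent_dim, motion_dim)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(x, recon, mu, logvar, beta=1e-3):
    """Reconstruction term plus the Kullback-Leibler regularizer that keeps the
    approximate posterior close to a standard normal prior."""
    recon_loss = F.mse_loss(recon, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + beta * kl

model = MotionVAE()
x = torch.randn(8, 64)  # a toy batch of flattened motion features
recon, mu, logvar = model(x)
print(vae_loss(x, recon, mu, logvar).item())
```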
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.