Understanding Action Sequences based on Video Captioning for
Learning-from-Observation
- URL: http://arxiv.org/abs/2101.05061v1
- Date: Wed, 9 Dec 2020 05:22:01 GMT
- Title: Understanding Action Sequences based on Video Captioning for
Learning-from-Observation
- Authors: Iori Yanokura, Naoki Wake, Kazuhiro Sasabuchi, Katsushi Ikeuchi,
Masayuki Inaba
- Abstract summary: We propose a Learning-from-Observation framework that splits and understands a video of a human demonstration with verbal instructions to extract accurate action sequences.
The splitting is done based on local minimum points of the hand velocity, which align human daily-life actions with object-centered face contact transitions.
We match the motion descriptions with the verbal instructions to understand the correct human intent and ignore the unintended actions inside the video.
- Score: 14.467714234267307
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning actions from human demonstration video is promising for intelligent
robotic systems. Extracting the exact section and re-observing the extracted
video section in detail is important for imitating complex skills because human
motions give valuable hints for robots. However, general video
understanding methods focus on understanding the full frame, lacking
consideration of how to extract accurate sections and align them with the
human's intent. We propose a Learning-from-Observation framework that splits
and understands a video of a human demonstration with verbal instructions to
extract accurate action sequences. The splitting is done based on local minimum
points of the hand velocity, which align human daily-life actions with
object-centered face contact transitions required for generating robot motion.
Then, we extract a motion description on the split videos using video
captioning techniques that are trained from our new daily-life action video
dataset. Finally, we match the motion descriptions with the verbal instructions
to understand the correct human intent and ignore the unintended actions inside
the video. We evaluate the validity of hand velocity-based video splitting and
demonstrate that it is effective. The experimental results on our new video
captioning dataset focusing on daily-life human actions demonstrate the
effectiveness of the proposed method. The source code, trained models, and the
dataset will be made available.
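The splitting step described above can be sketched in code. This is a minimal illustration, not the authors' implementation: the smoothing window, frame rate, and toy trajectory are assumptions, and a real system would operate on tracked hand keypoints from video.

```python
import numpy as np

def split_at_velocity_minima(hand_positions, fps=30, smooth=5):
    """Return frame indices at local minima of hand speed (candidate split points).

    hand_positions: (T, D) array of per-frame hand coordinates (D = 2 or 3).
    """
    pos = np.asarray(hand_positions, dtype=float)
    # Per-frame speed via finite differences, scaled to units per second.
    speed = np.linalg.norm(np.diff(pos, axis=0), axis=1) * fps
    # Moving-average smoothing to suppress hand-tracking jitter.
    kernel = np.ones(smooth) / smooth
    speed = np.convolve(speed, kernel, mode="same")
    # A frame is a split candidate if strictly slower than both neighbours.
    return [i for i in range(1, len(speed) - 1)
            if speed[i] < speed[i - 1] and speed[i] < speed[i + 1]]

# Toy 1-D trajectory: steady motion with a brief slowdown around frame 3.
positions = np.array([[0.0], [1.0], [2.0], [3.0], [3.1], [4.0], [5.0], [6.0]])
print(split_at_velocity_minima(positions, fps=1, smooth=1))  # [3]
```

Each returned index marks a pause in the hand motion; in the paper's pipeline, the video segments between consecutive minima would then be passed to the captioning model.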
Related papers
- VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation [53.63540587160549]
VidBot is a framework enabling zero-shot robotic manipulation using learned 3D affordance from in-the-wild monocular RGB-only human videos.
VidBot paves the way for leveraging everyday human videos to make robot learning more scalable.
arXiv Detail & Related papers (2025-03-10T10:04:58Z)
- Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos [64.48857272250446]
We introduce Moto, which converts video content into latent Motion Token sequences by a Latent Motion Tokenizer.
We pre-train Moto-GPT through motion token autoregression, enabling it to capture diverse visual motion knowledge.
To transfer learned motion priors to real robot actions, we implement a co-fine-tuning strategy that seamlessly bridges latent motion token prediction and real robot control.
arXiv Detail & Related papers (2024-12-05T18:57:04Z)
- MotionLLM: Understanding Human Behaviors from Human Motions and Videos [40.132643319573205]
This study delves into the realm of multi-modality (i.e., video and motion modalities) human behavior understanding.
We present MotionLLM, a framework for human motion understanding, captioning, and reasoning.
arXiv Detail & Related papers (2024-05-30T17:59:50Z)
- Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training [69.54948297520612]
Learning a generalist embodied agent poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets.
We introduce a novel framework to tackle these challenges, which leverages a unified discrete diffusion to combine generative pre-training on human videos and policy fine-tuning on a small number of action-labeled robot videos.
Our method generates high-fidelity future videos for planning and enhances the fine-tuned policies compared to previous state-of-the-art approaches.
arXiv Detail & Related papers (2024-02-22T09:48:47Z)
- Any-point Trajectory Modeling for Policy Learning [64.23861308947852]
We introduce Any-point Trajectory Modeling (ATM) to predict future trajectories of arbitrary points within a video frame.
ATM outperforms strong video pre-training baselines by 80% on average.
We show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology.
arXiv Detail & Related papers (2023-12-28T23:34:43Z)
- Learning Video-Conditioned Policies for Unseen Manipulation Tasks [83.2240629060453]
Video-conditioned Policy learning maps human demonstrations of previously unseen tasks to robot manipulation skills.
We learn our policy to generate appropriate actions given current scene observations and a video of the target task.
We validate our approach on a set of challenging multi-task robot manipulation environments and outperform the state of the art.
arXiv Detail & Related papers (2023-05-10T16:25:42Z)
- Zero-Shot Robot Manipulation from Passive Human Videos [59.193076151832145]
We develop a framework for extracting agent-agnostic action representations from human videos.
Our framework is based on predicting plausible human hand trajectories.
We deploy the trained model zero-shot for physical robot manipulation tasks.
arXiv Detail & Related papers (2023-02-03T21:39:52Z)
- Learning Reward Functions for Robotic Manipulation by Observing Humans [92.30657414416527]
We use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-agnostic reward function for robotic manipulation policies.
The learned rewards are based on distances to a goal in an embedding space learned using a time-contrastive objective.
arXiv Detail & Related papers (2022-11-16T16:26:48Z)
- Video2Skill: Adapting Events in Demonstration Videos to Skills in an Environment using Cyclic MDP Homomorphisms [16.939129935919325]
Video2Skill (V2S) attempts to extend the human ability to learn from demonstrations to artificial agents by allowing a robot arm to learn from human cooking videos.
We first use sequence-to-sequence Auto-Encoder style architectures to learn a temporal latent space for events in long-horizon demonstrations.
We then transfer these representations to the robotic target domain, using a small amount of offline and unrelated interaction data.
arXiv Detail & Related papers (2021-09-08T17:59:01Z)
- Learning by Watching: Physical Imitation of Manipulation Skills from Human Videos [28.712673809577076]
We present an approach for physical imitation from human videos for robot manipulation tasks.
We design a perception module that learns to translate human videos to the robot domain followed by unsupervised keypoint detection.
We evaluate the effectiveness of our approach on five robot manipulation tasks, including reaching, pushing, sliding, coffee making, and drawer closing.
arXiv Detail & Related papers (2021-01-18T18:50:32Z)
- Learning Object Manipulation Skills via Approximate State Estimation from Real Videos [47.958512470724926]
Humans are adept at learning new tasks by watching a few instructional videos.
On the other hand, robots that learn new actions either require a lot of effort through trial and error, or use expert demonstrations that are challenging to obtain.
In this paper, we explore a method that facilitates learning object manipulation skills directly from videos.
arXiv Detail & Related papers (2020-11-13T08:53:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.