Understanding Contexts Inside Robot and Human Manipulation Tasks through
a Vision-Language Model and Ontology System in a Video Stream
- URL: http://arxiv.org/abs/2003.01163v1
- Date: Mon, 2 Mar 2020 19:48:59 GMT
- Title: Understanding Contexts Inside Robot and Human Manipulation Tasks through
a Vision-Language Model and Ontology System in a Video Stream
- Authors: Chen Jiang, Masood Dehghan, Martin Jagersand
- Abstract summary: We present a vision dataset under a strictly constrained knowledge domain for both robot and human manipulations.
We propose a scheme to generate a combination of visual attentions and an evolving knowledge graph filled with commonsense knowledge.
The proposed scheme allows the robot to mimic human-like intentional behaviors by watching real-time videos.
- Score: 4.450615100675747
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Manipulation tasks in daily life, such as pouring water, unfold intentionally
under specialized manipulation contexts. Being able to process contextual
knowledge in these Activities of Daily Living (ADLs) over time can help us
understand manipulation intentions, which are essential for an intelligent
robot to transition smoothly between various manipulation actions. In this
paper, to model the intended concepts of manipulation, we present a vision
dataset under a strictly constrained knowledge domain for both robot and human
manipulations, where manipulation concepts and relations are stored by an
ontology system in a taxonomic manner. Furthermore, we propose a scheme to
generate a combination of visual attentions and an evolving knowledge graph
filled with commonsense knowledge. Our scheme works with real-world camera
streams and fuses an attention-based Vision-Language model with the ontology
system. The experimental results demonstrate that the proposed scheme can
successfully represent the evolution of an intended object manipulation
procedure for both robots and humans. The proposed scheme allows the robot to
mimic human-like intentional behaviors by watching real-time videos. We aim to
develop this scheme further for real-world robot intelligence in Human-Robot
Interaction.
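The abstract describes fusing an attention-based vision-language model with an ontology system so that manipulation concepts detected in a camera stream accumulate into an evolving knowledge graph. The paper does not publish this interface, so the following is only a minimal Python sketch under that reading; the concept names, relations, and the detect_concepts() stub are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: an evolving knowledge graph fed by per-frame detections
# from a vision-language model, on top of a static taxonomic ontology.
# Concept/relation names and detect_concepts() are illustrative assumptions.
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


class EvolvingKnowledgeGraph:
    """Accumulates (subject, relation, object) triples over a video stream."""

    def __init__(self, taxonomy: Iterable[Triple]):
        # Static ontology part, e.g. IS-A relations stored "in a taxonomic manner".
        self.edges: Set[Triple] = set(taxonomy)

    def update(self, observed: Iterable[Triple]) -> None:
        """Merge triples detected in the current frame into the graph."""
        self.edges.update(observed)

    def neighbors(self, concept: str) -> Set[Triple]:
        """All stored triples that mention the given concept."""
        return {t for t in self.edges if concept in (t[0], t[2])}


def detect_concepts(frame) -> List[Triple]:
    """Stand-in for the attention-based vision-language model; a real system
    would ground these triples in visual attention over the frame."""
    return [("hand", "grasps", "cup"), ("cup", "contains", "water")]


# Toy taxonomy and a pretend three-frame video stream.
taxonomy = [("cup", "is_a", "container"), ("pouring", "is_a", "manipulation")]
kg = EvolvingKnowledgeGraph(taxonomy)
for frame in range(3):
    kg.update(detect_concepts(frame))
print(kg.neighbors("cup"))
```

In this reading, the ontology supplies the fixed taxonomic backbone while the per-frame detections grow the graph, which is how an "evolving" representation of an ongoing manipulation could be queried at any point in the video.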
Related papers
- Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation [65.46610405509338]
We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation.
Our framework, Track2Act, predicts tracks of how points in an image should move in future time-steps based on a goal.
We show that this approach of combining scalably learned track prediction with a residual policy enables diverse generalizable robot manipulation.
arXiv Detail & Related papers (2024-05-02T17:56:55Z)
- Towards Generalizable Zero-Shot Manipulation via Translating Human Interaction Plans [58.27029676638521]
We show how passive human videos can serve as a rich source of data for learning such generalist robots.
We learn a human plan predictor that, given a current image of a scene and a goal image, predicts the future hand and object configurations.
We show that our learned system can perform over 16 manipulation skills that generalize to 40 objects.
arXiv Detail & Related papers (2023-12-01T18:54:12Z)
- Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with their environments in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy-learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z)
- Zero-Shot Robot Manipulation from Passive Human Videos [59.193076151832145]
We develop a framework for extracting agent-agnostic action representations from human videos.
Our framework is based on predicting plausible human hand trajectories.
We deploy the trained model zero-shot for physical robot manipulation tasks.
arXiv Detail & Related papers (2023-02-03T21:39:52Z)
- A Road-map to Robot Task Execution with the Functional Object-Oriented Network [77.93376696738409]
The functional object-oriented network (FOON) is a knowledge graph representation for robots.
Taking the form of a bipartite graph, a FOON contains symbolic or high-level information that would be pertinent to a robot's understanding of its environment and tasks (a minimal structural sketch appears after this list).
arXiv Detail & Related papers (2021-06-01T00:43:04Z)
- Learning by Watching: Physical Imitation of Manipulation Skills from Human Videos [28.712673809577076]
We present an approach for physical imitation from human videos for robot manipulation tasks.
We design a perception module that learns to translate human videos to the robot domain followed by unsupervised keypoint detection.
We evaluate the effectiveness of our approach on five robot manipulation tasks, including reaching, pushing, sliding, coffee making, and drawer closing.
arXiv Detail & Related papers (2021-01-18T18:50:32Z)
- Learning Predictive Models From Observation and Interaction [137.77887825854768]
Learning predictive models from interaction with the world allows an agent, such as a robot, to learn about how the world works.
However, learning a model that captures the dynamics of complex skills represents a major challenge.
We propose a method to augment the training set with observational data of other agents, such as humans.
arXiv Detail & Related papers (2019-12-30T01:10:41Z)
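The FOON entry above describes a bipartite knowledge graph of object and motion nodes. In the FOON literature these nodes are typically grouped into functional units (input objects, a motion, output objects); the sketch below is a minimal Python illustration under that assumption, with hypothetical node names rather than the authors' released implementation.

```python
# Illustrative sketch of a FOON-style bipartite graph: object nodes and motion
# nodes grouped into functional units. Node names are hypothetical examples;
# this is not the FOON authors' code.
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ObjectNode:
    name: str
    state: str = ""          # e.g. "empty", "filled"


@dataclass(frozen=True)
class MotionNode:
    name: str                # e.g. "pour"


@dataclass
class FunctionalUnit:
    """One manipulation step: input objects -> motion -> output objects."""
    inputs: List[ObjectNode]
    motion: MotionNode
    outputs: List[ObjectNode]


@dataclass
class FOON:
    units: List[FunctionalUnit] = field(default_factory=list)

    def add_unit(self, unit: FunctionalUnit) -> None:
        self.units.append(unit)

    def motions_producing(self, obj: ObjectNode) -> List[MotionNode]:
        """Motions whose functional unit outputs the given object node."""
        return [u.motion for u in self.units if obj in u.outputs]


# Toy example: pouring water from a bottle into a cup.
foon = FOON()
foon.add_unit(FunctionalUnit(
    inputs=[ObjectNode("bottle", "filled"), ObjectNode("cup", "empty")],
    motion=MotionNode("pour"),
    outputs=[ObjectNode("bottle", "empty"), ObjectNode("cup", "filled")],
))
print(foon.motions_producing(ObjectNode("cup", "filled")))  # [MotionNode(name='pour')]
```

Querying which motions produce a desired object state, as in the last line, is the kind of symbolic lookup that makes such a graph useful for task planning.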
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.