Shaping embodied agent behavior with activity-context priors from
egocentric video
- URL: http://arxiv.org/abs/2110.07692v1
- Date: Thu, 14 Oct 2021 20:02:59 GMT
- Title: Shaping embodied agent behavior with activity-context priors from
egocentric video
- Authors: Tushar Nagarajan and Kristen Grauman
- Abstract summary: We introduce an approach to discover activity-context priors from in-the-wild egocentric video captured with human-worn cameras.
We encode our video-based prior as an auxiliary reward function that encourages an agent to bring compatible objects together before attempting an interaction.
We demonstrate our idea using egocentric EPIC-Kitchens video of people performing unscripted kitchen activities to benefit virtual household robotic agents performing various complex tasks in AI2-iTHOR.
- Score: 102.0541532564505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Complex physical tasks entail a sequence of object interactions, each with
its own preconditions -- which can be difficult for robotic agents to learn
efficiently solely through their own experience. We introduce an approach to
discover activity-context priors from in-the-wild egocentric video captured
with human-worn cameras. For a given object, an activity-context prior
represents the set of other compatible objects that are required for activities
to succeed (e.g., a knife and cutting board brought together with a tomato are
conducive to cutting). We encode our video-based prior as an auxiliary reward
function that encourages an agent to bring compatible objects together before
attempting an interaction. In this way, our model translates everyday human
experience into embodied agent skills. We demonstrate our idea using egocentric
EPIC-Kitchens video of people performing unscripted kitchen activities to
benefit virtual household robotic agents performing various complex tasks in
AI2-iTHOR, significantly accelerating agent learning. Project page:
http://vision.cs.utexas.edu/projects/ego-rewards/
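A minimal sketch of how such an auxiliary reward could be computed, assuming the video-derived prior is stored as a table of compatibility scores per object; the function name, data layout, and shaping weight are illustrative placeholders, not the authors' implementation:

```python
# Hypothetical sketch of an activity-context auxiliary reward, assuming the
# prior is a dict mapping an object to compatibility scores for other objects.
from typing import Dict, Set

def activity_context_reward(
    target_obj: str,
    nearby_objs: Set[str],
    prior: Dict[str, Dict[str, float]],
    weight: float = 0.1,  # illustrative shaping coefficient
) -> float:
    """Auxiliary reward that grows as objects compatible with `target_obj`
    (according to the egocentric-video prior) are brought into proximity."""
    scores = prior.get(target_obj, {})
    if not scores:
        return 0.0
    # Sum compatibility scores of compatible objects already nearby,
    # normalized by the total prior mass for this object.
    gathered = sum(scores.get(o, 0.0) for o in nearby_objs)
    return weight * gathered / sum(scores.values())

# Toy prior in the spirit of kitchen-video statistics.
prior = {"tomato": {"knife": 0.6, "cutting board": 0.4}}
r = activity_context_reward("tomato", {"knife"}, prior)  # 0.1 * 0.6 / 1.0 = 0.06
```

In this sketch the shaping term is added to the task reward, so bringing a knife and cutting board near the tomato is rewarded before the cutting interaction is attempted.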
Related papers
- A Backpack Full of Skills: Egocentric Video Understanding with Diverse
Task Perspectives [5.515192437680944]
We seek a unified approach to video understanding that combines shared temporal modelling of human actions with minimal overhead.
We propose EgoPack, a solution that creates a collection of task perspectives that can be carried across downstream tasks and used as a potential source of additional insights.
We demonstrate the effectiveness and efficiency of our approach on four Ego4D benchmarks, outperforming current state-of-the-art methods.
arXiv Detail & Related papers (2024-03-05T15:18:02Z) - Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with their environments in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy-learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z) - Egocentric Video Task Translation [109.30649877677257]
We propose EgoTask Translation (EgoT2), which takes a collection of models optimized on separate tasks and learns to translate their outputs for improved performance on any or all of them at once.
Unlike traditional transfer or multi-task learning, EgoT2's flipped design entails separate task-specific backbones and a task translator shared across all tasks, which captures synergies between even heterogeneous tasks and mitigates task competition.
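A rough sketch of this flipped design, assuming each frozen task-specific backbone maps a clip feature to one task token and a shared translator fuses the tokens before per-task heads; all names, dimensions, and the transformer-based fusion are illustrative assumptions rather than the paper's architecture:

```python
# Illustrative sketch: separate task backbones, one shared task translator.
import torch
import torch.nn as nn

class TaskTranslator(nn.Module):
    def __init__(self, backbones: nn.ModuleList, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbones = backbones  # one frozen model per source task
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True)
        self.translator = nn.TransformerEncoder(layer, num_layers=2)  # shared across tasks
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in range(len(backbones))]
        )

    def forward(self, clip_feat: torch.Tensor):
        # One token per task from its backbone; the shared translator lets
        # tasks exchange information before task-specific heads predict.
        tokens = torch.stack([b(clip_feat) for b in self.backbones], dim=1)
        fused = self.translator(tokens)
        return [head(fused[:, i]) for i, head in enumerate(self.heads)]

# Toy usage: three frozen "backbones" over precomputed 128-d clip features.
backbones = nn.ModuleList([nn.Linear(128, 128) for _ in range(3)])
model = TaskTranslator(backbones, feat_dim=128, num_classes=10)
outputs = model(torch.rand(4, 128))  # list of three (4, 10) task predictions
```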
arXiv Detail & Related papers (2022-12-13T00:47:13Z) - Learning Reward Functions for Robotic Manipulation by Observing Humans [92.30657414416527]
We use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-agnostic reward function for robotic manipulation policies.
The learned rewards are based on distances to a goal in an embedding space learned using a time-contrastive objective.
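An illustrative sketch of such a distance-based reward, where `encoder` stands in for a visual network trained with a time-contrastive objective; the architecture and image size are placeholders, not the paper's model:

```python
# Sketch: reward is the negative distance to a goal image in embedding space.
import torch
import torch.nn as nn

encoder = nn.Sequential(  # stand-in for the learned visual encoder
    nn.Flatten(), nn.Linear(3 * 64 * 64, 128), nn.ReLU(), nn.Linear(128, 32)
)

def embedding_reward(obs: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
    """Negative Euclidean distance between observation and goal embeddings."""
    with torch.no_grad():
        z_obs, z_goal = encoder(obs), encoder(goal)
    return -torch.norm(z_obs - z_goal, dim=-1)

# Example: a batch of 64x64 RGB observations scored against one goal image.
obs = torch.rand(4, 3, 64, 64)
goal = torch.rand(1, 3, 64, 64)
print(embedding_reward(obs, goal))  # shape (4,), higher means closer to goal
```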
arXiv Detail & Related papers (2022-11-16T16:26:48Z) - Learning Object Manipulation Skills from Video via Approximate
Differentiable Physics [27.923004421974156]
We teach robots to perform simple object manipulation tasks by watching a single video demonstration.
A differentiable scene ensures perceptual fidelity between the 3D scene and the 2D video.
We evaluate our approach on a 3D reconstruction task that consists of 54 video demonstrations.
arXiv Detail & Related papers (2022-08-03T10:21:47Z) - Creating Multimodal Interactive Agents with Imitation and
Self-Supervised Learning [20.02604302565522]
A common vision from science fiction is that robots will one day inhabit our physical spaces, sense the world as we do, assist our physical labours, and communicate with us through natural language.
Here we study how to design artificial agents that can interact naturally with humans, using a virtual environment as a simplification.
We show that imitation learning of human-human interactions in a simulated world, in conjunction with self-supervised learning, is sufficient to produce a multimodal interactive agent, which we call MIA, that successfully interacts with non-adversarial humans 75% of the time.
arXiv Detail & Related papers (2021-12-07T15:17:27Z) - Learning Visually Guided Latent Actions for Assistive Teleoperation [9.75385535829762]
We develop assistive robots that condition their latent embeddings on visual inputs.
We show that incorporating object detectors pretrained on small amounts of cheap, easy-to-collect structured data enables i) accurately recognizing the current context and ii) generalizing control embeddings to new objects and tasks.
arXiv Detail & Related papers (2021-05-02T23:58:28Z) - Learning Generalizable Robotic Reward Functions from "In-The-Wild" Human
Videos [59.58105314783289]
Domain-agnostic Video Discriminator (DVD) learns multitask reward functions by training a discriminator to classify whether two videos are performing the same task.
DVD can generalize by virtue of learning from a small amount of robot data with a broad dataset of human videos.
DVD can be combined with visual model predictive control to solve robotic manipulation tasks on a real WidowX200 robot in an unseen environment from a single human demo.
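A hypothetical sketch in the spirit of this idea: a discriminator scores whether a robot rollout and a human demonstration depict the same task, and that score serves as the reward. The module structure and feature dimensions below are illustrative assumptions, not the released DVD model:

```python
# Sketch: same-task discriminator used as a reward for candidate robot videos.
import torch
import torch.nn as nn

class SameTaskDiscriminator(nn.Module):
    """Toy stand-in: pools precomputed frame features for each video and
    scores whether the pair depicts the same task."""
    def __init__(self, feat_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, vid_a: torch.Tensor, vid_b: torch.Tensor) -> torch.Tensor:
        # vid_*: (batch, frames, feat_dim) precomputed frame features
        pooled = torch.cat([vid_a.mean(dim=1), vid_b.mean(dim=1)], dim=-1)
        return torch.sigmoid(self.head(pooled)).squeeze(-1)

disc = SameTaskDiscriminator()
robot_video = torch.rand(1, 16, 512)   # candidate rollout (frame features)
human_demo = torch.rand(1, 16, 512)    # single human demonstration
reward = disc(robot_video, human_demo)  # probability-shaped reward in [0, 1]
```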
arXiv Detail & Related papers (2021-03-31T05:25:05Z) - The MECCANO Dataset: Understanding Human-Object Interactions from
Egocentric Videos in an Industrial-like Domain [20.99718135562034]
We introduce MECCANO, the first dataset of egocentric videos to study human-object interactions in industrial-like settings.
The dataset has been explicitly labeled for the task of recognizing human-object interactions from an egocentric perspective.
Baseline results show that the MECCANO dataset is a challenging benchmark to study egocentric human-object interactions in industrial-like scenarios.
arXiv Detail & Related papers (2020-10-12T12:50:30Z) - Learning Affordance Landscapes for Interaction Exploration in 3D
Environments [101.90004767771897]
Embodied agents must be able to master how their environment works.
We introduce a reinforcement learning approach for interaction exploration.
We demonstrate our idea with AI2-iTHOR.
arXiv Detail & Related papers (2020-08-21T00:29:36Z)