Learning Goals from Failure
- URL: http://arxiv.org/abs/2006.15657v2
- Date: Sun, 13 Dec 2020 01:44:08 GMT
- Title: Learning Goals from Failure
- Authors: Dave Epstein and Carl Vondrick
- Abstract summary: We introduce a framework that predicts the goals behind observable human action in video.
Motivated by evidence in developmental psychology, we leverage video of unintentional action to learn video representations of goals without direct supervision.
- Score: 30.071336708348472
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a framework that predicts the goals behind observable human
action in video. Motivated by evidence in developmental psychology, we leverage
video of unintentional action to learn video representations of goals without
direct supervision. Our approach models videos as contextual trajectories that
represent both low-level motion and high-level action features. Experiments and
visualizations show our trained model is able to predict the underlying goals
in video of unintentional action. We also propose a method to "automatically
correct" unintentional action by leveraging gradient signals of our model to
adjust latent trajectories. Although the model is trained with minimal
supervision, it is competitive with or outperforms baselines trained on large
(supervised) datasets of successfully executed goals, showing that observing
unintentional action is crucial to learning about goals in video. Project page:
https://aha.cs.columbia.edu/
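To make the "automatic correction" idea above concrete, here is a minimal Python/PyTorch sketch of gradient-based adjustment of a latent trajectory. The correct_trajectory function, the intentionality_head, and all shapes and hyperparameters are illustrative assumptions, not the paper's model or API; in the paper's setting the gradient signal would come from the trained goal model rather than the toy linear head used here.

# A minimal sketch (assumptions, not the authors' code) of gradient-based
# "automatic correction": nudge a latent video trajectory until a scoring
# head rates it as more intentional. The video encoder is omitted; we start
# from an already-encoded trajectory z of shape (T, D).
import torch
import torch.nn as nn

def correct_trajectory(z, intentionality_head, steps=50, lr=0.1):
    # z: (T, D) latent trajectory; intentionality_head: maps a trajectory to a
    # scalar logit, higher meaning the action looks more goal-consistent.
    z = z.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = -intentionality_head(z).mean()  # ascend the intentionality score
        loss.backward()
        optimizer.step()
    return z.detach()

if __name__ == "__main__":
    # Toy stand-ins: a random 16-step, 128-dim trajectory and a linear head.
    toy_head = nn.Sequential(nn.Flatten(start_dim=0), nn.Linear(16 * 128, 1))
    z_unintentional = torch.randn(16, 128)
    z_corrected = correct_trajectory(z_unintentional, toy_head)
    print(z_corrected.shape)  # torch.Size([16, 128])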
Related papers
- An Empirical Study of Autoregressive Pre-training from Videos [67.15356613065542]
We treat videos as visual tokens and train transformer models to autoregressively predict future tokens.
Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens.
Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance.
arXiv Detail & Related papers (2025-01-09T18:59:58Z)
- Video2Reward: Generating Reward Function from Videos for Legged Robot Behavior Learning [27.233232260388682]
We introduce a new video2reward method, which directly generates reward functions from videos depicting the behaviors to be mimicked and learned.
Our method surpasses the performance of state-of-the-art LLM-based reward generation methods by over 37.6% in terms of human normalized score.
arXiv Detail & Related papers (2024-12-07T03:10:27Z)
- Grounding Video Models to Actions through Goal Conditioned Exploration [29.050431676226115]
We propose a framework that uses trajectory level action generation in combination with video guidance to enable an agent to solve complex tasks.
We show that our approach is on par with, or even surpasses, multiple behavior cloning baselines trained on expert demonstrations.
arXiv Detail & Related papers (2024-11-11T18:43:44Z)
- WANDR: Intention-guided Human Motion Generation [67.07028110459787]
We introduce WANDR, a data-driven model that takes an avatar's initial pose and a goal's 3D position and generates natural human motions that place the end effector (wrist) on the goal location.
Intention guides the agent to the goal, and interactively adapts the generation to novel situations without needing to define sub-goals or the entire motion path.
We evaluate our method extensively and demonstrate its ability to generate natural, long-term motions that reach 3D goals and generalize to unseen goal locations.
arXiv Detail & Related papers (2024-04-23T10:20:17Z)
- REST: REtrieve & Self-Train for generative action recognition [54.90704746573636]
We propose to adapt a pre-trained generative Vision & Language (V&L) Foundation Model for video/action recognition.
We show that direct fine-tuning of a generative model to produce action classes suffers from severe overfitting.
We introduce REST, a training framework consisting of two key components.
arXiv Detail & Related papers (2022-09-29T17:57:01Z)
- Tragedy Plus Time: Capturing Unintended Human Activities from Weakly-labeled Videos [31.1632730473261]
W-Oops consists of 2,100 unintentional human action videos, with 44 goal-directed and 30 unintentional video-level activity labels collected through human annotations.
We propose a weakly supervised algorithm for localizing the goal-directed as well as unintentional temporal regions in the video.
arXiv Detail & Related papers (2022-04-28T14:56:43Z)
- Reinforcement Learning with Action-Free Pre-Training from Videos [95.25074614579646]
We introduce a framework that learns representations useful for understanding the dynamics via generative pre-training on videos.
Our framework significantly improves both final performances and sample-efficiency of vision-based reinforcement learning.
arXiv Detail & Related papers (2022-03-25T19:44:09Z)
- Procedure Planning in Instructional Videos via Contextual Modeling and Model-based Policy Learning [114.1830997893756]
This work focuses on learning a model to plan goal-directed actions in real-life videos.
We propose novel algorithms to model human behaviors through Bayesian Inference and model-based Imitation Learning.
arXiv Detail & Related papers (2021-10-05T01:06:53Z)
- Model-Based Visual Planning with Self-Supervised Functional Distances [104.83979811803466]
We present a self-supervised method for model-based visual goal reaching.
Our approach learns entirely using offline, unlabeled data.
We find that this approach substantially outperforms both model-free and model-based prior methods.
arXiv Detail & Related papers (2020-12-30T23:59:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.