Latent Action Learning Requires Supervision in the Presence of Distractors
- URL: http://arxiv.org/abs/2502.00379v1
- Date: Sat, 01 Feb 2025 09:35:51 GMT
- Title: Latent Action Learning Requires Supervision in the Presence of Distractors
- Authors: Alexander Nikulin, Ilya Zisman, Denis Tarasov, Nikita Lyubaykin, Andrei Polubarov, Igor Kiselev, Vladislav Kurenkov
- Abstract summary: We show that real-world videos contain action-correlated distractors that may hinder latent action learning. We propose LAOM, a simple LAPO modification that improves the quality of latent actions by 8x. We show that providing supervision with ground-truth actions, as few as 2.5% of the full dataset, during latent action learning improves downstream performance by 4.2x on average.
- Score: 40.33684677920241
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, latent action learning, pioneered by Latent Action Policies (LAPO), has shown remarkable pre-training efficiency on observation-only data, offering potential for leveraging vast amounts of video available on the web for embodied AI. However, prior work has focused on distractor-free data, where changes between observations are primarily explained by ground-truth actions. Unfortunately, real-world videos contain action-correlated distractors that may hinder latent action learning. Using the Distracting Control Suite (DCS), we empirically investigate the effect of distractors on latent action learning and demonstrate that LAPO struggles in such scenarios. We propose LAOM, a simple LAPO modification that improves the quality of latent actions by 8x, as measured by linear probing. Importantly, we show that providing supervision with ground-truth actions, as few as 2.5% of the full dataset, during latent action learning improves downstream performance by 4.2x on average. Our findings suggest that integrating supervision during Latent Action Model (LAM) training is critical in the presence of distractors, challenging the conventional pipeline of first learning a LAM and only then decoding from latent to ground-truth actions.
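The supervised variant described in the abstract can be illustrated with a short sketch: an inverse dynamics model infers a latent action from consecutive observations, a forward dynamics model reconstructs the next observation from it, and a small auxiliary head is supervised on the few transitions that carry ground-truth actions. This is a minimal illustration under assumptions, not the authors' implementation: real LAPO/LAOM works on image observations, and the module sizes, loss weighting, and the `labeled` mask convention here are placeholders.

```python
# Minimal sketch of LAPO-style latent action learning with a small amount of
# ground-truth action supervision, as described in the abstract above.
# Module sizes, the loss weighting, and the `labeled` mask are illustrative
# assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentActionModel(nn.Module):
    def __init__(self, obs_dim: int, latent_dim: int, action_dim: int):
        super().__init__()
        # Inverse dynamics model (IDM): infer a latent action from (o_t, o_{t+1}).
        self.idm = nn.Sequential(
            nn.Linear(2 * obs_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim)
        )
        # Forward dynamics model (FDM): predict o_{t+1} from (o_t, latent action).
        self.fdm = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, 256), nn.ReLU(), nn.Linear(256, obs_dim)
        )
        # Auxiliary head mapping latent actions to ground-truth actions,
        # trained only on the small labeled subset (e.g. ~2.5% of the data).
        self.action_head = nn.Linear(latent_dim, action_dim)

    def forward(self, obs, next_obs):
        z = self.idm(torch.cat([obs, next_obs], dim=-1))
        pred_next = self.fdm(torch.cat([obs, z], dim=-1))
        return z, pred_next


def training_step(model, obs, next_obs, actions, labeled, sup_weight=1.0):
    """One gradient step on a batch; `labeled` marks transitions with actions."""
    z, pred_next = model(obs, next_obs)
    recon_loss = F.mse_loss(pred_next, next_obs)  # observation-only objective
    sup_loss = torch.tensor(0.0, device=obs.device)
    if labeled.any():  # supervision on the labeled subset only
        sup_loss = F.mse_loss(model.action_head(z[labeled]), actions[labeled])
    return recon_loss + sup_weight * sup_loss


# Tiny usage example with random data.
model = LatentActionModel(obs_dim=32, latent_dim=8, action_dim=4)
obs, next_obs = torch.randn(64, 32), torch.randn(64, 32)
actions = torch.randn(64, 4)
labeled = torch.rand(64) < 0.025  # ~2.5% of transitions carry ground-truth actions
loss = training_step(model, obs, next_obs, actions, labeled)
loss.backward()
```

A sketch of the linear-probe evaluation used to quantify latent-action quality follows the related-papers list below.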
Related papers
- NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation [34.806610134389366]
NoisyRollout is a reinforcement learning approach that mixes trajectories from both clean and moderately distorted images.
It introduces targeted diversity in visual perception and the resulting reasoning patterns.
NoisyRollout achieves state-of-the-art performance among open-source RL-tuned models on 5 out-of-domain benchmarks.
arXiv Detail & Related papers (2025-04-17T16:10:13Z)
- Fast Adaptation with Behavioral Foundation Models [82.34700481726951]
Unsupervised zero-shot reinforcement learning has emerged as a powerful paradigm for pretraining behavioral foundation models.
Despite promising results, zero-shot policies are often suboptimal due to errors induced by the unsupervised training process.
We propose fast adaptation strategies that search in the low-dimensional task-embedding space of the pre-trained BFM to rapidly improve the performance of its zero-shot policies.
arXiv Detail & Related papers (2025-04-10T16:14:17Z)
- Object-Centric Latent Action Learning [70.3173534658611]
We propose a novel object-centric latent action learning approach, based on VideoSaur and LAPO.
This method effectively disentangles causal agent-object interactions from irrelevant background noise and reduces the performance degradation caused by distractors.
Our preliminary experiments with the Distracting Control Suite show that latent action pretraining based on object decompositions improves the quality of inferred latent actions by 2.7x and the efficiency of downstream fine-tuning with a small set of labeled actions, increasing returns by 2.6x on average.
arXiv Detail & Related papers (2025-02-13T11:27:05Z)
- ACT-JEPA: Joint-Embedding Predictive Architecture Improves Policy Representation Learning [90.41852663775086]
ACT-JEPA is a novel architecture that integrates imitation learning and self-supervised learning. We train a policy to predict action sequences and abstract observation sequences. Our experiments show that ACT-JEPA improves the quality of representations by learning temporal environment dynamics.
arXiv Detail & Related papers (2025-01-24T16:41:41Z)
- Reinforcement Learning from Delayed Observations via World Models [10.298219828693489]
In reinforcement learning settings, agents assume immediate feedback about the effects of their actions after taking them.
In practice, this assumption may not hold true due to physical constraints and can significantly impact the performance of learning algorithms.
We propose leveraging world models, which have shown success in integrating past observations and learning dynamics, to handle observation delays.
arXiv Detail & Related papers (2024-03-18T23:18:27Z)
- Premier-TACO is a Few-Shot Policy Learner: Pretraining Multitask Representation via Temporal Action-Driven Contrastive Loss [61.355272240758]
Premier-TACO is a multitask feature representation learning approach.
It is designed to improve few-shot policy learning efficiency in sequential decision-making tasks.
arXiv Detail & Related papers (2024-02-09T05:04:40Z)
- Learning to Act without Actions [15.244216478886543]
We introduce Latent Action Policies (LAPO), a method for recovering latent action information from videos.
LAPO is the first method able to recover the structure of the true action space just from observed dynamics.
LAPO enables training latent-action policies that can be rapidly fine-tuned into expert-level policies.
arXiv Detail & Related papers (2023-12-17T20:39:54Z)
- ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP).
ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective.
We show that ALP outperforms existing baselines in several downstream perception tasks.
arXiv Detail & Related papers (2023-06-16T21:51:04Z)
- Leveraging Action Affinity and Continuity for Semi-supervised Temporal Action Segmentation [24.325716686674042]
We present a semi-supervised learning approach to the temporal action segmentation task.
The goal of the task is to temporally detect and segment actions in long, untrimmed procedural videos.
We propose two novel loss functions for the unlabelled data: an action affinity loss and an action continuity loss.
arXiv Detail & Related papers (2022-07-18T14:52:37Z)
- TRAIL: Near-Optimal Imitation Learning with Suboptimal Data [100.83688818427915]
We present training objectives that use offline datasets to learn a factored transition model.
Our theoretical analysis shows that the learned latent action space can boost the sample-efficiency of downstream imitation learning.
To learn the latent action space in practice, we propose TRAIL (Transition-Reparametrized Actions for Imitation Learning), an algorithm that learns an energy-based transition model.
arXiv Detail & Related papers (2021-10-27T21:05:00Z)
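For reference, the "8x, as measured by linear probing" claim in the abstract above refers to how well ground-truth actions can be linearly decoded from the inferred latent actions. Below is a minimal sketch of such a probe under common conventions; the exact protocol in the paper may differ, and the synthetic data and array names are purely illustrative.

```python
# Minimal sketch of a linear-probing evaluation for latent actions:
# fit a linear map from inferred latent actions to ground-truth actions and
# report how much of the action variance it explains on held-out data.
# Names and data are illustrative assumptions, not the paper's protocol.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def linear_probe_score(latent_actions: np.ndarray, true_actions: np.ndarray) -> float:
    """R^2 of a linear regression from latent actions to ground-truth actions."""
    z_train, z_test, a_train, a_test = train_test_split(
        latent_actions, true_actions, test_size=0.2, random_state=0
    )
    probe = LinearRegression().fit(z_train, a_train)
    return probe.score(z_test, a_test)


# Usage with synthetic data: latent actions that partially encode true actions.
rng = np.random.default_rng(0)
true_actions = rng.normal(size=(1000, 4))
mixing = rng.normal(size=(4, 8))
latent_actions = true_actions @ mixing + 0.1 * rng.normal(size=(1000, 8))
print(f"linear probe R^2: {linear_probe_score(latent_actions, true_actions):.3f}")
```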