Object-Centric Latent Action Learning
- URL: http://arxiv.org/abs/2502.09680v2
- Date: Thu, 12 Jun 2025 17:21:44 GMT
- Title: Object-Centric Latent Action Learning
- Authors: Albina Klepach, Alexander Nikulin, Ilya Zisman, Denis Tarasov, Alexander Derevyagin, Andrei Polubarov, Nikita Lyubaykin, Vladislav Kurenkov
- Abstract summary: We propose a novel object-centric latent action learning framework that centers on objects rather than pixels. We leverage self-supervised object-centric pretraining to disentangle action-related and distracting dynamics. Our results show that object-centric pretraining mitigates the negative effects of distractors by 50%.
- Score: 70.3173534658611
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Leveraging vast amounts of unlabeled internet video data for embodied AI is currently bottlenecked by the lack of action labels and the presence of action-correlated visual distractors. Although recent latent action policy optimization (LAPO) has shown promise in inferring proxy-action labels from visual observations, its performance degrades significantly when distractors are present. To address this limitation, we propose a novel object-centric latent action learning framework that centers on objects rather than pixels. We leverage self-supervised object-centric pretraining to disentangle action-related and distracting dynamics. This allows LAPO to focus on task-relevant interactions, resulting in more robust proxy-action labels, enabling better imitation learning and efficient adaptation of the agent with just a few action-labeled trajectories. We evaluated our method in eight visually complex tasks across the Distracting Control Suite (DCS) and Distracting MetaWorld (DMW). Our results show that object-centric pretraining mitigates the negative effects of distractors by 50%, as measured by downstream task performance: average return (DCS) and success rate (DMW).
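As a rough illustration of the pipeline the abstract describes, the sketch below shows how object-centric slot representations can feed a LAPO-style inverse/forward dynamics model that infers latent proxy actions, which a policy is then behaviour-cloned against. All module names, dimensions, and the toy slot encoder are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of object-centric latent action learning.
# Module names, dimensions, and the minimal slot encoder are assumptions.
import torch
import torch.nn as nn


class SlotEncoder(nn.Module):
    """Stand-in for a self-supervised object-centric (slot-based) encoder.

    Maps an image observation to K object slots; in the paper this role is
    played by object-centric pretraining that separates action-related from
    distracting dynamics.
    """

    def __init__(self, num_slots: int = 6, slot_dim: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_slots = nn.Linear(64, num_slots * slot_dim)
        self.num_slots, self.slot_dim = num_slots, slot_dim

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(obs)                       # (B, 64)
        return self.to_slots(feats).view(-1, self.num_slots, self.slot_dim)


class LatentActionModel(nn.Module):
    """LAPO-style inverse/forward dynamics operating on slot representations."""

    def __init__(self, num_slots: int = 6, slot_dim: int = 64, latent_dim: int = 8):
        super().__init__()
        flat = num_slots * slot_dim
        self.idm = nn.Sequential(nn.Linear(2 * flat, 256), nn.ReLU(),
                                 nn.Linear(256, latent_dim))   # (s_t, s_t+1) -> z_t
        self.fdm = nn.Sequential(nn.Linear(flat + latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, flat))         # (s_t, z_t) -> s_t+1

    def forward(self, slots_t, slots_tp1):
        s_t, s_tp1 = slots_t.flatten(1), slots_tp1.flatten(1)
        z_t = self.idm(torch.cat([s_t, s_tp1], dim=-1))        # latent proxy action
        pred_tp1 = self.fdm(torch.cat([s_t, z_t], dim=-1))
        recon_loss = nn.functional.mse_loss(pred_tp1, s_tp1)
        return z_t, recon_loss


if __name__ == "__main__":
    encoder, lam = SlotEncoder(), LatentActionModel()
    policy = nn.Sequential(nn.Linear(6 * 64, 256), nn.ReLU(), nn.Linear(256, 8))
    obs_t, obs_tp1 = torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64)

    # 1) infer proxy-action labels from consecutive object-centric states
    z_t, dyn_loss = lam(encoder(obs_t), encoder(obs_tp1))
    # 2) behaviour-clone the latent actions from unlabeled video
    bc_loss = nn.functional.mse_loss(policy(encoder(obs_t).flatten(1)), z_t.detach())
    (dyn_loss + bc_loss).backward()
```

In the actual method the object-centric encoder comes from self-supervised pretraining, and a small set of action-labeled trajectories is later used to map latent actions to real ones; both steps are omitted from this sketch.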
Related papers
- Latent Action Learning Requires Supervision in the Presence of Distractors [40.33684677920241]
We show that real-world videos contain action-correlated distractors that may hinder latent action learning.
We propose LAOM, a simple LAPO modification that improves the quality of latent actions by 8x.
We show that providing supervision with ground-truth actions, as few as 2.5% of the full dataset, during latent action learning improves downstream performance by 4.2x on average.
arXiv Detail & Related papers (2025-02-01T09:35:51Z) - Seamless Detection: Unifying Salient Object Detection and Camouflaged Object Detection [73.85890512959861]
We propose a task-agnostic framework to unify Salient Object Detection (SOD) and Camouflaged Object Detection (COD).
We design a simple yet effective contextual decoder involving the interval-layer and global context, which achieves an inference speed of 67 fps.
Experiments on public SOD and COD datasets demonstrate the superiority of our proposed framework in both supervised and unsupervised settings.
arXiv Detail & Related papers (2024-12-22T03:25:43Z) - OccludeNet: A Causal Journey into Mixed-View Actor-Centric Video Action Recognition under Occlusions [37.79525665359017]
OccludeNet is a large-scale occluded video dataset that includes both real-world and synthetic occlusion scene videos.
We propose a structural causal model for occluded scenes and introduce the Causal Action Recognition framework, which employs backdoor adjustment and counterfactual reasoning.
arXiv Detail & Related papers (2024-11-24T06:10:05Z) - Bidirectional Decoding: Improving Action Chunking via Closed-Loop Resampling [51.38330727868982]
Bidirectional Decoding (BID) is a test-time inference algorithm that bridges action chunking with closed-loop operations.
We show that BID boosts the performance of two state-of-the-art generative policies across seven simulation benchmarks and two real-world tasks.
arXiv Detail & Related papers (2024-08-30T15:39:34Z) - ActionVOS: Actions as Prompts for Video Object Segmentation [22.922260726461477]
ActionVOS aims at segmenting only active objects in egocentric videos using human actions as a key language prompt.
We develop an action-aware labeling module with an efficient action-guided focal loss.
Experiments show that ActionVOS significantly reduces the mis-segmentation of inactive objects.
arXiv Detail & Related papers (2024-07-10T06:57:04Z) - The impact of Compositionality in Zero-shot Multi-label action recognition for Object-based tasks [4.971065912401385]
We propose Dual-VCLIP, a unified approach for zero-shot multi-label action recognition.
Dual-VCLIP enhances VCLIP, a zero-shot action recognition method, with the DualCoOp method for multi-label image classification.
We validate our method on the Charades dataset that includes a majority of object-based actions.
arXiv Detail & Related papers (2024-05-14T15:28:48Z) - AD3: Implicit Action is the Key for World Models to Distinguish the Diverse Visual Distractors [31.565238847407112]
We propose Implicit Action Generator (IAG) to learn the implicit actions of visual distractors.
We present a new algorithm named implicit Action-informed Diverse visual Distractors Distinguisher (AD3).
Our method achieves superior performance on various visual control tasks featuring both heterogeneous and homogeneous distractors.
arXiv Detail & Related papers (2024-03-15T02:46:19Z) - Action-slot: Visual Action-centric Representations for Multi-label Atomic Activity Recognition in Traffic Scenes [23.284478293459856]
Action-slot is a slot attention-based approach that learns visual action-centric representations.
Our key idea is to design action slots that are capable of paying attention to regions where atomic activities occur.
To address the limitation, we collect a synthetic dataset called TACO, which is four times larger than OATS.
arXiv Detail & Related papers (2023-11-29T05:28:05Z) - Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability to localize active objects by learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z) - Sequential Action-Induced Invariant Representation for Reinforcement Learning [1.2046159151610263]
How to accurately learn task-relevant state representations from high-dimensional observations with visual distractions is a challenging problem in visual reinforcement learning.
We propose a Sequential Action-induced invariant Representation (SAR) method, in which the encoder is optimized by an auxiliary learner to only preserve the components that follow the control signals of sequential actions.
arXiv Detail & Related papers (2023-09-22T05:31:55Z) - TACO: Temporal Latent Action-Driven Contrastive Loss for Visual Reinforcement Learning [73.53576440536682]
We introduce TACO: Temporal Action-driven Contrastive Learning, a powerful temporal contrastive learning approach.
TACO simultaneously learns a state and an action representation by optimizing the mutual information between representations of current states paired with action sequences and representations of the corresponding future states.
For online RL, TACO achieves a 40% performance boost after one million environment interaction steps.
arXiv Detail & Related papers (2023-06-22T22:21:53Z) - DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network dubbed DOAD to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z) - Leveraging Action Affinity and Continuity for Semi-supervised Temporal Action Segmentation [24.325716686674042]
We present a semi-supervised learning approach to the temporal action segmentation task.
The goal of the task is to temporally detect and segment actions in long, untrimmed procedural videos.
We propose two novel loss functions for the unlabelled data: an action affinity loss and an action continuity loss.
arXiv Detail & Related papers (2022-07-18T14:52:37Z) - Learning to Refactor Action and Co-occurrence Features for Temporal Action Localization [74.74339878286935]
Action features and co-occurrence features often dominate the actual action content in videos.
We develop a novel auxiliary task by decoupling these two types of features within a video snippet.
We term our method RefactorNet, which first explicitly factorizes the action content and regularizes its co-occurrence features.
arXiv Detail & Related papers (2022-06-23T06:30:08Z) - Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively to produce a representation that emphasizes the novel information in the frame at the current time-stamp.
SRL sharply outperforms existing state-of-the-art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z) - Learning Target Candidate Association to Keep Track of What Not to Track [100.80610986625693]
We propose to keep track of distractor objects in order to continue tracking the target.
To tackle the problem of lacking ground-truth correspondences between distractor objects in visual tracking, we propose a training strategy that combines partial annotations with self-supervision.
Our tracker sets a new state-of-the-art on six benchmarks, achieving an AUC score of 67.2% on LaSOT and a +6.1% absolute gain on the OxUvA long-term dataset.
arXiv Detail & Related papers (2021-03-30T17:58:02Z) - Learning to Represent Action Values as a Hypergraph on the Action Vertices [17.811355496708728]
Action-value estimation is a critical component of reinforcement learning (RL) methods.
We conjecture that leveraging the structure of multi-dimensional action spaces is a key ingredient for learning good representations of action.
We show the effectiveness of our approach on a myriad of domains: illustrative prediction problems under minimal confounding effects, Atari 2600 games, and discretised physical control benchmarks.
arXiv Detail & Related papers (2020-10-28T00:19:13Z) - Self-supervised Video Object Segmentation [76.83567326586162]
The objective of this paper is self-supervised representation learning, with the goal of solving semi-supervised video object segmentation (a.k.a. dense tracking).
We make the following contributions: (i) we propose to improve the existing self-supervised approach, with a simple, yet more effective memory mechanism for long-term correspondence matching; (ii) by augmenting the self-supervised approach with an online adaptation module, our method successfully alleviates tracker drifts caused by spatial-temporal discontinuity; (iv) we demonstrate state-of-the-art results among the self-supervised approaches on DAVIS-2017 and YouTube-VOS.
arXiv Detail & Related papers (2020-06-22T17:55:59Z) - ZSTAD: Zero-Shot Temporal Activity Detection [107.63759089583382]
We propose a novel task setting called zero-shot temporal activity detection (ZSTAD), where activities that have never been seen in training can still be detected.
We design an end-to-end deep network based on R-C3D as the architecture for this solution.
Experiments on both the THUMOS14 and the Charades datasets show promising performance in terms of detecting unseen activities.
arXiv Detail & Related papers (2020-03-12T02:40:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.