Related papers: Atomic Action Slicing: Planner-Aligned Options for Generalist VLA Agents

URL: http://arxiv.org/abs/2512.11584v1
Date: Fri, 12 Dec 2025 14:14:27 GMT
Title: Atomic Action Slicing: Planner-Aligned Options for Generalist VLA Agents
Authors: Stefan Tabakov, Asen Popov, Dimitar Dimitrov, S. Ensiye Kiyamousavi, Vladimir Hristov, Boris Kraychev
Abstract summary: Current vision-language-action models generalize poorly when tasks require new compositions of skills or objects. We introduce Atomic Action Slicing (AAS), a planner-aligned approach that decomposes long-horizon demonstrations into short, typed atomic actions. AAS produces a validated dataset of 2,124 atomic segments labeled with action type, temporal span, and confidence.
Score: 2.027211672314502
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current vision-language-action (VLA) models generalize poorly, particularly when tasks require new compositions of skills or objects. We introduce Atomic Action Slicing (AAS), a planner-aligned approach that decomposes long-horizon demonstrations into short, typed atomic actions that are easier for planners to use and policies to learn. Using LIBERO demonstrations, AAS produces a validated dataset of 2,124 atomic segments labeled with action type, temporal span, and confidence. A stronger segmenter (Gemini 2.5 Pro) closely matches planner-defined plans and remains robust under keyframe jitter, while smaller models perform worse on multi-object tasks. Fine-tuning CLIP-RT+ on our atomic dataset improves task success from 94.2% to 95.3% on LIBERO-Goal and 83.8% to 88.8% on LIBERO-Long. We publicly release the GATE-VLAP dataset on HuggingFace (https://huggingface.co/datasets/gate-institute/GATE-VLAP-datasets).
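The abstract describes each atomic segment as carrying an action type, a temporal span, and a confidence. A minimal sketch of how such a record might be represented and checked follows. The class name, field names, action labels, frame indices, and the 0.8 threshold are all hypothetical illustrations and are not taken from the GATE-VLAP dataset schema or from the AAS pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AtomicSegment:
    """Hypothetical record for one atomic action; not the released schema."""
    action_type: str   # e.g. "reach" (illustrative label only)
    start_frame: int   # first frame of the span, inclusive
    end_frame: int     # last frame of the span, inclusive
    confidence: float  # segmenter confidence in [0, 1]

    @property
    def length(self) -> int:
        # Number of frames covered by the inclusive span.
        return self.end_frame - self.start_frame + 1


def filter_confident(segments, threshold=0.8):
    """Keep segments whose confidence meets the (assumed) threshold."""
    return [s for s in segments if s.confidence >= threshold]


def is_contiguous(segments):
    """Check that the spans tile the demonstration with no gaps or overlaps."""
    ordered = sorted(segments, key=lambda s: s.start_frame)
    return all(b.start_frame == a.end_frame + 1
               for a, b in zip(ordered, ordered[1:]))


# Made-up demonstration sliced into three atomic actions.
demo = [
    AtomicSegment("reach", 0, 41, 0.93),
    AtomicSegment("grasp", 42, 60, 0.88),
    AtomicSegment("place", 61, 120, 0.71),
]

confident = filter_confident(demo)
```

With the made-up values above, the low-confidence "place" segment is filtered out while the three spans still tile frames 0 through 120 without gaps.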

Related papers

LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies [54.150202739999806]
LiLo-VLA is a modular framework capable of zero-shot modularity to novel long-horizon tasks without ever being trained on them. We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long. In these simulations, LiLo-VLA achieves a 69% average success rate, outperforming Pi0.5 by 41% and OpenVLA-OFT by 67%.
arXiv Detail & Related papers (2026-02-25T03:33:39Z)
MagicAgent: Towards Generalized Agent Planning [73.21129030631421]
We present MagicAgent, a series of foundation models specifically designed for generalized agent planning. We introduce a lightweight and scalable synthetic data framework that generates high-quality trajectories across diverse planning tasks. We show that MagicAgent-32B and MagicAgent-30B-A3B achieve superior performance across diverse open-source benchmarks.
arXiv Detail & Related papers (2026-02-22T01:39:16Z)
Demonstration-Free Robotic Control via LLM Agents [0.0]
We introduce FAEA (Frontier Agent as Embodied Agent), which applies an LLM agent framework directly to embodied manipulation without modification. With privileged environment state access, FAEA achieves success rates of 84.9%, 85.7%, and 96%, respectively. Our results indicate that general-purpose agents are sufficient for a class of manipulation tasks dominated by deliberative, task-level planning.
arXiv Detail & Related papers (2026-01-28T07:49:35Z)
Stay Unique, Stay Efficient: Preserving Model Personality in Multi-Task Merging [62.61159948488935]
Decomposition, Thresholding, and Scaling (DTS) is an approximation-based personalized merging framework. DTS preserves task-specific information with minimal storage overhead. We extend DTS with a variant that fuses task-specific information in a data-free manner based on the semantic similarity of task characteristics.
arXiv Detail & Related papers (2025-12-01T09:47:17Z)
ManiAgent: An Agentic Framework for General Robotic Manipulation [30.154478145473792]
We introduce ManiAgent, an agentic architecture for general manipulation tasks. Multiple agents communicate with one another to perform environmental perception, sub-task decomposition, and action generation. ManiAgent achieves an 86.8% success rate on the SimplerEnv benchmark and 95.8% on real-world pick-and-place tasks.
arXiv Detail & Related papers (2025-10-13T17:34:48Z)
HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy [61.668591984635846]
HAMLET is a framework to adapt Vision-Language-Action models to attend to the historical context during action prediction. We show that HAMLET successfully transforms a state-of-the-art VLA into a history-aware policy. On top of GR00T N1.5, HAMLET achieves an average success rate of 76.4% on history-dependent real-world tasks.
arXiv Detail & Related papers (2025-10-01T09:15:52Z)
GTA1: GUI Test-time Scaling Agent [97.58177633084915]
Graphical user interface (GUI) agents autonomously complete tasks across platforms (e.g., Linux) by sequentially decomposing user instructions into action proposals. This paper investigates the aforementioned challenges with our GUI Test-time Scaling Agent, namely GTA1.
arXiv Detail & Related papers (2025-07-08T08:52:18Z)
ProTIP: Progressive Tool Retrieval Improves Planning [14.386337505825228]
We introduce the Progressive Tool retrieval to Improve Planning (ProTIP) framework. ProTIP implicitly performs task decomposition (TD) without the explicit requirement of subtask labels, while simultaneously maintaining subtask-tool atomicity. On the ToolBench dataset, ProTIP outperforms the ChatGPT task decomposition-based approach by a remarkable margin.
arXiv Detail & Related papers (2023-12-16T05:43:11Z)
Annotator: A Generic Active Learning Baseline for LiDAR Semantic Segmentation [40.803251337200656]
Annotator is a general and efficient active learning baseline. A voxel-centric online selection strategy is tailored to efficiently probe and annotate the salient and exemplar voxel grids within each LiDAR scan. Annotator excels in diverse settings, with a particular focus on active learning (AL), active source-free domain adaptation (ASFDA), and active domain adaptation (ADA).
arXiv Detail & Related papers (2023-10-31T09:04:39Z)
Embodied Task Planning with Large Language Models [86.63533340293361]
We propose a TAsk Planning Agent (TaPA) in embodied tasks for grounded planning with physical scene constraints. During inference, we discover the objects in the scene by extending open-vocabulary object detectors to multi-view RGB images collected in different achievable locations. Experimental results show that the generated plan from our TaPA framework can achieve a higher success rate than LLaVA and GPT-3.5 by a sizable margin.
arXiv Detail & Related papers (2023-07-04T17:58:25Z)
MM-SEAL: A Large-scale Video Dataset of Multi-person Multi-grained Spatio-temporally Action Localization [19.721688276051363]
We are the first to propose a new benchmark for multi-person complex activity localization. We observe that limited atomic actions can be combined into many complex activities. MM-SEAL provides both atomic action and complex activity annotations, producing 111.7k atomic actions spanning 172 action categories and 17.7k complex activities spanning 200 activity categories.
arXiv Detail & Related papers (2022-04-06T09:27:52Z)
Semi-Supervised Few-Shot Atomic Action Recognition [59.587738451616495]
We propose a novel model for semi-supervised few-shot atomic action recognition. Our model features unsupervised and contrastive video embedding, loose action alignment, multi-head feature comparison, and attention-based aggregation. Experiments show that our model can attain high accuracy on representative atomic action datasets, outperforming the respective state-of-the-art classification accuracy in the fully supervised setting.
arXiv Detail & Related papers (2020-11-17T03:59:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.