Atomic Action Slicing: Planner-Aligned Options for Generalist VLA Agents
- URL: http://arxiv.org/abs/2512.11584v1
- Date: Fri, 12 Dec 2025 14:14:27 GMT
- Title: Atomic Action Slicing: Planner-Aligned Options for Generalist VLA Agents
- Authors: Stefan Tabakov, Asen Popov, Dimitar Dimitrov, S. Ensiye Kiyamousavi, Vladimir Hristov, Boris Kraychev
- Abstract summary: Current vision-language-action models generalize poorly when tasks require new compositions of skills or objects. We introduce Atomic Action Slicing (AAS), a planner-aligned approach that decomposes long-horizon demonstrations into short, typed atomic actions. AAS produces a validated dataset of 2,124 atomic segments labeled with action type, temporal span, and confidence.
- Score: 2.027211672314502
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current vision-language-action (VLA) models generalize poorly, particularly when tasks require new compositions of skills or objects. We introduce Atomic Action Slicing (AAS), a planner-aligned approach that decomposes long-horizon demonstrations into short, typed atomic actions that are easier for planners to use and policies to learn. Using LIBERO demonstrations, AAS produces a validated dataset of 2,124 atomic segments labeled with action type, temporal span, and confidence. A stronger segmenter (Gemini 2.5 Pro) closely matches planner-defined plans and remains robust under keyframe jitter, while smaller models perform worse on multi-object tasks. Fine-tuning CLIP-RT+ on our atomic dataset improves task success from 94.2% to 95.3% on LIBERO-Goal and 83.8% to 88.8% on LIBERO-Long. We publicly release the GATE-VLAP dataset on HuggingFace (https://huggingface.co/datasets/gate-institute/GATE-VLAP-datasets).
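The abstract describes each atomic segment as carrying an action type, a temporal span, and a confidence. As a rough illustration only, the sketch below shows one way such a record could be represented and used to cut a long demonstration into short clips. The class name, field names, `slice_demo` helper, and `min_conf` threshold are all hypothetical; they are not taken from the paper or from the GATE-VLAP dataset schema.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class AtomicSegment:
    """Hypothetical record for one atomic action (not the released schema)."""
    action_type: str   # typed label, e.g. "pick" or "place" (illustrative values)
    start: int         # first frame index of the span (inclusive)
    end: int           # last frame index of the span (exclusive)
    confidence: float  # segmenter confidence, assumed to lie in [0, 1]


def slice_demo(
    frames: Sequence[T],
    segments: List[AtomicSegment],
    min_conf: float = 0.5,  # assumed threshold, chosen only for this example
) -> List[Tuple[str, List[T]]]:
    """Cut a demonstration into (action_type, frames) clips, dropping low-confidence spans."""
    clips = []
    for seg in segments:
        if seg.confidence < min_conf:
            continue  # skip segments the segmenter was unsure about
        if not (0 <= seg.start < seg.end <= len(frames)):
            raise ValueError(f"span [{seg.start}, {seg.end}) is outside the demonstration")
        clips.append((seg.action_type, list(frames[seg.start:seg.end])))
    return clips


# Toy demonstration: ten "frames" represented by their indices.
demo = list(range(10))
labels = [
    AtomicSegment("pick", 0, 4, 0.9),
    AtomicSegment("place", 4, 10, 0.3),
]
clips = slice_demo(demo, labels)
```

With the default threshold, only the "pick" span survives in this toy example; lowering `min_conf` keeps both spans.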
Related papers
- LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies [54.150202739999806]
LiLo-VLA is a modular framework capable of zero-shot modularity to novel long-horizon tasks without ever being trained on them. We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long. In these simulations, LiLo-VLA achieves a 69% average success rate, outperforming Pi0.5 by 41% and OpenVLA-OFT by 67%.
arXiv Detail & Related papers (2026-02-25T03:33:39Z)
- MagicAgent: Towards Generalized Agent Planning [73.21129030631421]
We present MagicAgent, a series of foundation models specifically designed for generalized agent planning. We introduce a lightweight and scalable synthetic data framework that generates high-quality trajectories across diverse planning tasks. We show that MagicAgent-32B and MagicAgent-30B-A3B achieve superior performance across diverse open-source benchmarks.
arXiv Detail & Related papers (2026-02-22T01:39:16Z)
- Demonstration-Free Robotic Control via LLM Agents [0.0]
We introduce FAEA (Frontier Agent as Embodied Agent), which applies an LLM agent framework directly to embodied manipulation without modification. With privileged environment state access, FAEA achieves success rates of 84.9%, 85.7%, and 96%, respectively. Our results indicate that general-purpose agents are sufficient for a class of manipulation tasks dominated by deliberative, task-level planning.
arXiv Detail & Related papers (2026-01-28T07:49:35Z)
- Stay Unique, Stay Efficient: Preserving Model Personality in Multi-Task Merging [62.61159948488935]
Decomposition, Thresholding, and Scaling (DTS) is an approximation-based personalized merging framework. DTS preserves task-specific information with minimal storage overhead. We extend DTS with a variant that fuses task-specific information in a data-free manner based on the semantic similarity of task characteristics.
arXiv Detail & Related papers (2025-12-01T09:47:17Z)
- ManiAgent: An Agentic Framework for General Robotic Manipulation [30.154478145473792]
We introduce ManiAgent, an agentic architecture for general manipulation tasks. Multiple agents communicate with one another to perform environmental perception, sub-task decomposition, and action generation. ManiAgent achieves an 86.8% success rate on the SimplerEnv benchmark and 95.8% on real-world pick-and-place tasks.
arXiv Detail & Related papers (2025-10-13T17:34:48Z)
- HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy [61.668591984635846]
HAMLET is a framework to adapt Vision-Language-Action models to attend to the historical context during action prediction. We show that HAMLET successfully transforms a state-of-the-art VLA into a history-aware policy. On top of GR00T N1.5, HAMLET achieves an average success rate of 76.4% on history-dependent real-world tasks.
arXiv Detail & Related papers (2025-10-01T09:15:52Z)
- GTA1: GUI Test-time Scaling Agent [97.58177633084915]
Graphical user interface (GUI) agents autonomously complete tasks across platforms (e.g., Linux) by sequentially decomposing user instructions into action proposals. This paper investigates the aforementioned challenges with our GUI Test-time Scaling Agent, namely GTA1.
arXiv Detail & Related papers (2025-07-08T08:52:18Z)
- ProTIP: Progressive Tool Retrieval Improves Planning [14.386337505825228]
We introduce the Progressive Tool retrieval to Improve Planning (ProTIP) framework.
ProTIP implicitly performs task decomposition (TD) without the explicit requirement of subtask labels, while simultaneously maintaining subtask-tool atomicity.
On the ToolBench dataset, ProTIP outperforms the ChatGPT task decomposition-based approach by a remarkable margin.
arXiv Detail & Related papers (2023-12-16T05:43:11Z)
- Annotator: A Generic Active Learning Baseline for LiDAR Semantic Segmentation [40.803251337200656]
Annotator is a general and efficient active learning baseline.
A voxel-centric online selection strategy is tailored to efficiently probe and annotate the salient and exemplar voxel grids within each LiDAR scan.
Annotator excels in diverse settings, with a particular focus on active learning (AL), active source-free domain adaptation (ASFDA), and active domain adaptation (ADA).
arXiv Detail & Related papers (2023-10-31T09:04:39Z)
- Embodied Task Planning with Large Language Models [86.63533340293361]
We propose a TAsk Planning Agent (TaPA) in embodied tasks for grounded planning with physical scene constraints.
During inference, we discover the objects in the scene by extending open-vocabulary object detectors to multi-view RGB images collected in different achievable locations.
Experimental results show that plans generated by our TaPA framework achieve a higher success rate than those of LLaVA and GPT-3.5 by a sizable margin.
arXiv Detail & Related papers (2023-07-04T17:58:25Z)
- MM-SEAL: A Large-scale Video Dataset of Multi-person Multi-grained Spatio-temporally Action Localization [19.721688276051363]
We are the first to propose a new benchmark for multi-person complex activity localization. We observe that limited atomic actions can be combined into many complex activities. MM-SEAL provides both atomic action and complex activity annotations, producing 111.7k atomic actions spanning 172 action categories and 17.7k complex activities spanning 200 activity categories.
arXiv Detail & Related papers (2022-04-06T09:27:52Z)
- Semi-Supervised Few-Shot Atomic Action Recognition [59.587738451616495]
We propose a novel model for semi-supervised few-shot atomic action recognition.
Our model features unsupervised and contrastive video embedding, loose action alignment, multi-head feature comparison, and attention-based aggregation.
Experiments show that our model attains high accuracy on representative atomic action datasets, outperforming their respective state-of-the-art classification accuracy in the fully supervised setting.
arXiv Detail & Related papers (2020-11-17T03:59:05Z)