LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
- URL: http://arxiv.org/abs/2602.21531v1
- Date: Wed, 25 Feb 2026 03:33:39 GMT
- Title: LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
- Authors: Yue Yang, Shuo Cheng, Yu Fang, Homanga Bharadhwaj, Mingyu Ding, Gedas Bertasius, Daniel Szafir,
- Abstract summary: LiLo-VLA is a modular framework capable of zero-shot generalization to novel long-horizon tasks without ever being trained on them. We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long. In these simulations, LiLo-VLA achieves a 69% average success rate, outperforming Pi0.5 by 41% and OpenVLA-OFT by 67%.
- Score: 54.150202739999806
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: General-purpose robots must master long-horizon manipulation, defined as tasks involving multiple kinematic structure changes (e.g., attaching or detaching objects) in unstructured environments. While Vision-Language-Action (VLA) models offer the potential to master diverse atomic skills, they struggle with the combinatorial complexity of sequencing them and are prone to cascading failures due to environmental sensitivity. To address these challenges, we propose LiLo-VLA (Linked Local VLA), a modular framework capable of zero-shot generalization to novel long-horizon tasks without ever being trained on them. Our approach decouples transport from interaction: a Reaching Module handles global motion, while an Interaction Module employs an object-centric VLA to process isolated objects of interest, ensuring robustness against irrelevant visual features and invariance to spatial configurations. Crucially, this modularity facilitates robust failure recovery through dynamic replanning and skill reuse, effectively mitigating the cascading errors common in end-to-end approaches. We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long. In these simulations, LiLo-VLA achieves a 69% average success rate, outperforming Pi0.5 by 41% and OpenVLA-OFT by 67%. Furthermore, real-world evaluations across 8 long-horizon tasks demonstrate an average success rate of 85%. Project page: https://yy-gx.github.io/LiLo-VLA/.
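The abstract's core design, decoupling a Reaching Module (global transport) from an object-centric Interaction Module and chaining them per subtask with replanning on failure, can be pictured with the minimal Python sketch below. All class names, method signatures, and the retry-based recovery logic are illustrative assumptions for exposition, not the released LiLo-VLA interface.

```python
# Minimal sketch of the modular control loop described in the abstract.
# Every name and interface here is a hypothetical stand-in; the paper does
# not publish this API.
from dataclasses import dataclass
from typing import List


@dataclass
class Subtask:
    """One atomic skill in a long-horizon plan (e.g., 'pick mug', 'open drawer')."""
    target_object: str
    skill: str


class ReachingModule:
    """Handles global transport: moves the end-effector near the target object."""
    def reach(self, observation: dict, target_object: str) -> bool:
        # Placeholder for a global motion policy; returns True on success.
        return target_object in observation.get("visible_objects", [])


class InteractionModule:
    """Object-centric VLA: acts on an isolated view of the object of interest,
    making it insensitive to irrelevant scene content and spatial layout."""
    def interact(self, observation: dict, subtask: Subtask) -> bool:
        # Placeholder for a local, object-centric policy rollout.
        return observation.get("gripper_near_target", False)


def run_long_horizon_task(plan: List[Subtask], get_observation, max_retries: int = 2) -> bool:
    """Execute a long-horizon plan by chaining reach -> interact per subtask,
    retrying (a stand-in for dynamic replanning and skill reuse) on failure."""
    reacher, interactor = ReachingModule(), InteractionModule()
    for subtask in plan:
        for _attempt in range(max_retries + 1):
            obs = get_observation()
            if reacher.reach(obs, subtask.target_object) and \
               interactor.interact(get_observation(), subtask):
                break  # subtask done, move to the next one
        else:
            return False  # retries exhausted: abort instead of cascading errors
    return True
```

The point of the sketch is the compositional structure: each atomic skill is handled by the same two modules regardless of where it occurs in the sequence, so novel task orderings require no additional training.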
Related papers
- VLNVerse: A Benchmark for Vision-Language Navigation with Versatile, Embodied, Realistic Simulation and Evaluation [61.82502719679122]
We introduce VLNVerse, a benchmark for Versatile, Embodied, Realistic Simulation, and Evaluation. VLNVerse redefines VLN as a scalable, full-stack embodied AI problem. We propose a novel unified multi-task model capable of addressing all tasks within the benchmark.
arXiv Detail & Related papers (2025-12-22T04:27:26Z) - HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies [83.41714103649751]
The development of embodied intelligence models depends on access to high-quality robot demonstration data. We present HiMoE-VLA, a novel vision-language-action framework tailored to handle diverse, heterogeneous robotic data. HiMoE-VLA demonstrates a consistent performance boost over existing VLA baselines, achieving higher accuracy and more robust generalization.
arXiv Detail & Related papers (2025-12-05T13:21:05Z) - EvoVLA: Self-Evolving Vision-Language-Action Model [11.746804244345613]
Long-horizon robotic manipulation remains challenging for Vision-Language-Action (VLA) models. We present EvoVLA, a self-supervised VLA framework that addresses this issue through three complementary components. EvoVLA achieves one-and-a-half times better sample efficiency and reduces stage hallucination from 38.5 percent to 14.8 percent.
arXiv Detail & Related papers (2025-11-20T09:08:33Z) - LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks [31.3295171851909]
Real-world embodied agents face high-level goals demanding multi-step solutions. Long-horizon tasks require high-level task planning and low-level motion control. We introduce a new unified vision-language-action framework for long-horizon tasks, dubbed LoHoVLA.
arXiv Detail & Related papers (2025-05-31T06:01:03Z) - ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation [62.58034332427291]
ForceVLA is a novel end-to-end manipulation framework. It treats external force sensing as a first-class modality within VLA systems.
arXiv Detail & Related papers (2025-05-28T09:24:25Z) - GRAPE: Generalizing Robot Policy via Preference Alignment [58.419992317452376]
We present GRAPE: Generalizing Robot Policy via Preference Alignment. We show GRAPE increases success rates on in-domain and unseen manipulation tasks by 51.79% and 58.20%, respectively. GRAPE can be aligned with various objectives, such as safety and efficiency, reducing collision rates by 37.44% and rollout step-length by 11.15%, respectively.
arXiv Detail & Related papers (2024-11-28T18:30:10Z) - OpenVLA: An Open-Source Vision-Language-Action Model [131.74098076670103]
We introduce OpenVLA, an open-source VLA trained on a diverse collection of 970k real-world robot demonstrations.
OpenVLA shows strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate.
We release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.
arXiv Detail & Related papers (2024-06-13T15:46:55Z)