AnywhereVLA: Language-Conditioned Exploration and Mobile Manipulation
- URL: http://arxiv.org/abs/2509.21006v1
- Date: Thu, 25 Sep 2025 11:04:44 GMT
- Title: AnywhereVLA: Language-Conditioned Exploration and Mobile Manipulation
- Authors: Konstantin Gubernatorov, Artem Voronov, Roman Voronov, Sergei Pasynkov, Stepan Perminov, Ziang Guo, Dzmitry Tsetserukou
- Abstract summary: AnywhereVLA is a modular framework for mobile manipulation. A text prompt serves as an entry point and is parsed into a structured task graph. For interaction, a compact SmolVLA manipulation head is fine-tuned on platform pick-and-place trajectories.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We address natural language pick-and-place in unseen, unpredictable indoor environments with AnywhereVLA, a modular framework for mobile manipulation. A user text prompt serves as an entry point and is parsed into a structured task graph that conditions classical SLAM with LiDAR and cameras, metric-semantic mapping, and a task-aware frontier exploration policy. An approach planner then selects visibility- and reachability-aware pre-grasp base poses. For interaction, a compact SmolVLA manipulation head is fine-tuned on platform pick-and-place trajectories for the SO-101 by TheRobotStudio, grounding local visual context and sub-goals into grasp and place proposals. The full system runs fully onboard on consumer-level hardware, with a Jetson Orin NX for perception and the VLA and an Intel NUC for SLAM, exploration, and control, sustaining real-time operation. We evaluated AnywhereVLA in a multi-room lab under static scenes and normal human motion. In this setting, the system achieves a $46\%$ overall task success rate while maintaining throughput on embedded compute. By combining a classical navigation stack with a fine-tuned VLA manipulation head, the system inherits the reliability of geometry-based navigation while gaining the agility and task generalization of language-conditioned manipulation.
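The abstract outlines a four-stage flow: prompt parsing into a task graph, task-aware frontier exploration over a metric-semantic map, visibility- and reachability-aware pre-grasp base pose selection, and a fine-tuned VLA head for grasping. The sketch below illustrates that flow in plain Python; every class, function, and scoring term in it is a hypothetical stand-in, not the authors' actual interface.

```python
"""Minimal sketch of the AnywhereVLA flow described in the abstract.
Every name, interface, and scoring term below is a hypothetical
stand-in for the described stage, not the authors' actual code."""
from dataclasses import dataclass
import math

@dataclass
class TaskGraph:          # structured form of the user prompt
    target_object: str
    place_target: str

@dataclass(frozen=True)
class BasePose:           # candidate pre-grasp base pose
    x: float
    y: float
    yaw: float

def parse_prompt(prompt: str) -> TaskGraph:
    """Stand-in parser for prompts like 'pick the <obj> ... on the <recv>'."""
    words = prompt.lower().rstrip(".").split()
    return TaskGraph(target_object=words[2], place_target=words[-1])

def explore_until_found(task, frontiers, detections):
    """Task-aware frontier exploration: visit frontiers until the target
    object appears in the metric-semantic map (here, a plain dict)."""
    semantic_map = {}
    for frontier in frontiers:  # a real policy ranks frontiers by task relevance
        semantic_map.update(detections.get(frontier, {}))
        if task.target_object in semantic_map:
            return semantic_map[task.target_object]
    raise RuntimeError("exploration budget exhausted before object was found")

def select_pregrasp_pose(obj_xy, candidates):
    """Approach planning: score candidate base poses with toy visibility
    and reachability terms and keep the best one."""
    def score(p):
        dist = math.hypot(p.x - obj_xy[0], p.y - obj_xy[1])
        reach = -abs(dist - 0.4)                     # prefer ~0.4 m standoff
        heading = math.atan2(obj_xy[1] - p.y, obj_xy[0] - p.x)
        vis = -abs(heading - p.yaw)                  # prefer facing the object
        return reach + vis
    return max(candidates, key=score)

def vla_head(local_context, subgoal):
    """Stand-in for the fine-tuned SmolVLA head: maps local visual context
    plus a sub-goal to a grasp/place proposal (here, a trivial dict)."""
    return {"gripper_xyz": local_context["object_xyz"], "action": subgoal}

if __name__ == "__main__":
    task = parse_prompt("pick the mug and place it on the shelf")
    detections = {"doorway": {}, "kitchen": {"mug": (2.0, 1.0)}}
    obj_xy = explore_until_found(task, ["doorway", "kitchen"], detections)
    pose = select_pregrasp_pose(obj_xy, [BasePose(1.6, 1.0, 0.0),
                                         BasePose(2.0, 2.0, -1.57)])
    grasp = vla_head({"object_xyz": (*obj_xy, 0.8)}, f"grasp {task.target_object}")
    print(task, pose, grasp)
```

The demo at the bottom walks one prompt through all four stages; in the real system the SLAM stack, the exploration policy, and SmolVLA would replace these stubs.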
Related papers
- TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation [70.23578202012048]
Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch. We propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. With the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction.
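A hedged reading of this summary: the agent maintains a topological graph of visited viewpoints, serializes it, interleaved with per-node visual captions, into the VLM prompt, and uses the graph for global reasoning such as backtracking. The prompt format and helpers below are assumptions, not the paper's design.

```python
"""Hedged sketch of the idea named in the TagaVLM summary: a topological
graph of visited nodes is serialized into the prompt so the model can
reason globally. All formats and helpers here are assumptions."""
from collections import deque

def interleaved_prompt(graph, captions, current, instruction):
    """One line per node, interleaving text and caption-level visual info."""
    lines = [f"Instruction: {instruction}", f"Current node: {current}"]
    for node, nbrs in sorted(graph.items()):
        lines.append(f"Node {node} ({captions[node]}) -> {sorted(nbrs)}")
    lines.append("Choose the next node to move to.")
    return "\n".join(lines)

def shortest_path(graph, start, goal):
    """BFS over the topological graph: lets the agent correct its path by
    planning to any previously seen node, not just an adjacent one."""
    queue, parent = deque([start]), {start: None}
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nbr in graph[node]:
            if nbr not in parent:
                parent[nbr] = node
                queue.append(nbr)
    return None

if __name__ == "__main__":
    graph = {0: {1}, 1: {0, 2, 3}, 2: {1}, 3: {1}}
    captions = {0: "entrance", 1: "hallway", 2: "kitchen", 3: "bedroom"}
    print(interleaved_prompt(graph, captions, current=2,
                             instruction="go to the bedroom"))
    print("path correction:", shortest_path(graph, 2, 3))
```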
arXiv Detail & Related papers (2026-03-03T13:28:07Z)
- To Move or Not to Move: Constraint-based Planning Enables Zero-Shot Generalization for Interactive Navigation [14.745622942938532]
In real-world scenarios, such as home environments and warehouses, clutter can block all routes. We introduce the Lifelong Interactive Navigation problem, where a mobile robot can move clutter to forge its own path. We propose an LLM-driven, constraint-based planning framework with active perception.
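The summary suggests a reachability constraint: when no obstacle-free route exists, plan through movable clutter by picking an object whose relocation re-opens a path. The grid world and selection rule below are illustrative assumptions; the paper's planner is LLM-driven with active perception.

```python
"""Hedged sketch: if every route is blocked, return the first movable
object whose removal satisfies the reachability constraint. The grid
and the selection rule are illustrative assumptions."""
from collections import deque

FREE, WALL, CLUTTER = ".", "#", "C"

def reachable(grid, start, goal, ignore=None):
    """BFS; `ignore` is a clutter cell treated as already moved away."""
    rows, cols = len(grid), len(grid[0])
    queue, seen = deque([start]), {start}
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if not (0 <= nr < rows and 0 <= nc < cols) or (nr, nc) in seen:
                continue
            if grid[nr][nc] == FREE or (nr, nc) == ignore:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False

def plan_with_clutter(grid, start, goal):
    """Constraint-based fallback over direct navigation."""
    if reachable(grid, start, goal):
        return "navigate directly"
    clutter = [(r, c) for r, row in enumerate(grid)
               for c, cell in enumerate(row) if cell == CLUTTER]
    for cell in clutter:
        if reachable(grid, start, goal, ignore=cell):
            return f"move clutter at {cell}, then navigate"
    return "no feasible plan"

if __name__ == "__main__":
    grid = ["..#..",
            "..#..",
            "..C..",
            "..#.."]
    print(plan_with_clutter(grid, start=(0, 0), goal=(0, 4)))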
arXiv Detail & Related papers (2026-02-23T17:10:00Z)
- DroneVLA: VLA based Aerial Manipulation [2.1645011609137295]
This work introduces a novel autonomous aerial manipulation system capable of interpreting high-level natural language commands to retrieve objects and deliver them to a human user. The system is intended to integrate MediaPipe, Grounding DINO, and a Vision-Language-Action model with a custom-built drone equipped with a 1-DOF gripper and an Intel RealSense RGB-D camera. We demonstrate the system's efficacy through real-world experiments for localization and navigation, which yielded max, mean Euclidean, and root-mean-squared errors of 0.164 m, 0.070 m, and 0.084 m.
arXiv Detail & Related papers (2026-01-20T10:08:00Z)
- Executable Analytic Concepts as the Missing Link Between VLM Insight and Precise Manipulation [70.8381970762877]
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in semantic reasoning and task planning. We introduce GRACE, a novel framework that grounds VLM-based reasoning through executable analytic concepts. GRACE provides a unified and interpretable interface between high-level instruction understanding and low-level robot control.
arXiv Detail & Related papers (2025-10-09T09:08:33Z)
- SLAM-Free Visual Navigation with Hierarchical Vision-Language Perception and Coarse-to-Fine Semantic Topological Planning [20.12642476619467]
We propose a vision-only, SLAM-free navigation framework for legged robots. A hierarchical vision-language perception module fuses scene-level context with object-level cues for robust semantic inference. Integrated with reinforcement-learning controllers, the framework is deployable across diverse legged robot platforms.
arXiv Detail & Related papers (2025-09-25T04:38:45Z)
- OmniVLA: An Omni-Modal Vision-Language-Action Model for Robot Navigation [49.66156306240961]
We present a training framework for robotic foundation models that enables omni-modal goal conditioning for vision-based navigation. Our approach leverages a high-capacity vision-language-action backbone and trains with three primary goal modalities. We demonstrate that OmniVLA outperforms specialist baselines across modalities and offers a flexible foundation for fine-tuning to new modalities and tasks.
arXiv Detail & Related papers (2025-09-23T18:40:29Z)
- TANGO: Traversability-Aware Navigation with Local Metric Control for Topological Goals [10.69725316052444]
We present a novel RGB-only, object-level topometric navigation pipeline that enables zero-shot, long-horizon robot navigation. Our approach integrates global topological path planning with local metric trajectory control, allowing the robot to navigate towards object-level sub-goals while avoiding obstacles. We demonstrate the effectiveness of our method in both simulated environments and real-world tests, highlighting its robustness and deployability.
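A hedged sketch of the two-level split named here: the global topological planner emits object-level sub-goals, and a local metric controller tracks each sub-goal while steering away from nearby obstacles. The potential-field controller and its gains below are assumptions, not the paper's method.

```python
"""Hedged sketch: global plan as a list of object-level sub-goals,
tracked by a toy potential-field controller. Gains are assumptions."""
import math

def local_control(pose, subgoal, obstacles, k_att=1.0, k_rep=0.3):
    """One control step: attractive pull to the sub-goal plus a
    short-range repulsive push from each nearby obstacle."""
    px, py = pose
    fx = k_att * (subgoal[0] - px)
    fy = k_att * (subgoal[1] - py)
    for ox, oy in obstacles:
        d = math.hypot(px - ox, py - oy)
        if d < 1.0:                       # only nearby obstacles repel
            fx += k_rep * (px - ox) / (d ** 2 + 1e-6)
            fy += k_rep * (py - oy) / (d ** 2 + 1e-6)
    norm = math.hypot(fx, fy) + 1e-9
    step = 0.1                            # fixed step size [m]
    return px + step * fx / norm, py + step * fy / norm

def follow_subgoals(pose, subgoals, obstacles, tol=0.2, max_steps=500):
    """Pop sub-goals from the global (topological) plan and track each
    one with the local metric controller."""
    for sg in subgoals:
        for _ in range(max_steps):
            if math.hypot(pose[0] - sg[0], pose[1] - sg[1]) < tol:
                break
            pose = local_control(pose, sg, obstacles)
    return pose

if __name__ == "__main__":
    plan = [(1.0, 0.0), (2.0, 1.5)]       # e.g. "door", then "chair"
    print(follow_subgoals((0.0, 0.0), plan, obstacles=[(1.5, 0.8)]))
```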
arXiv Detail & Related papers (2025-09-10T15:43:32Z)
- Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation [65.30763239365928]
We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation. GE integrates policy learning, evaluation, and simulation within a single video-generative framework.
arXiv Detail & Related papers (2025-08-07T17:59:44Z)
- ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation [62.58034332427291]
ForceVLA is a novel end-to-end manipulation framework. It treats external force sensing as a first-class modality within VLA systems.
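Taking the title at face value, a force-aware mixture-of-experts could let the gating network see the 6-axis force/torque reading alongside the fused vision-language feature, so contact events can switch experts. The PyTorch sketch below is a minimal illustration under that assumption; dimensions and wiring are not from the paper.

```python
"""Minimal force-aware MoE sketch; all dimensions are assumptions."""
import torch
import torch.nn as nn

class ForceAwareMoE(nn.Module):
    def __init__(self, feat_dim=256, force_dim=6, n_experts=4, act_dim=7):
        super().__init__()
        # gate and experts all condition on vision-language + force input
        self.gate = nn.Linear(feat_dim + force_dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim + force_dim, 128),
                          nn.ReLU(),
                          nn.Linear(128, act_dim))
            for _ in range(n_experts))

    def forward(self, vl_feat, force):
        x = torch.cat([vl_feat, force], dim=-1)
        weights = torch.softmax(self.gate(x), dim=-1)            # (B, E)
        outs = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, A)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)         # (B, A)

if __name__ == "__main__":
    moe = ForceAwareMoE()
    action = moe(torch.randn(2, 256), torch.randn(2, 6))
    print(action.shape)  # torch.Size([2, 7])
```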
arXiv Detail & Related papers (2025-05-28T09:24:25Z)
- DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-level Control [53.80518003412016]
Building a general-purpose intelligent home-assistant agent skilled in diverse tasks specified by human commands is a long-term goal of embodied AI research.
We study primitive mobile manipulations for embodied agents, i.e., how to navigate and interact based on an instructed verb-noun pair.
We propose DISCO, which features non-trivial advancements in contextualized scene modeling and efficient controls.
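A minimal sketch of the verb-noun interface the summary describes, assuming a dual-level decomposition: navigate to the noun first (coarse), then run the verb's interaction primitive (fine). Primitive names and the scene model are hypothetical.

```python
"""Hedged verb-noun dispatch sketch; primitives are illustrative."""

INTERACTION_PRIMITIVES = {
    "pick": lambda obj: f"close gripper on {obj}",
    "open": lambda obj: f"pull handle of {obj}",
    "put":  lambda obj: f"release above {obj}",
}

def execute(instruction: str, object_locations: dict):
    verb, noun = instruction.lower().split()            # verb-noun pair
    if noun not in object_locations:
        return f"explore: {noun} not yet in the scene model"
    steps = [f"navigate to {noun} at {object_locations[noun]}"]  # coarse
    steps.append(INTERACTION_PRIMITIVES[verb](noun))             # fine
    return steps

if __name__ == "__main__":
    scene = {"fridge": (3.0, 1.0), "mug": (0.5, 2.0)}
    print(execute("open fridge", scene))
    print(execute("pick mug", scene))
```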
arXiv Detail & Related papers (2024-07-20T05:39:28Z)
- Towards Open-World Grasping with Large Vision-Language Models [5.317624228510749]
An open-world grasping system should be able to combine high-level contextual reasoning with low-level physical-geometric reasoning.
We propose OWG, an open-world grasping pipeline that combines vision-language models with segmentation and grasp synthesis models.
We conduct evaluation in cluttered indoor scene datasets to showcase OWG's robustness in grounding from open-ended language.
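The summary names three stages: VLM grounding, segmentation, and grasp synthesis. The pipeline below stubs all three to show the data flow; none of the stub logic reflects the actual models OWG uses.

```python
"""Hedged three-stage grasping pipeline; every stage is a stub."""

def vlm_ground(query: str, labels: list[str]) -> str:
    """Stub VLM grounding: pick the detected label the query mentions."""
    for label in labels:
        if label in query.lower():
            return label
    raise ValueError(f"could not ground '{query}' in {labels}")

def segment(label: str) -> set[tuple[int, int]]:
    """Stub segmenter: return pixel coordinates belonging to the object."""
    canned = {"mug": {(4, 5), (4, 6), (5, 5)}, "bowl": {(9, 2)}}
    return canned[label]

def synthesize_grasps(mask):
    """Stub grasp synthesis: one candidate per mask pixel, ranked by a
    made-up centeredness heuristic."""
    cx = sum(p[0] for p in mask) / len(mask)
    cy = sum(p[1] for p in mask) / len(mask)
    return sorted(mask, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)

def open_world_grasp(query: str):
    label = vlm_ground(query, labels=["mug", "bowl"])
    mask = segment(label)
    best = synthesize_grasps(mask)[0]
    return label, best

if __name__ == "__main__":
    print(open_world_grasp("grasp the mug next to the keyboard"))
```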
arXiv Detail & Related papers (2024-06-26T19:42:08Z)
- Language-EXtended Indoor SLAM (LEXIS): A Versatile System for Real-time Visual Scene Understanding [0.0]
LEXIS is a real-time indoor Simultaneous Localization and Mapping system.
It harnesses the open-vocabulary nature of Large Language Models to create a unified approach to scene understanding and place recognition.
It successfully categorizes rooms with varying layouts and dimensions and outperforms the state-of-the-art (SOTA).
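One plausible reading of open-vocabulary room categorization: embed the object labels observed in a room and each candidate room name in a shared space, then pick the nearest room. The toy bag-of-words embedding below stands in for the CLIP/LLM features a real system like LEXIS would use.

```python
"""Hedged open-vocabulary room categorization sketch; the bag-of-words
embedding is a toy stand-in for learned features."""
import math
from collections import Counter

def embed(words):
    """Toy embedding: a sparse bag-of-words vector."""
    return Counter(words)

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb + 1e-9)

ROOM_PROTOTYPES = {       # open vocabulary: any label list works
    "kitchen": embed(["stove", "sink", "fridge", "mug"]),
    "office":  embed(["desk", "monitor", "keyboard", "chair"]),
    "bedroom": embed(["bed", "pillow", "wardrobe"]),
}

def categorize_room(observed_objects):
    obs = embed(observed_objects)
    return max(ROOM_PROTOTYPES, key=lambda r: cosine(obs, ROOM_PROTOTYPES[r]))

if __name__ == "__main__":
    print(categorize_room(["mug", "sink", "chair"]))         # -> kitchen
    print(categorize_room(["keyboard", "monitor", "desk"]))  # -> office
```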
arXiv Detail & Related papers (2023-09-26T16:50:20Z)