AnywhereVLA: Language-Conditioned Exploration and Mobile Manipulation
- URL: http://arxiv.org/abs/2509.21006v1
- Date: Thu, 25 Sep 2025 11:04:44 GMT
- Title: AnywhereVLA: Language-Conditioned Exploration and Mobile Manipulation
- Authors: Konstantin Gubernatorov, Artem Voronov, Roman Voronov, Sergei Pasynkov, Stepan Perminov, Ziang Guo, Dzmitry Tsetserukou
- Abstract summary: AnywhereVLA is a modular framework for mobile manipulation. A text prompt serves as an entry point and is parsed into a structured task graph. For interaction, a compact SmolVLA manipulation head is fine-tuned on platform pick-and-place trajectories.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We address natural language pick-and-place in unseen, unpredictable indoor environments with AnywhereVLA, a modular framework for mobile manipulation. A user text prompt serves as an entry point and is parsed into a structured task graph that conditions classical SLAM with LiDAR and cameras, metric-semantic mapping, and a task-aware frontier exploration policy. An approach planner then selects visibility- and reachability-aware pre-grasp base poses. For interaction, a compact SmolVLA manipulation head is fine-tuned on platform pick-and-place trajectories for the SO-101 by TheRobotStudio, grounding local visual context and sub-goals into grasp and place proposals. The full system runs fully onboard on consumer-level hardware, with a Jetson Orin NX for perception and the VLA and an Intel NUC for SLAM, exploration, and control, sustaining real-time operation. We evaluated AnywhereVLA in a multi-room lab under static scenes and normal human motion. In this setting, the system achieves a $46\%$ overall task success rate while maintaining throughput on embedded compute. By combining a classical navigation stack with a fine-tuned VLA manipulation head, the system inherits the reliability of geometry-based navigation while gaining the agility and task generalization of language-conditioned manipulation.
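The abstract outlines a four-stage flow: prompt parsing into a task graph, task-aware frontier exploration over a metric-semantic map, visibility- and reachability-aware pre-grasp base pose selection, and a fine-tuned VLA head for grasping. The sketch below illustrates that flow in plain Python; every class, function, and scoring term in it is a hypothetical stand-in, not the authors' actual interface.

```python
"""Minimal sketch of the AnywhereVLA flow described in the abstract.
Every name, interface, and scoring term below is a hypothetical
stand-in for the described stage, not the authors' actual code."""
from dataclasses import dataclass
import math

@dataclass
class TaskGraph:          # structured form of the user prompt
    target_object: str
    place_target: str

@dataclass(frozen=True)
class BasePose:           # candidate pre-grasp base pose
    x: float
    y: float
    yaw: float

def parse_prompt(prompt: str) -> TaskGraph:
    """Stand-in parser for prompts like 'pick the <obj> ... on the <recv>'."""
    words = prompt.lower().rstrip(".").split()
    return TaskGraph(target_object=words[2], place_target=words[-1])

def explore_until_found(task, frontiers, detections):
    """Task-aware frontier exploration: visit frontiers until the target
    object appears in the metric-semantic map (here, a plain dict)."""
    semantic_map = {}
    for frontier in frontiers:  # a real policy ranks frontiers by task relevance
        semantic_map.update(detections.get(frontier, {}))
        if task.target_object in semantic_map:
            return semantic_map[task.target_object]
    raise RuntimeError("exploration budget exhausted before object was found")

def select_pregrasp_pose(obj_xy, candidates):
    """Approach planning: score candidate base poses with toy visibility
    and reachability terms and keep the best one."""
    def score(p):
        dist = math.hypot(p.x - obj_xy[0], p.y - obj_xy[1])
        reach = -abs(dist - 0.4)                     # prefer ~0.4 m standoff
        heading = math.atan2(obj_xy[1] - p.y, obj_xy[0] - p.x)
        vis = -abs(heading - p.yaw)                  # prefer facing the object
        return reach + vis
    return max(candidates, key=score)

def vla_head(local_context, subgoal):
    """Stand-in for the fine-tuned SmolVLA head: maps local visual context
    plus a sub-goal to a grasp/place proposal (here, a trivial dict)."""
    return {"gripper_xyz": local_context["object_xyz"], "action": subgoal}

if __name__ == "__main__":
    task = parse_prompt("pick the mug and place it on the shelf")
    detections = {"doorway": {}, "kitchen": {"mug": (2.0, 1.0)}}
    obj_xy = explore_until_found(task, ["doorway", "kitchen"], detections)
    pose = select_pregrasp_pose(obj_xy, [BasePose(1.6, 1.0, 0.0),
                                         BasePose(2.0, 2.0, -1.57)])
    grasp = vla_head({"object_xyz": (*obj_xy, 0.8)}, f"grasp {task.target_object}")
    print(task, pose, grasp)
```

The demo at the bottom walks one prompt through all four stages; in the real system the SLAM stack, the exploration policy, and SmolVLA would replace these stubs.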
Related papers
- TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation [70.23578202012048]
Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch. We propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. With the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction.
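A hedged reading of this summary: the agent maintains a topological graph of visited viewpoints, serializes it, interleaved with per-node visual captions, into the VLM prompt, and uses the graph for global reasoning such as backtracking. The prompt format and helpers below are assumptions, not the paper's design.

```python
"""Hedged sketch of the idea named in the TagaVLM summary: a topological
graph of visited nodes is serialized into the prompt so the model can
reason globally. All formats and helpers here are assumptions."""
from collections import deque

def interleaved_prompt(graph, captions, current, instruction):
    """One line per node, interleaving text and caption-level visual info."""
    lines = [f"Instruction: {instruction}", f"Current node: {current}"]
    for node, nbrs in sorted(graph.items()):
        lines.append(f"Node {node} ({captions[node]}) -> {sorted(nbrs)}")
    lines.append("Choose the next node to move to.")
    return "\n".join(lines)

def shortest_path(graph, start, goal):
    """BFS over the topological graph: lets the agent correct its path by
    planning to any previously seen node, not just an adjacent one."""
    queue, parent = deque([start]), {start: None}
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nbr in graph[node]:
            if nbr not in parent:
                parent[nbr] = node
                queue.append(nbr)
    return None

if __name__ == "__main__":
    graph = {0: {1}, 1: {0, 2, 3}, 2: {1}, 3: {1}}
    captions = {0: "entrance", 1: "hallway", 2: "kitchen", 3: "bedroom"}
    print(interleaved_prompt(graph, captions, current=2,
                             instruction="go to the bedroom"))
    print("path correction:", shortest_path(graph, 2, 3))
```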
arXiv Detail & Related papers (2026-03-03T13:28:07Z)
- To Move or Not to Move: Constraint-based Planning Enables Zero-Shot Generalization for Interactive Navigation [14.745622942938532]
In real-world scenarios, such as home environments and warehouses, clutter can block all routes. We introduce the Lifelong Interactive Navigation problem, where a mobile robot can move clutter to forge its own path. We propose an LLM-driven, constraint-based planning framework with active perception.
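The summary suggests a reachability constraint: when no obstacle-free route exists, plan through movable clutter by picking an object whose relocation re-opens a path. The grid world and selection rule below are illustrative assumptions; the paper's planner is LLM-driven with active perception.

```python
"""Hedged sketch: if every route is blocked, return the first movable
object whose removal satisfies the reachability constraint. The grid
and the selection rule are illustrative assumptions."""
from collections import deque

FREE, WALL, CLUTTER = ".", "#", "C"

def reachable(grid, start, goal, ignore=None):
    """BFS; `ignore` is a clutter cell treated as already moved away."""
    rows, cols = len(grid), len(grid[0])
    queue, seen = deque([start]), {start}
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if not (0 <= nr < rows and 0 <= nc < cols) or (nr, nc) in seen:
                continue
            if grid[nr][nc] == FREE or (nr, nc) == ignore:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False

def plan_with_clutter(grid, start, goal):
    """Constraint-based fallback over direct navigation."""
    if reachable(grid, start, goal):
        return "navigate directly"
    clutter = [(r, c) for r, row in enumerate(grid)
               for c, cell in enumerate(row) if cell == CLUTTER]
    for cell in clutter:
        if reachable(grid, start, goal, ignore=cell):
            return f"move clutter at {cell}, then navigate"
    return "no feasible plan"

if __name__ == "__main__":
    grid = ["..#..",
            "..#..",
            "..C..",
            "..#.."]
    print(plan_with_clutter(grid, start=(0, 0), goal=(0, 4)))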
arXiv Detail & Related papers (2026-02-23T17:10:00Z)
- DroneVLA: VLA based Aerial Manipulation [2.1645011609137295]
This work introduces a novel autonomous aerial manipulation system capable of interpreting high-level natural language commands to retrieve objects and deliver them to a human user. The system is intended to integrate MediaPipe, Grounding DINO, and a Vision-Language-Action model with a custom-built drone equipped with a 1-DOF gripper and an Intel RealSense RGB-D camera. We demonstrate the system's efficacy through real-world experiments for localization and navigation, which yielded max, mean Euclidean, and root-mean-squared errors of 0.164 m, 0.070 m, and 0.084 m.
arXiv Detail & Related papers (2026-01-20T10:08:00Z)
- Executable Analytic Concepts as the Missing Link Between VLM Insight and Precise Manipulation [70.8381970762877]
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in semantic reasoning and task planning. We introduce GRACE, a novel framework that grounds VLM-based reasoning through executable analytic concepts. GRACE provides a unified and interpretable interface between high-level instruction understanding and low-level robot control.
arXiv Detail & Related papers (2025-10-09T09:08:33Z)
- SLAM-Free Visual Navigation with Hierarchical Vision-Language Perception and Coarse-to-Fine Semantic Topological Planning [20.12642476619467]
We propose a vision-only, SLAM-free navigation framework for legged robots. A hierarchical vision-language perception module fuses scene-level context with object-level cues for robust semantic inference. Integrated with reinforcement-learning controllers, the framework is deployable across diverse legged robot platforms.
arXiv Detail & Related papers (2025-09-25T04:38:45Z)
- OmniVLA: An Omni-Modal Vision-Language-Action Model for Robot Navigation [49.66156306240961]
We present a training framework for robotic foundation models that enables omni-modal goal conditioning for vision-based navigation. Our approach leverages a high-capacity vision-language-action backbone and trains with three primary goal modalities. We demonstrate that OmniVLA outperforms specialist baselines across modalities and offers a flexible foundation for fine-tuning to new modalities and tasks.
arXiv Detail & Related papers (2025-09-23T18:40:29Z)
- TANGO: Traversability-Aware Navigation with Local Metric Control for Topological Goals [10.69725316052444]
We present a novel RGB-only, object-level topometric navigation pipeline that enables zero-shot, long-horizon robot navigation. Our approach integrates global topological path planning with local metric trajectory control, allowing the robot to navigate towards object-level sub-goals while avoiding obstacles. We demonstrate the effectiveness of our method in both simulated environments and real-world tests, highlighting its robustness and deployability.
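A hedged sketch of the two-level split named here: the global topological planner emits object-level sub-goals, and a local metric controller tracks each sub-goal while steering away from nearby obstacles. The potential-field controller and its gains below are assumptions, not the paper's method.

```python
"""Hedged sketch: global plan as a list of object-level sub-goals,
tracked by a toy potential-field controller. Gains are assumptions."""
import math

def local_control(pose, subgoal, obstacles, k_att=1.0, k_rep=0.3):
    """One control step: attractive pull to the sub-goal plus a
    short-range repulsive push from each nearby obstacle."""
    px, py = pose
    fx = k_att * (subgoal[0] - px)
    fy = k_att * (subgoal[1] - py)
    for ox, oy in obstacles:
        d = math.hypot(px - ox, py - oy)
        if d < 1.0:                       # only nearby obstacles repel
            fx += k_rep * (px - ox) / (d ** 2 + 1e-6)
            fy += k_rep * (py - oy) / (d ** 2 + 1e-6)
    norm = math.hypot(fx, fy) + 1e-9
    step = 0.1                            # fixed step size [m]
    return px + step * fx / norm, py + step * fy / norm

def follow_subgoals(pose, subgoals, obstacles, tol=0.2, max_steps=500):
    """Pop sub-goals from the global (topological) plan and track each
    one with the local metric controller."""
    for sg in subgoals:
        for _ in range(max_steps):
            if math.hypot(pose[0] - sg[0], pose[1] - sg[1]) < tol:
                break
            pose = local_control(pose, sg, obstacles)
    return pose

if __name__ == "__main__":
    plan = [(1.0, 0.0), (2.0, 1.5)]       # e.g. "door", then "chair"
    print(follow_subgoals((0.0, 0.0), plan, obstacles=[(1.5, 0.8)]))
```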
arXiv Detail & Related papers (2025-09-10T15:43:32Z)
- Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation [65.30763239365928]
We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation. GE integrates policy learning, evaluation, and simulation within a single video-generative framework.
arXiv Detail & Related papers (2025-08-07T17:59:44Z)
- ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation [62.58034332427291]
ForceVLA is a novel end-to-end manipulation framework. It treats external force sensing as a first-class modality within VLA systems.
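Taking the title at face value, a force-aware mixture-of-experts could let the gating network see the 6-axis force/torque reading alongside the fused vision-language feature, so contact events can switch experts. The PyTorch sketch below is a minimal illustration under that assumption; dimensions and wiring are not from the paper.

```python
"""Minimal force-aware MoE sketch; all dimensions are assumptions."""
import torch
import torch.nn as nn

class ForceAwareMoE(nn.Module):
    def __init__(self, feat_dim=256, force_dim=6, n_experts=4, act_dim=7):
        super().__init__()
        # gate and experts all condition on vision-language + force input
        self.gate = nn.Linear(feat_dim + force_dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim + force_dim, 128),
                          nn.ReLU(),
                          nn.Linear(128, act_dim))
            for _ in range(n_experts))

    def forward(self, vl_feat, force):
        x = torch.cat([vl_feat, force], dim=-1)
        weights = torch.softmax(self.gate(x), dim=-1)            # (B, E)
        outs = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, A)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)         # (B, A)

if __name__ == "__main__":
    moe = ForceAwareMoE()
    action = moe(torch.randn(2, 256), torch.randn(2, 6))
    print(action.shape)  # torch.Size([2, 7])
```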
arXiv Detail & Related papers (2025-05-28T09:24:25Z)
- DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-level Control [53.80518003412016]
Building a general-purpose intelligent home-assistant agent skilled in diverse tasks specified by human commands is a long-term goal of embodied AI research.
We study primitive mobile manipulations for embodied agents, i.e., how to navigate and interact based on an instructed verb-noun pair.
We propose DISCO, which features non-trivial advancements in contextualized scene modeling and efficient controls.
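A minimal sketch of the verb-noun interface the summary describes, assuming a dual-level decomposition: navigate to the noun first (coarse), then run the verb's interaction primitive (fine). Primitive names and the scene model are hypothetical.

```python
"""Hedged verb-noun dispatch sketch; primitives are illustrative."""

INTERACTION_PRIMITIVES = {
    "pick": lambda obj: f"close gripper on {obj}",
    "open": lambda obj: f"pull handle of {obj}",
    "put":  lambda obj: f"release above {obj}",
}

def execute(instruction: str, object_locations: dict):
    verb, noun = instruction.lower().split()            # verb-noun pair
    if noun not in object_locations:
        return f"explore: {noun} not yet in the scene model"
    steps = [f"navigate to {noun} at {object_locations[noun]}"]  # coarse
    steps.append(INTERACTION_PRIMITIVES[verb](noun))             # fine
    return steps

if __name__ == "__main__":
    scene = {"fridge": (3.0, 1.0), "mug": (0.5, 2.0)}
    print(execute("open fridge", scene))
    print(execute("pick mug", scene))
```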
arXiv Detail & Related papers (2024-07-20T05:39:28Z)
- Towards Open-World Grasping with Large Vision-Language Models [5.317624228510749]
An open-world grasping system should be able to combine high-level contextual reasoning with low-level physical-geometric reasoning.
We propose OWG, an open-world grasping pipeline that combines vision-language models with segmentation and grasp synthesis models.
We conduct evaluation in cluttered indoor scene datasets to showcase OWG's robustness in grounding from open-ended language.
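The summary names three stages: VLM grounding, segmentation, and grasp synthesis. The pipeline below stubs all three to show the data flow; none of the stub logic reflects the actual models OWG uses.

```python
"""Hedged three-stage grasping pipeline; every stage is a stub."""

def vlm_ground(query: str, labels: list[str]) -> str:
    """Stub VLM grounding: pick the detected label the query mentions."""
    for label in labels:
        if label in query.lower():
            return label
    raise ValueError(f"could not ground '{query}' in {labels}")

def segment(label: str) -> set[tuple[int, int]]:
    """Stub segmenter: return pixel coordinates belonging to the object."""
    canned = {"mug": {(4, 5), (4, 6), (5, 5)}, "bowl": {(9, 2)}}
    return canned[label]

def synthesize_grasps(mask):
    """Stub grasp synthesis: one candidate per mask pixel, ranked by a
    made-up centeredness heuristic."""
    cx = sum(p[0] for p in mask) / len(mask)
    cy = sum(p[1] for p in mask) / len(mask)
    return sorted(mask, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)

def open_world_grasp(query: str):
    label = vlm_ground(query, labels=["mug", "bowl"])
    mask = segment(label)
    best = synthesize_grasps(mask)[0]
    return label, best

if __name__ == "__main__":
    print(open_world_grasp("grasp the mug next to the keyboard"))
```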
arXiv Detail & Related papers (2024-06-26T19:42:08Z)
- Language-EXtended Indoor SLAM (LEXIS): A Versatile System for Real-time Visual Scene Understanding [0.0]
LEXIS is a real-time indoor Simultaneous Localization and Mapping system.
It harnesses the open-vocabulary nature of Large Language Models to create a unified approach to scene understanding and place recognition.
It successfully categorizes rooms with varying layouts and dimensions and outperforms the state-of-the-art (SOTA).
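One plausible reading of open-vocabulary room categorization: embed the object labels observed in a room and each candidate room name in a shared space, then pick the nearest room. The toy bag-of-words embedding below stands in for the CLIP/LLM features a real system like LEXIS would use.

```python
"""Hedged open-vocabulary room categorization sketch; the bag-of-words
embedding is a toy stand-in for learned features."""
import math
from collections import Counter

def embed(words):
    """Toy embedding: a sparse bag-of-words vector."""
    return Counter(words)

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb + 1e-9)

ROOM_PROTOTYPES = {       # open vocabulary: any label list works
    "kitchen": embed(["stove", "sink", "fridge", "mug"]),
    "office":  embed(["desk", "monitor", "keyboard", "chair"]),
    "bedroom": embed(["bed", "pillow", "wardrobe"]),
}

def categorize_room(observed_objects):
    obs = embed(observed_objects)
    return max(ROOM_PROTOTYPES, key=lambda r: cosine(obs, ROOM_PROTOTYPES[r]))

if __name__ == "__main__":
    print(categorize_room(["mug", "sink", "chair"]))         # -> kitchen
    print(categorize_room(["keyboard", "monitor", "desk"]))  # -> office
```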
arXiv Detail & Related papers (2023-09-26T16:50:20Z)