Bridging Embodiment Gaps: Deploying Vision-Language-Action Models on Soft Robots
- URL: http://arxiv.org/abs/2510.17369v1
- Date: Mon, 20 Oct 2025 10:06:39 GMT
- Title: Bridging Embodiment Gaps: Deploying Vision-Language-Action Models on Soft Robots
- Authors: Haochen Su, Cristian Meo, Francesco Stella, Andrea Peirone, Kai Junge, Josie Hughes
- Abstract summary: Vision-Language-Action (VLA) models have been proposed as a language-guided, generalized control framework for real robots. We present the deployment of a VLA model on a soft continuum manipulator to demonstrate autonomous, safe human-robot interaction.
- Score: 5.993870098970107
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Robotic systems are increasingly expected to operate in human-centered, unstructured environments where safety, adaptability, and generalization are essential. Vision-Language-Action (VLA) models have been proposed as a language-guided, generalized control framework for real robots. However, their deployment has been limited to conventional serial-link manipulators, whose rigidity, combined with the unpredictability of learning-based control, leaves them without the ability to interact safely with the environment, a capability that is critical in human-shared settings. In this work, we present the deployment of a VLA model on a soft continuum manipulator to demonstrate autonomous, safe human-robot interaction. We present a structured finetuning and deployment pipeline evaluating two state-of-the-art VLA models (OpenVLA-OFT and $\pi_0$) across representative manipulation tasks, and show that while out-of-the-box policies fail due to embodiment mismatch, with targeted finetuning the soft robot performs on par with its rigid counterpart. Our findings highlight the necessity of finetuning for bridging embodiment gaps, and demonstrate that coupling VLA models with soft robots enables safe and flexible embodied AI in human-shared environments.
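Neither the abstract nor this listing includes code, but the finetuning step it describes is essentially behavior cloning on embodiment-specific demonstrations. The sketch below is a minimal, hypothetical illustration in PyTorch: `TinyVLAPolicy`, the frozen backbone, and the synthetic demos are stand-ins, not the paper's actual OpenVLA-OFT or $\pi_0$ setup.

```python
# Minimal behavior-cloning sketch of embodiment-specific finetuning.
# TinyVLAPolicy is a hypothetical stand-in, NOT OpenVLA-OFT or pi_0:
# a real pipeline would load a pretrained VLA checkpoint and finetune
# it on teleoperated demonstrations from the soft continuum arm.
import torch
import torch.nn as nn

class TinyVLAPolicy(nn.Module):
    def __init__(self, obs_dim=512, act_dim=7):
        super().__init__()
        self.backbone = nn.Linear(obs_dim, 256)      # "pretrained", kept frozen
        self.action_head = nn.Linear(256, act_dim)   # adapted to the new embodiment

    def forward(self, obs):
        return self.action_head(torch.relu(self.backbone(obs)))

policy = TinyVLAPolicy()
for p in policy.backbone.parameters():               # freeze shared representation
    p.requires_grad_(False)
opt = torch.optim.AdamW(policy.action_head.parameters(), lr=1e-4)

# Synthetic (observation, expert action) demos standing in for real ones.
obs, act = torch.randn(64, 512), torch.randn(64, 7)

for step in range(100):
    loss = nn.functional.mse_loss(policy(obs), act)  # imitate demonstrations
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Freezing the pretrained representation and adapting only the action layers is one common way to bridge an action-space mismatch; the paper's actual recipe may differ.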
Related papers
- IROSA: Interactive Robot Skill Adaptation using Natural Language [9.66356526923778]
We present a novel framework that enables open-vocabulary skill adaptation through a tool-based architecture. We demonstrate the framework on a 7-DoF torque-controlled robot performing an industrial bearing ring insertion task.
arXiv Detail & Related papers (2026-03-04T09:54:09Z)
- MiVLA: Towards Generalizable Vision-Language-Action Model with Human-Robot Mutual Imitation Pre-training [102.850162490626]
We propose MiVLA, a vision-language-action model empowered by human-robot mutual imitation pre-training. We show that MiVLA achieves strongly improved generalization capability, outperforming state-of-the-art VLAs.
arXiv Detail & Related papers (2025-12-17T12:59:41Z)
- Mechanistic Finetuning of Vision-Language-Action Models via Few-Shot Demonstrations [76.79742393097358]
Vision-Language-Action (VLA) models promise to extend the remarkable success of vision-language models (VLMs) to robotics. Existing fine-tuning methods lack specificity, adapting the same set of parameters regardless of a task's visual, linguistic, and physical characteristics. Inspired by functional specificity in neuroscience, we hypothesize that it is more effective to finetune sparse model representations specific to a given task (a toy sketch of this idea follows this entry).
arXiv Detail & Related papers (2025-11-27T18:50:21Z)
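As a loose illustration of finetuning only sparse, task-specific parameters, the sketch below scores weights by gradient magnitude on a few demonstrations and masks updates to everything else. The toy MLP, the 5% budget, and the scoring rule are assumptions for illustration, not the paper's mechanism.

```python
# Hedged sketch of sparse finetuning: rank parameters by gradient
# magnitude on a few demos, then update only the top fraction.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 7))
x, y = torch.randn(16, 32), torch.randn(16, 7)    # few-shot demos (synthetic)

# 1. Score every parameter by gradient magnitude on the demos.
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
masks = {}
for name, p in model.named_parameters():
    k = max(1, int(0.05 * p.numel()))                    # keep the top 5%
    thresh = p.grad.abs().flatten().topk(k).values.min()
    masks[name] = (p.grad.abs() >= thresh).float()
model.zero_grad()

# 2. Finetune, zeroing gradient entries outside each task-specific mask.
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
for _ in range(50):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    for name, p in model.named_parameters():
        p.grad.mul_(masks[name])                         # sparse update only
    opt.step()
```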
- RobotArena $\infty$: Scalable Robot Benchmarking via Real-to-Sim Translation [47.79800816696372]
Real-world testing of manipulation policies is labor-intensive at scale and difficult to reproduce. Existing simulation benchmarks are similarly limited, as they train and test policies within the same synthetic domains. In this paper, we introduce a new benchmarking framework that overcomes these challenges by shifting VLA evaluation into large-scale simulated augmented environments.
arXiv Detail & Related papers (2025-10-27T17:41:38Z)
- Mechanistic interpretability for steering vision-language-action models [0.23371356738437823]
Vision-Language-Action (VLA) models are a promising path to realizing generalist embodied agents. We introduce the first framework for interpreting and steering VLAs via their internal representations, including a general-purpose activation steering method that modulates behavior in real time, without fine-tuning, reward signals, or environment interaction (a minimal sketch follows this entry).
arXiv Detail & Related papers (2025-08-30T03:01:57Z)
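Activation steering in the general sense is easy to sketch: add a fixed direction to one layer's hidden activations at inference time via a forward hook, with no finetuning. The model, layer choice, and steering vector below are hypothetical; the paper derives its directions from the VLA's internal representations.

```python
# Minimal activation-steering sketch: shift one layer's activations
# with a fixed vector at inference time, no weight updates.
import torch
import torch.nn as nn

# Hypothetical stand-in policy; a real VLA would be steered at a
# chosen transformer layer rather than this toy MLP.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 7))

steer = torch.zeros(32)
steer[0] = 2.0   # invented direction, e.g. one correlated with cautious motion

def add_steering(module, inputs, output):
    return output + steer            # returning a value replaces the output

handle = model[0].register_forward_hook(add_steering)   # hook the first layer
with torch.no_grad():
    action = model(torch.randn(1, 16))                  # steered forward pass
handle.remove()                                         # restore the raw policy
```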
- Human-assisted Robotic Policy Refinement via Action Preference Optimization [26.144183856600687]
Action Preference Optimization (APO) is a method designed to refine Vision-Language-Action (VLA) models by human-assisted preference alignment. APO proposes an adaptive reweighting algorithm with binary desirability signals derived from interaction (a toy reweighting sketch follows this entry). Experiments conducted in simulation and real-world scenarios prove superior generalization and robustness.
arXiv Detail & Related papers (2025-06-08T13:14:18Z)
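A toy version of learning from binary desirability signals: imitate actions flagged as good and apply a bounded hinge that pushes predictions away from actions flagged as bad. The loss weights and margin below are invented for illustration and are not APO's adaptive reweighting algorithm.

```python
# Toy preference-style reweighting from binary desirability labels.
# Not APO itself: the imitate/repel weighting and margin are assumed.
import torch
import torch.nn as nn

policy = nn.Linear(32, 7)                         # hypothetical action head
obs, act = torch.randn(64, 32), torch.randn(64, 7)
good = torch.randint(0, 2, (64,)).bool()          # binary desirability labels

opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
for _ in range(100):
    err = ((policy(obs) - act) ** 2).mean(dim=1)  # per-sample action error
    imitate = err[good].mean()                    # pull toward desirable actions
    repel = torch.relu(1.0 - err[~good]).mean()   # keep a margin from bad ones
    loss = imitate + 0.1 * repel                  # invented weighting
    opt.zero_grad()
    loss.backward()
    opt.step()
```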
- ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation [62.58034332427291]
ForceVLA is a novel end-to-end manipulation framework that treats external force sensing as a first-class modality within VLA systems.
arXiv Detail & Related papers (2025-05-28T09:24:25Z)
- Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics [68.36528819227641]
This paper systematically evaluates the robustness of Vision-Language-Action (VLA) models. We introduce two untargeted attack objectives that leverage spatial foundations to destabilize robotic actions, and a targeted attack objective that manipulates the robotic trajectory. We design an adversarial patch generation approach that places a small, colorful patch within the camera's view, effectively executing the attack in both digital and physical environments (a toy patch-optimization sketch follows this entry).
arXiv Detail & Related papers (2024-11-18T01:52:20Z)
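The patch attack idea can be sketched directly: optimize the pixels of a small patch, pasted at a fixed image location, to maximize the deviation of the predicted action from its clean value. The stand-in policy, patch size, and objective below are assumptions; the paper's untargeted objectives attack spatial foundations specifically.

```python
# Toy adversarial-patch optimization against a visual policy.
# Everything here (policy, patch placement, objective) is illustrative.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 7))  # stand-in
img = torch.rand(1, 3, 32, 32)                    # hypothetical camera frame
with torch.no_grad():
    clean_action = policy(img)                    # action on the clean frame

patch = torch.rand(3, 8, 8, requires_grad=True)   # optimized patch pixels
opt = torch.optim.Adam([patch], lr=0.05)
for _ in range(200):
    adv = img.clone()
    adv[:, :, :8, :8] = patch.clamp(0, 1)         # paste patch, keep valid pixels
    deviation = (policy(adv) - clean_action).pow(2).mean()
    loss = -deviation                             # maximize action deviation
    opt.zero_grad()
    loss.backward()
    opt.step()
```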
- GRAPPA: Generalizing and Adapting Robot Policies via Online Agentic Guidance [15.774237279917594]
We propose an agentic framework for robot self-guidance and self-improvement. Our framework iteratively grounds a base robot policy to relevant objects in the environment. We demonstrate that our approach can effectively guide manipulation policies to achieve significantly higher success rates.
arXiv Detail & Related papers (2024-10-09T02:00:37Z)
- Robotic Control via Embodied Chain-of-Thought Reasoning [86.6680905262442]
A key limitation of learned robot control policies is their inability to generalize outside their training data. Recent works on vision-language-action models (VLAs) have shown that the use of large, internet pre-trained vision-language models can substantially improve their robustness and generalization ability. We introduce Embodied Chain-of-Thought Reasoning (ECoT) for VLAs, in which we train VLAs to perform multiple steps of reasoning about plans, sub-tasks, motions, and visually grounded features before predicting the robot action (a sketch of this output format follows this entry).
arXiv Detail & Related papers (2024-07-11T17:31:01Z)
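ECoT's key move is a target format in which the model emits plan, sub-task, motion, and grounding text before the action tokens. The template below paraphrases that reasoning-then-action layout; the field names and token strings are approximations, not the paper's exact schema.

```python
# Approximate reasoning-then-action target format in the spirit of ECoT.
# Field names are illustrative; the paper's exact schema may differ.
ECOT_TEMPLATE = """\
TASK: {instruction}
PLAN: {high_level_plan}
SUBTASK: {current_subtask}
MOVE: {motion_primitive}
VISIBLE OBJECTS: {grounded_objects}
ACTION: {action_tokens}"""

def build_ecot_target(instruction, plan, subtask, move, objects, action):
    """Format one training target: reasoning fields first, action last."""
    return ECOT_TEMPLATE.format(
        instruction=instruction, high_level_plan=plan, current_subtask=subtask,
        motion_primitive=move, grounded_objects=objects, action_tokens=action,
    )

print(build_ecot_target(
    "put the carrot in the bowl",
    "pick up the carrot, then place it in the bowl",
    "grasp the carrot",
    "move gripper down and close",
    "carrot, bowl, table",
    "<act_12> <act_87> <act_03> ...",   # placeholder action tokens
))
```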