Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms
Abstract Overview
This paper surveys safety in Vision-Language-Action (VLA) models, which unify visual perception, language understanding, and action generation for embodied robotics. It argues that VLA safety differs from both text-only LLM safety and classical robotics safety for four reasons: actions have irreversible physical consequences, the attack surface spans multiple modalities (vision, language, and proprioceptive state), defenses must operate under real-time latency constraints, and errors can compound over long-horizon trajectories. The survey organizes prior work along two parallel timing axes: attack timing (training-time vs. inference-time) and defense timing (training-time vs. inference-time). It reviews training-time attacks (data poisoning, backdoors), inference-time attacks (adversarial patches, jailbreaks, physical interventions), the corresponding defenses, evaluation benchmarks and metrics, and deployment challenges across six real-world domains. It also provides background on representative VLA systems, formal problem formulations, architectural components, training paradigms, and inference mechanisms to ground the safety discussion.
Novelty
The paper presents what it describes as the first comprehensive survey focused specifically on VLA safety. Its distinctive contribution is a structured taxonomy organized along parallel attack-time and defense-time axes that systematically connects threats, mitigations, evaluation protocols, and deployment considerations within a single framework, bridging previously fragmented literatures across robotic learning, adversarial machine learning, AI alignment, and autonomous systems safety.
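The two-axis taxonomy can be pictured as a small lookup structure. The sketch below is illustrative only and is not taken from the survey: the `Role`, `Timing`, and `Entry` names and both helper functions are hypothetical. The timing of the attack entries follows this summary, while the timing assigned to SafeVLA is an assumption, since the summary does not state it.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    ATTACK = "attack"
    DEFENSE = "defense"


class Timing(Enum):
    TRAINING = "training-time"
    INFERENCE = "inference-time"


@dataclass(frozen=True)
class Entry:
    """One method placed in the taxonomy (hypothetical record type)."""
    name: str
    role: Role
    timing: Timing
    kind: str


ENTRIES = [
    # Training-time backdoor attacks named in the summary.
    Entry("BadVLA", Role.ATTACK, Timing.TRAINING, "backdoor"),
    Entry("DropVLA", Role.ATTACK, Timing.TRAINING, "backdoor"),
    Entry("SilentDrift", Role.ATTACK, Timing.TRAINING, "backdoor"),
    # Inference-time semantic jailbreak named in the summary.
    Entry("RoboPAIR", Role.ATTACK, Timing.INFERENCE, "semantic jailbreak"),
    # ASSUMPTION: the summary does not state SafeVLA's timing; alignment is
    # treated here as a training-time defense for illustration.
    Entry("SafeVLA", Role.DEFENSE, Timing.TRAINING, "constrained safety alignment"),
    # Runtime architectures act at inference time.
    Entry("dual-loop runtime architecture", Role.DEFENSE, Timing.INFERENCE, "runtime monitor"),
]


def group_by_cell(entries):
    """Map each (role, timing) cell of the taxonomy to the method names in it."""
    cells = defaultdict(list)
    for entry in entries:
        cells[(entry.role, entry.timing)].append(entry.name)
    return dict(cells)


def empty_cells(entries):
    """List the (role, timing) cells with no entry, i.e. coverage gaps."""
    filled = {(entry.role, entry.timing) for entry in entries}
    return [(r, t) for r in Role for t in Timing if (r, t) not in filled]
```

Calling `group_by_cell(ENTRIES)` groups the methods into the four cells of the grid, and `empty_cells` gives a crude way to surface coverage gaps for any subset of the literature.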
Results
As a survey, the paper's main outcome is a consolidated map of the VLA safety landscape rather than new experimental results. It synthesizes training-time threats (e.g., backdoor attacks such as BadVLA, DropVLA, and SilentDrift) and inference-time attacks (e.g., semantic jailbreaks, with attack success rates of up to 100% reported for RoboPAIR). It catalogs defenses spanning constrained safety alignment (SafeVLA), human-in-the-loop refinement (APO, Hi-ORS), and dual-loop runtime architectures. It also identifies open problems, including certified robustness for embodied trajectories, physically realizable defenses, and standardized evaluation.
Key Points
- The survey frames VLA safety as an embodied, multimodal problem involving visual, language, proprioceptive state, and action vulnerabilities, which makes it distinct from prompt-level LLM alignment. It documents specific attack methods (e.g., BadVLA, DropVLA, GoBA, SilentDrift) that exploit cross-modal alignment, physical triggers, temporal action chunking, and state-space poisoning.
- It proposes a unified taxonomy linking training-time and inference-time attacks with corresponding training-time and inference-time defenses, and identifies coverage gaps such as the lack of certified robustness methods for embodied trajectories and the absence of unified runtime safety architectures.
- It reviews 16+ safety benchmarks and multiple metric categories (task-level, behavioral, robustness, composite), highlighting that current VLA models exhibit critical weaknesses including rejection rates as low as 10% for hazardous instructions and average success rates below 13% under systematic perturbation testing.
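
The headline figures in the last point (attack success rate, rejection rate, average success under perturbation) are all simple proportions over trials. The sketch below shows one plausible way to compute them. It is a minimal illustration with hypothetical function names and toy inputs, not the definitions used by any benchmark in the survey, and the averaging across perturbation types is an assumption.

```python
from typing import Mapping, Sequence


def rate(flags: Sequence[bool]) -> float:
    """Fraction of trials for which the flag is True."""
    if not flags:
        raise ValueError("need at least one trial")
    return sum(flags) / len(flags)


def attack_success_rate(attack_succeeded: Sequence[bool]) -> float:
    """Share of attack attempts that produced the attacker's intended behavior."""
    return rate(attack_succeeded)


def rejection_rate(refused: Sequence[bool]) -> float:
    """Share of hazardous instructions that the model refused to execute."""
    return rate(refused)


def mean_success_under_perturbation(
    results: Mapping[str, Sequence[bool]],
) -> float:
    """Task success rate averaged over perturbation types.

    ASSUMPTION: each perturbation type is weighted equally (macro average);
    a benchmark could instead pool all trials.
    """
    if not results:
        raise ValueError("need at least one perturbation type")
    per_type = [rate(trials) for trials in results.values()]
    return sum(per_type) / len(per_type)


# Toy data for illustration only; these are not results from the survey.
toy_refusals = [True] + [False] * 9  # 1 refusal out of 10 hazardous prompts
toy_perturbed = {
    "lighting": [True, False, False, False],
    "occlusion": [False, False, False, False],
}
```

With the toy inputs, `rejection_rate(toy_refusals)` is 0.1 and `mean_success_under_perturbation(toy_perturbed)` is 0.125. A low rejection rate and a high attack success rate are both bad for safety, so composite scores need to fix a direction for each metric before combining them.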