Experiences from Benchmarking Vision-Language-Action Models for Robotic Manipulation
- URL: http://arxiv.org/abs/2511.11298v1
- Date: Fri, 14 Nov 2025 13:35:30 GMT
- Title: Experiences from Benchmarking Vision-Language-Action Models for Robotic Manipulation
- Authors: Yihao Zhang, Yuankai Qi, Xi Zheng,
- Abstract summary: This paper reports our empirical experiences from benchmarking four representative VLAs. We establish a standardized evaluation framework that measures performance along three key dimensions. We find practical trade-offs among VLA model architectures in balancing precision, generalization, and deployment cost.
- Score: 22.40720239761228
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Foundation models applied in robotics, particularly Vision-Language-Action (VLA) models, hold great promise for achieving general-purpose manipulation. Yet, systematic real-world evaluations and cross-model comparisons remain scarce. This paper reports our empirical experiences from benchmarking four representative VLAs (ACT, OpenVLA-OFT, RDT-1B, and π_0) across four manipulation tasks conducted in both simulation and on the ALOHA Mobile platform. We establish a standardized evaluation framework that measures performance along three key dimensions: (1) accuracy and efficiency (success rate and time-to-success), (2) adaptability across in-distribution, spatial out-of-distribution, and instance-plus-spatial out-of-distribution settings, and (3) language instruction-following accuracy. Through this process, we observe that π_0 demonstrates superior adaptability in out-of-distribution scenarios, while ACT provides the highest stability in-distribution. Further analysis highlights differences in computational demands, data-scaling behavior, and recurring failure modes such as near-miss grasps, premature releases, and long-horizon state drift. These findings reveal practical trade-offs among VLA model architectures in balancing precision, generalization, and deployment cost, offering actionable insights for selecting and deploying VLAs in real-world robotic manipulation tasks.
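The first two evaluation dimensions lend themselves to a simple aggregation routine. Below is a minimal sketch (not the authors' released code; the `Episode` record and setting labels are assumptions for illustration) of how success rate and time-to-success could be tallied per distribution setting:

```python
# A minimal sketch of per-setting success rate and time-to-success aggregation.
# The Episode record and setting labels are assumptions, not the paper's code.
from dataclasses import dataclass
from statistics import mean
from typing import Optional

@dataclass
class Episode:
    setting: str                      # e.g. "ID", "spatial-OOD", "instance+spatial-OOD"
    success: bool
    time_to_success: Optional[float]  # seconds; None when the rollout failed

def summarize(episodes: list[Episode]) -> dict[str, dict[str, float]]:
    """Aggregate success rate and mean time-to-success per evaluation setting."""
    report = {}
    for setting in sorted({e.setting for e in episodes}):
        subset = [e for e in episodes if e.setting == setting]
        times = [e.time_to_success for e in subset if e.success]
        report[setting] = {
            "success_rate": len(times) / len(subset),
            "mean_time_to_success": mean(times) if times else float("nan"),
        }
    return report
```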
Related papers
- LIBERO-X: Robustness Litmus for Vision-Language-Action Models [32.29541801424534]
This work systematically rethinks VLA benchmarking from both evaluation and data perspectives. We introduce LIBERO-X, a benchmark featuring a hierarchical evaluation protocol with progressive difficulty levels targeting three core capabilities. Experiments with representative VLA models reveal significant performance drops under cumulative perturbations.
arXiv Detail & Related papers (2026-02-06T09:59:12Z)
- D-Models and E-Models: Diversity-Stability Trade-offs in the Sampling Behavior of Large Language Models [91.21455683212224]
In large language models (LLMs), the sampling probability of the next token (P_token) is implicitly tied to the probability of satisfying the task (P_task). But whether fine-grained sampling probabilities faithfully align with task requirements remains an open question. We identify two model types: D-models, whose P_token exhibits large step-to-step variability and poor alignment with P_task; and E-models, whose P_token is more stable and better aligned with P_task.
arXiv Detail & Related papers (2026-01-25T14:59:09Z)
- Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach [78.4812458793128]
We propose TACO, a test-time-scaling framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and being gradient-free, it brings significant computational savings.
arXiv Detail & Related papers (2025-12-02T14:42:54Z)
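The gradient-free selection idea is easy to picture. A hypothetical sketch (the `policy.sample_chunk` and `pseudo_count` callables are stand-ins, not TACO's actual API): sample several candidate action chunks, score each with the verifier, and execute the least exploratory one.

```python
import numpy as np

def select_action_chunk(policy, obs, pseudo_count, n_samples: int = 16):
    """Test-time scaling as anti-exploration: draw candidate action chunks
    from the VLA and keep the one the pseudo-count verifier rates as most
    in-distribution (least exploratory). No gradients are required."""
    candidates = [policy.sample_chunk(obs) for _ in range(n_samples)]
    scores = np.array([pseudo_count(obs, chunk) for chunk in candidates])
    return candidates[int(scores.argmax())]
```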
- SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation [65.6201974979119]
We propose SemanticVLA, a novel VLA framework that performs Semantic-Hierarchical Sparsification and Enhancement for Efficient Robotic Manipulation. SemanticVLA surpasses OpenVLA on the LIBERO benchmark by 21.1% in success rate, while reducing training cost and inference latency by 3.0-fold and 2.7-fold, respectively.
arXiv Detail & Related papers (2025-11-13T17:24:37Z)
- TRACE: Textual Reasoning for Affordance Coordinate Extraction [4.374024319540872]
Vision-Language Models (VLMs) struggle to translate high-level instructions into the precise spatial affordances required for robotic manipulation. We introduce TRACE, a novel methodology that integrates a textual Chain of Reasoning into the affordance prediction process. Our experiments show that our TRACE-tuned model achieves state-of-the-art performance, reaching 48.1% accuracy on the primary Where2Place benchmark.
arXiv Detail & Related papers (2025-11-03T19:13:26Z)
- Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model [62.889356203346985]
We propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict. DUST achieves up to 6% gains over a standard VLA baseline and implicit world-modeling methods. On real-world tasks with the Franka Research 3, DUST outperforms baselines in success rate by 13%.
arXiv Detail & Related papers (2025-10-31T16:32:12Z)
- Auto-Rubric: Learning to Extract Generalizable Criteria for Reward Modeling [37.237020102873]
Reward models are essential for aligning Large Language Models with human values, yet their development is hampered by costly preference datasets and poor interpretability. We build a training-free framework that infers high-quality, query-specific rubrics using a validation-guided Propose-Evaluate-Revise pipeline. Using just 70 preference pairs (1.5% of the source data), our method empowers smaller models like Qwen3-8B to outperform specialized, fully-trained counterparts.
arXiv Detail & Related papers (2025-10-20T09:01:37Z)
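A hedged sketch of what a validation-guided Propose-Evaluate-Revise loop could look like; `propose`, `evaluate`, and `revise` stand in for LLM calls the paper presumably makes, and the acceptance threshold is an invented parameter:

```python
def infer_rubric(query, val_pairs, propose, evaluate, revise,
                 max_rounds: int = 3, threshold: float = 0.9):
    """Training-free rubric inference: propose a query-specific rubric, check
    how well it reproduces held-out preference judgments, and revise until it
    agrees well enough or the round budget runs out."""
    rubric = propose(query)
    for _ in range(max_rounds):
        agreement = evaluate(rubric, val_pairs)  # fraction of pairs ranked correctly
        if agreement >= threshold:
            break
        rubric = revise(rubric, val_pairs)
    return rubric
```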
- Continual Action Quality Assessment via Adaptive Manifold-Aligned Graph Regularization [53.82400605816587]
Action Quality Assessment (AQA) quantifies human actions in videos, supporting applications in sports scoring, rehabilitation, and skill evaluation. A major challenge lies in the non-stationary nature of quality distributions in real-world scenarios. We introduce Continual AQA (CAQA), which equips AQA with Continual Learning capabilities to handle evolving distributions.
arXiv Detail & Related papers (2025-10-08T10:09:47Z)
- Accelerate Scaling of LLM Finetuning via Quantifying the Coverage and Depth of Instruction Set [37.26992936545316]
Scaling the amount of data used for supervised fine-tuning (SFT) does not guarantee proportional gains in model performance. This work identifies two fundamental dataset properties that govern SFT scalability. We propose the Information Landscape Approximation (ILA), a model-agnostic data selection framework.
arXiv Detail & Related papers (2025-09-08T09:22:57Z)
- Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance [63.33213516925946]
We introduce Align-Then-stEer (ATE), a novel, data-efficient, and plug-and-play adaptation framework. Our work presents a general and lightweight solution that greatly enhances the practicality of deploying VLA models to new robotic platforms and tasks.
arXiv Detail & Related papers (2025-09-02T07:51:59Z)
- Test-Time Consistency in Vision Language Models [26.475993408532304]
Vision-Language Models (VLMs) have achieved impressive performance across a wide range of multimodal tasks. Recent benchmarks, such as MM-R3, highlight that even state-of-the-art VLMs can produce divergent predictions across semantically equivalent inputs. We propose a simple and effective test-time consistency framework that enhances semantic consistency without supervised re-training.
arXiv Detail & Related papers (2025-06-27T17:09:44Z)
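One simple way to realize such a framework (a sketch of the general idea, not necessarily the paper's exact procedure; `vlm` and `paraphrase` are hypothetical callables) is to aggregate predictions over semantically equivalent phrasings:

```python
from collections import Counter

def consistent_predict(vlm, image, question: str, paraphrase, k: int = 5) -> str:
    """Query the model on k semantically equivalent phrasings and return the
    majority answer, so a single unstable phrasing cannot flip the prediction."""
    variants = [question] + [paraphrase(question) for _ in range(k - 1)]
    answers = [vlm(image, q) for q in variants]
    return Counter(answers).most_common(1)[0][0]
```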
- CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling [84.51372201195132]
CronusVLA is a unified framework that extends single-frame VLA models to the multi-frame paradigm. CronusVLA achieves leading performance and superior robustness, with a 70.9% success rate. These results highlight the potential of efficient multi-frame adaptation in VLA models for more powerful and robust real-world deployment.
arXiv Detail & Related papers (2025-06-24T17:30:27Z)
- From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation [35.79160868966466]
We propose FSD (From Seeing to Doing), a novel vision-language model that generates intermediate representations through spatial relationship reasoning. Our approach combines a hierarchical data pipeline for training with a self-consistency mechanism that aligns spatial coordinates with visual signals. We show that FSD achieves a 40.6% success rate in SimplerEnv and a 72% success rate across 8 real-world tasks, outperforming the strongest baseline by 30%.
arXiv Detail & Related papers (2025-05-13T13:20:46Z)
- TD3: Tucker Decomposition Based Dataset Distillation Method for Sequential Recommendation [50.23504065567638]
This paper introduces TD3, a novel Dataset Distillation method within a meta-learning framework. TD3 distills a fully expressive synthetic sequence summary from original data. An augmentation technique allows the learner to closely fit the synthetic summary, ensuring that it is accurately updated in the outer loop.
arXiv Detail & Related papers (2025-02-05T03:13:25Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve model alignment across different task scenarios.
We implement UAL in a simple fashion: adaptively setting the label smoothing value during training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
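The UAL recipe above is concrete enough to sketch. A minimal PyTorch version, assuming a per-sample uncertainty score in [0, 1] is already available (how that score is estimated is the paper's contribution and is not shown here):

```python
import torch
import torch.nn.functional as F

def ual_loss(logits: torch.Tensor, targets: torch.Tensor,
             uncertainty: torch.Tensor, eps_max: float = 0.2) -> torch.Tensor:
    """Cross-entropy with per-sample label smoothing: more uncertain samples
    get softer targets. logits: (B, C); targets: (B,); uncertainty: (B,) in [0, 1]."""
    eps = eps_max * uncertainty                                    # per-sample smoothing
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)    # (B,) hard-target term
    smooth = -log_probs.mean(dim=-1)                               # (B,) uniform-target term
    return ((1.0 - eps) * nll + eps * smooth).mean()
```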