Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments
- URL: http://arxiv.org/abs/2505.05540v2
- Date: Tue, 17 Jun 2025 03:16:46 GMT
- Title: Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments
- Authors: Pranav Guruprasad, Yangyue Wang, Sudipta Chowdhury, Harshvardhan Sikka, Paul Pu Liang
- Abstract summary: Vision-language-action (VLA) models represent an important step toward general-purpose robotic systems. We introduce MultiNet v0.2, a benchmark designed to evaluate and analyze the generalization performance of state-of-the-art VLMs and VLAs.
- Score: 20.93006455952299
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language-action (VLA) models represent an important step toward general-purpose robotic systems by integrating visual perception, language understanding, and action execution. However, systematic evaluation of these models, particularly their zero-shot generalization capabilities in procedurally out-of-distribution (OOD) environments, remains limited. In this paper, we introduce MultiNet v0.2, a comprehensive benchmark designed to evaluate and analyze the generalization performance of state-of-the-art VLMs and VLAs - including GPT-4o, GPT-4.1, OpenVLA, Pi0 Base, and Pi0 FAST - on diverse procedural tasks from the Procgen benchmark. Our analysis reveals several critical insights: (1) all evaluated models exhibit significant limitations in zero-shot generalization to OOD tasks, with performance heavily influenced by factors such as action representation and task complexity; (2) VLAs generally outperform other models due to their robust architectural design; and (3) VLM variants demonstrate substantial improvements when constrained appropriately, highlighting the sensitivity of model performance to precise prompt engineering. We release our benchmark, evaluation framework, and findings to enable the assessment of future VLA models and identify critical areas for improvement in their application to out-of-distribution digital tasks.
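To make the evaluation protocol concrete, here is a minimal sketch of a zero-shot rollout loop on procedurally generated Procgen levels, scoring a model by mean episode return. It assumes the open-source `procgen` and classic `gym` (pre-0.26 API) packages; `query_model` is a hypothetical placeholder for a VLM/VLA policy, not the authors' released MultiNet v0.2 harness.

```python
# Hedged sketch: zero-shot evaluation of a (hypothetical) VLM/VLA policy on Procgen.
# Assumes `pip install procgen gym` and the classic gym API (reset -> obs, step -> 4-tuple).
import gym
import numpy as np


def query_model(observation: np.ndarray, instruction: str) -> int:
    """Hypothetical stand-in for a VLM/VLA: map an RGB frame plus a text
    instruction to one of Procgen's 15 discrete actions."""
    return 4  # placeholder action; a real policy would call the model here


def evaluate_zero_shot(game: str = "coinrun", episodes: int = 10) -> float:
    """Roll the model out on unseen, procedurally generated levels and
    return the mean episode return as a simple generalization proxy."""
    env = gym.make(
        f"procgen:procgen-{game}-v0",
        num_levels=0,               # 0 = sample from the full level distribution
        distribution_mode="hard",
    )
    instruction = f"You are playing Procgen {game}. Pick the next discrete action."
    returns = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            action = query_model(obs, instruction)
            obs, reward, done, _ = env.step(action)
            total += reward
        returns.append(total)
    env.close()
    return float(np.mean(returns))


if __name__ == "__main__":
    print("mean return:", evaluate_zero_shot())
```

In practice, the interesting part is how `query_model` serializes observations and decodes model outputs into valid discrete actions, which is exactly the action-representation sensitivity the abstract highlights.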
Related papers
- Adapting Vision-Language Models for Evaluating World Models [24.813041196394582]
We present UNIVERSE, a method for adapting a Vision-language Evaluator for Rollouts in Simulated Environments under data and compute constraints. We conduct a large-scale study comparing full, partial, and parameter-efficient finetuning across task formats, context lengths, sampling strategies, and data compositions. The resulting unified evaluator matches the performance of task-specific baselines using a single checkpoint.
arXiv Detail & Related papers (2025-06-22T09:53:28Z) - Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation [53.84282335629258]
We introduce a comprehensive fine-grained evaluation benchmark, i.e., FG-BMK, comprising 1.01 million questions and 0.33 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives. We uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance.
arXiv Detail & Related papers (2025-04-21T09:30:41Z) - Sustainability via LLM Right-sizing [21.17523328451591]
Large language models (LLMs) have become increasingly embedded in organizational workflows. This study offers an empirical answer by evaluating eleven proprietary and open-weight LLMs across ten everyday occupational tasks. Results show that GPT-4o delivers consistently superior performance but at a significantly higher cost and environmental footprint.
arXiv Detail & Related papers (2025-04-17T04:00:40Z) - Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation [38.20492321295552]
Vision-Language Models (VLMs) have gained widespread adoption in Open-Vocabulary (OV) object detection and segmentation tasks. Although they have shown promise on OV-related tasks, their effectiveness in conventional vision tasks has thus far been unevaluated.
arXiv Detail & Related papers (2025-04-13T08:28:13Z) - VACT: A Video Automatic Causal Testing System and a Benchmark [55.53300306960048]
VACT is an **automated** framework for modeling, evaluating, and measuring the causal understanding of VGMs in real-world scenarios. We introduce multi-level causal evaluation metrics to provide a detailed analysis of the causal performance of VGMs.
arXiv Detail & Related papers (2025-03-08T10:54:42Z) - A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs [3.2228025627337864]
This paper introduces a structured evaluation framework to dissect the perception-reasoning interface in Vision-Language Models (VLMs). We propose three distinct evaluation paradigms, mirroring human problem-solving strategies. Applying this framework, we demonstrate that CA, leveraging powerful language models for reasoning over rich, independently generated descriptions, achieves new state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2025-01-23T12:42:42Z) - CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation [100.25567121604382]
Vision-Language-Action (VLA) models have improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios. We present a new advanced VLA architecture derived from Vision-Language Models (VLMs). We show that our model not only significantly surpasses existing VLAs in task performance but also exhibits remarkable adaptation to new robots and generalization to unseen objects and backgrounds.
arXiv Detail & Related papers (2024-11-29T12:06:03Z) - MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z) - Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks [20.93006455952299]
Vision-language-action (VLA) models represent a promising direction for developing general-purpose robotic systems. We present a comprehensive evaluation framework and benchmark suite for assessing VLA models.
arXiv Detail & Related papers (2024-11-04T18:01:34Z) - VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM).
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z) - Large Language Model Evaluation Via Multi AI Agents: Preliminary results [3.8066447473175304]
We introduce a novel multi-agent AI model that aims to assess and compare the performance of various Large Language Models (LLMs).
Our model consists of eight distinct AI agents, each responsible for retrieving code based on a common description from different advanced language models.
We integrate the HumanEval benchmark into our verification agent to assess the generated code's performance, providing insights into their respective capabilities and efficiencies (a minimal verification sketch appears after this list).
arXiv Detail & Related papers (2024-04-01T10:06:04Z) - Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)
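As noted in the "Large Language Model Evaluation Via Multi AI Agents" entry above, a verification agent can score generated code by functional correctness. Below is a hedged, self-contained sketch of a HumanEval-style check: a task's prompt and a candidate completion are executed together with a test suite exposing a `check(candidate)` function. The `check_candidate` helper and the toy problem are illustrative assumptions, not that paper's agent or the official human-eval harness.

```python
# Minimal HumanEval-style functional-correctness check (illustrative sketch only).
from typing import Dict


def check_candidate(problem: Dict[str, str], completion: str) -> bool:
    """Run prompt + completion, then call the test suite's check() on the entry point.
    NOTE: this executes untrusted code; a real verifier should sandbox and time-limit it."""
    program = problem["prompt"] + completion + "\n" + problem["test"]
    namespace: Dict[str, object] = {}
    try:
        exec(program, namespace)
        namespace["check"](namespace[problem["entry_point"]])
        return True
    except Exception:
        return False


if __name__ == "__main__":
    toy_problem = {
        "prompt": "def add(a, b):\n",
        "test": "def check(candidate):\n    assert candidate(2, 3) == 5\n",
        "entry_point": "add",
    }
    print(check_candidate(toy_problem, "    return a + b\n"))  # True
```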
This list is automatically generated from the titles and abstracts of the papers on this site.