VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents
- URL: http://arxiv.org/abs/2601.16973v1
- Date: Fri, 23 Jan 2026 18:43:34 GMT
- Title: VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents
- Authors: Zirui Wang, Junyi Zhang, Jiaxin Ge, Long Lian, Letian Fu, Lisa Dunlap, Ken Goldberg, XuDong Wang, Ion Stoica, David M. Chan, Sewon Min, Joseph E. Gonzalez
- Abstract summary: We introduce VisGym, a gymnasium of 17 environments for evaluating and training Modern Vision-Language Models (VLMs). The suite spans symbolic puzzles, real-image understanding, navigation, and manipulation, and provides flexible controls over difficulty, input representation, planning horizon, and feedback. Our evaluations show that all frontier models struggle in interactive settings, achieving low success rates in both the easy (46.6%) and hard (26.0%) configurations.
- Score: 96.01507640637534
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern Vision-Language Models (VLMs) remain poorly characterized in multi-step visual interactions, particularly in how they integrate perception, memory, and action over long horizons. We introduce VisGym, a gymnasium of 17 environments for evaluating and training VLMs. The suite spans symbolic puzzles, real-image understanding, navigation, and manipulation, and provides flexible controls over difficulty, input representation, planning horizon, and feedback. We also provide multi-step solvers that generate structured demonstrations, enabling supervised finetuning. Our evaluations show that all frontier models struggle in interactive settings, achieving low success rates in both the easy (46.6%) and hard (26.0%) configurations. Our experiments reveal notable limitations: models struggle to effectively leverage long context, performing worse with an unbounded history than with truncated windows. Furthermore, we find that several text-based symbolic tasks become substantially harder once rendered visually. However, explicit goal observations, textual feedback, and exploratory demonstrations in partially observable or unknown-dynamics settings for supervised finetuning yield consistent gains, highlighting concrete failure modes and pathways for improving multi-step visual decision-making. Code, data, and models can be found at: https://visgym.github.io/.
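The abstract frames VisGym as a gymnasium of interactive environments in which an agent receives image observations and acts over multiple steps. As a rough illustration of that interaction pattern only (not the actual VisGym API), the sketch below implements a toy Gymnasium-style reset/step loop over rendered image observations; ToyGridEnv, its action encoding, and run_episode are hypothetical placeholders, and a real VLM agent would replace the random action choice.

```python
# Minimal, self-contained sketch of a multi-step agent loop over image
# observations, in the spirit of a Gymnasium-style interface.
# ToyGridEnv and its reward logic are hypothetical placeholders, not the VisGym API.
import numpy as np

class ToyGridEnv:
    """Agent moves on a small grid; observations are rendered RGB images."""
    def __init__(self, size: int = 8):
        self.size = size
        self.pos = np.array([0, 0])
        self.goal = np.array([size - 1, size - 1])
        self.rng = np.random.default_rng()

    def _render(self) -> np.ndarray:
        # Return an (H, W, 3) uint8 image: goal in green, agent in red.
        img = np.zeros((self.size, self.size, 3), dtype=np.uint8)
        img[tuple(self.goal)] = [0, 255, 0]
        img[tuple(self.pos)] = [255, 0, 0]
        return img

    def reset(self, seed=None):
        self.rng = np.random.default_rng(seed)
        self.pos = np.array([0, 0])
        return self._render(), {}

    def step(self, action: int):
        # Actions 0-3 move right/left/down/up; the episode ends at the goal.
        moves = {0: (0, 1), 1: (0, -1), 2: (1, 0), 3: (-1, 0)}
        self.pos = np.clip(self.pos + moves[action], 0, self.size - 1)
        terminated = bool((self.pos == self.goal).all())
        reward = 1.0 if terminated else 0.0
        return self._render(), reward, terminated, False, {}

def run_episode(env: ToyGridEnv, max_steps: int = 50) -> bool:
    """Roll out a random policy; a VLM agent would pick actions from the image."""
    obs, info = env.reset(seed=0)
    for _ in range(max_steps):
        action = int(env.rng.integers(0, 4))  # stand-in for a VLM-chosen action
        obs, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:
            return terminated
    return False

if __name__ == "__main__":
    print("reached goal:", run_episode(ToyGridEnv()))
```

In VisGym's framing, the difficulty, input representation, planning horizon, and feedback controls would be parameters of the real environments; the toy grid here only shows the loop structure that a multi-step visual agent must handle.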
Related papers
- Vision-Language Models Unlock Task-Centric Latent Actions [75.53481518882275]
We propose to utilize the common-sense reasoning abilities of Vision-Language Models (VLMs) to provide promptable representations. We show that simply asking VLMs to ignore distractors can substantially improve latent action quality, yielding up to a six-fold increase in downstream success rates on Distracting MetaWorld.
arXiv Detail & Related papers (2026-01-30T08:38:59Z) - Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding [43.63398524449102]
Humans perceive complex scenes efficiently by dynamically scanning and focusing on salient regions in a sequential "blink-like" process. We propose Blink, a dynamic visual token resolution framework that emulates this human-inspired process within a single forward pass. Blink balances broad exploration and fine-grained focus, thereby enhancing visual perception adaptively and efficiently.
arXiv Detail & Related papers (2025-12-11T11:27:25Z) - Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models [58.91911788912665]
We propose Latent Visual Reconstruction (LaVer), a novel training framework that facilitates MLLMs in learning more discriminative visual representations. Our method offers direct visual activation to MLLMs, which exhibit increased visual attention allocation, indicating enhanced utilization of visual information.
arXiv Detail & Related papers (2025-12-06T04:20:13Z) - ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs [98.27348724529257]
We introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Models trained with the ViCrit task exhibit substantial gains across a variety of vision-language model benchmarks.
arXiv Detail & Related papers (2025-06-11T19:16:54Z) - V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models [84.27290155010533]
We introduce Vision-centric Multiple Abilities Game Evaluation (V-MAGE), a novel game-based evaluation framework. V-MAGE features five distinct video games comprising over 30 carefully constructed evaluation scenarios. We show V-MAGE provides actionable insights for improving the visual and reasoning capabilities of MLLMs in dynamic, interactive settings.
arXiv Detail & Related papers (2025-04-08T15:43:01Z) - Seeing is Believing? Enhancing Vision-Language Navigation using Visual Perturbations [41.5875455113941]
We investigate whether advanced VLN models genuinely comprehend the visual content of their environments. Surprisingly, we experimentally find that simple branch expansion, even with noisy visual inputs, paradoxically improves navigational efficacy. We present a versatile Multi-Branch Architecture (MBA) designed to delve into the impact of both branch quantity and visual quality.
arXiv Detail & Related papers (2024-09-09T12:17:38Z) - VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach to reduce vision compute by letting redundant vision tokens "skip layers" rather than decreasing the number of vision tokens.
Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
arXiv Detail & Related papers (2024-08-29T17:21:58Z) - OLIVE: Object Level In-Context Visual Embeddings [8.168219870640318]
We propose a novel method to prompt large language models with in-context visual object vectors.
This eliminates the necessity of fusing a lengthy array of image patch features and significantly speeds up training.
Our experiments reveal that our method achieves competitive referring object classification and captioning performance.
arXiv Detail & Related papers (2024-06-02T21:36:31Z) - CogCoM: A Visual Language Model with Chain-of-Manipulations Reasoning [61.21923643289266]
Chain of Manipulations is a mechanism that enables Vision-Language Models to solve problems step-by-step with evidence. After training, models can solve various visual problems by eliciting intrinsic manipulations (e.g., grounding, zoom in) actively without involving external tools. Our trained model, CogCoM, achieves state-of-the-art performance across 9 benchmarks from 4 categories.
arXiv Detail & Related papers (2024-02-06T18:43:48Z) - Visual Adversarial Imitation Learning using Variational Models [60.69745540036375]
Reward function specification remains a major impediment for learning behaviors through deep reinforcement learning.
Visual demonstrations of desired behaviors often present an easier and more natural way to teach agents.
We develop a variational model-based adversarial imitation learning algorithm.
arXiv Detail & Related papers (2021-07-16T00:15:18Z)