DrawingBench: Evaluating Spatial Reasoning and UI Interaction Capabilities of Large Language Models through Mouse-Based Drawing Tasks
- URL: http://arxiv.org/abs/2512.01174v1
- Date: Mon, 01 Dec 2025 01:18:21 GMT
- Title: DrawingBench: Evaluating Spatial Reasoning and UI Interaction Capabilities of Large Language Models through Mouse-Based Drawing Tasks
- Authors: Hyunjun Kim, Sooyoung Ryu
- Abstract summary: DrawingBench is a verification framework for evaluating the trustworthiness of agentic LLMs. Our framework comprises 250 diverse prompts across 20 categories and 4 difficulty levels. We evaluate four state-of-the-art LLMs across 1,000 tests.
- Score: 10.977990951788422
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As agentic AI systems increasingly operate autonomously, establishing trust through verifiable evaluation becomes critical. Yet existing benchmarks lack the transparency and auditability needed to assess whether agents behave reliably. We present DrawingBench, a verification framework for evaluating the trustworthiness of agentic LLMs through spatial reasoning tasks that require generating sequences of low-level GUI actions. Unlike opaque evaluations, DrawingBench provides transparent, rule-based assessment: 8 objective criteria enable reproducible scoring, while action-level inspection allows stakeholders to audit agent behavior. Our framework comprises 250 diverse prompts across 20 categories and 4 difficulty levels, deterministic evaluation metrics, and an external oversight mechanism through multi-turn feedback that enables human control over agent refinement. Evaluating four state-of-the-art LLMs (Claude-4 Sonnet, GPT-4.1, GPT-4.1-mini, Gemini-2.5 Flash) across 1,000 tests, we establish both capabilities and limitations: models achieved 92.8% perfect performance with structured external feedback driving significant improvements (average +3.2%, up to +32.8% for complex scenes), but systematic error patterns emerged in tool state management and long-horizon planning. Notably, specification clarity proved more important than task complexity -- models achieved 100% perfect performance when given explicit, verifiable criteria. These findings demonstrate that transparent evaluation frameworks can establish trust in agentic systems, with external oversight proving more reliable than self-correction for guiding agent behavior. Our open-source framework provides a template for trustworthy agent assessment. Code and data: https://github.com/hyunjun1121/DrawingBench
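The transparent, rule-based assessment the abstract describes — deterministic criteria applied to sequences of low-level GUI actions — could be sketched roughly as follows. This is a minimal illustration only: the action schema, canvas bounds, and criterion names are assumptions for the sketch, not DrawingBench's actual implementation or its 8 published criteria.

```python
# Hypothetical mouse-action schema and two deterministic rule checks.
# The real DrawingBench action format and criteria may differ.

CANVAS = (0, 0, 800, 600)  # x_min, y_min, x_max, y_max (assumed)

actions = [
    {"type": "mouse_down", "x": 100, "y": 100},
    {"type": "mouse_move", "x": 200, "y": 100},
    {"type": "mouse_up",   "x": 200, "y": 100},
]

def in_bounds(action):
    """Check a single action's coordinates against the canvas rectangle."""
    x0, y0, x1, y1 = CANVAS
    return x0 <= action["x"] <= x1 and y0 <= action["y"] <= y1

def strokes_well_formed(actions):
    """Every mouse_down must be closed by a mouse_up before the next down."""
    pressed = False
    for a in actions:
        if a["type"] == "mouse_down":
            if pressed:
                return False  # nested press: invalid tool state
            pressed = True
        elif a["type"] == "mouse_up":
            if not pressed:
                return False  # release without a press
            pressed = False
    return not pressed  # a dangling press also fails

# Per-criterion boolean report: reproducible and auditable at the
# action level, in the spirit of the framework's rule-based scoring.
report = {
    "all_points_in_bounds": all(in_bounds(a) for a in actions),
    "strokes_well_formed": strokes_well_formed(actions),
}
print(report)
```

Because each criterion is a pure function of the action sequence, scoring is deterministic and a failed criterion points directly at the offending actions, which is what makes action-level auditing possible.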
Related papers
- ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices [17.39388308538324]
This paper introduces ProactiveMobile, a benchmark for proactive mobile agent development. It formalizes the proactive task as inferring latent user intent across four dimensions of on-device contextual signals. The benchmark achieves a success rate of 19.15%, outperforming o1 (15.71%) and GPT-5 (7.39%) in experiments.
arXiv Detail & Related papers (2026-02-25T12:32:37Z) - When Agents Fail to Act: A Diagnostic Framework for Tool Invocation Reliability in Multi-Agent LLM Systems [1.8717456484053328]
Multi-agent systems powered by large language models (LLMs) are transforming enterprise automation. We introduce a comprehensive diagnostic framework that leverages big data analytics to evaluate procedural reliability in intelligent agent systems. This work establishes foundational infrastructure for systematic reliability evaluation of tool-augmented AI systems.
arXiv Detail & Related papers (2026-01-22T19:24:21Z) - Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification [71.98473277917962]
Recent advances in Deep Research Agents (DRAs) are transforming automated knowledge discovery and problem-solving. We propose an alternative paradigm: self-evolving the agent's ability by iteratively verifying the policy model's outputs, guided by meticulously crafted rubrics. We present DeepVerifier, a rubrics-based outcome reward verifier that leverages the asymmetry of verification.
arXiv Detail & Related papers (2026-01-22T09:47:31Z) - Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation [76.5533899503582]
Large language models (LLMs) are increasingly used as judges to evaluate agent performance. We show this paradigm implicitly assumes that the agent's chain-of-thought (CoT) reasoning faithfully reflects both its internal reasoning and the underlying environment state. We demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks.
arXiv Detail & Related papers (2026-01-21T06:07:43Z) - Towards a Science of Scaling Agent Systems [79.64446272302287]
We formalize a definition for agent evaluation and characterize scaling laws as the interplay between agent quantity, coordination structure, model, and task properties. We derive a predictive model using coordination metrics, cross-validated, enabling prediction on unseen task domains. We identify effects including: (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead, and (2) capability saturation: coordination yields diminishing or negative returns once single-agent baselines exceed 45%.
arXiv Detail & Related papers (2025-12-09T06:52:21Z) - CUARewardBench: A Benchmark for Evaluating Reward Models on Computer-using Agent [46.41047559759938]
Computer-using agents (CUAs) enable task completion through natural interaction with operating systems and software interfaces. Reward models offer promising alternatives, but their effectiveness on CUA evaluation remains largely underexplored. We present CUARewardBench, comprising four key contributions.
arXiv Detail & Related papers (2025-10-21T12:53:40Z) - VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents [130.70999337445468]
A key challenge in training Vision-Language Model (VLM) agents, compared to Language Model (LLM) agents, is the shift from textual states to complex visual observations. We ask: can VLM agents construct internal world models through explicit visual state reasoning? We architecturally enforce and reward the agent's reasoning process via reinforcement learning (RL). We find that decomposing the agent's reasoning into State Estimation and Transition Modeling is critical for success.
arXiv Detail & Related papers (2025-10-19T16:05:07Z) - How can we assess human-agent interactions? Case studies in software agent design [52.953425368394306]
We make two major steps towards the rigorous assessment of human-agent interactions. We propose PULSE, a framework for more efficient human-centric evaluation of agent designs. We deploy the framework on a large-scale web platform built around the open-source software agent OpenHands.
arXiv Detail & Related papers (2025-10-10T19:04:28Z) - ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability [23.70973331911138]
We argue that usable reasoning systems must be trustworthy, characterized by three properties: interpretability, faithfulness, and reliability. We propose ReFIne, a new training framework that integrates supervised fine-tuning with GRPO to encourage models to improve interpretability. Our experimental results show that ReFIne models generate clearer and better-structured reasoning traces.
arXiv Detail & Related papers (2025-10-10T07:08:44Z) - Zero-shot reasoning for simulating scholarly peer-review [0.0]
We investigate a deterministic simulation framework that provides the first stable, evidence-based standard for evaluating AI-generated peer review reports. First, the system is able to simulate calibrated editorial judgment, with 'Revise' decisions consistently forming the majority outcome. Second, it maintains unwavering procedural integrity, enforcing a stable 29% evidence-anchoring compliance rate.
arXiv Detail & Related papers (2025-10-02T13:59:14Z) - TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them [58.04324690859212]
The use of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistencies: Score-Comparison Inconsistency and Pairwise Transitivity Inconsistency. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations.
arXiv Detail & Related papers (2025-09-25T13:04:29Z) - JudgeAgent: Knowledge-wise and Dynamic LLM Evaluation with Agent-as-Interviewer [19.09571232466437]
We propose Agent-as-Interviewer, a dynamic evaluation paradigm for large language models (LLMs). Unlike current benchmarking or dynamic interaction paradigms, Agent-as-Interviewer utilizes agents to invoke knowledge tools for wider and deeper knowledge in dynamic multi-turn question generation. We develop JudgeAgent, a knowledge-wise dynamic evaluation framework that employs knowledge-driven synthesis as the agent's tool and uses difficulty scoring as strategy guidance.
arXiv Detail & Related papers (2025-09-02T08:52:16Z) - Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications [0.7124971549479361]
This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification. We determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability.
arXiv Detail & Related papers (2025-05-20T21:12:58Z) - SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints [59.645885492637845]
SOPBench is an evaluation pipeline that transforms each service-specific SOP code program into a directed graph of executable functions and requires agents to call these functions based on natural language SOP descriptions. We evaluate 18 leading models, and results show the task is challenging even for top-tier models.
arXiv Detail & Related papers (2025-03-11T17:53:02Z)