Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents
- URL: http://arxiv.org/abs/2505.24878v1
- Date: Fri, 30 May 2025 17:59:55 GMT
- Title: Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents
- Authors: Yaxin Luo, Zhaoyi Li, Jiacheng Liu, Jiacheng Cui, Xiaohan Zhao, Zhiqiang Shen,
- Abstract summary: Open CaptchaWorld is the first web-based benchmark and platform specifically designed to evaluate the visual reasoning and interaction capabilities of MLLM-powered agents.<n>Results show that humans consistently achieve near-perfect scores, state-of-the-art MLLM agents struggle significantly, with success rates at most 40.0% by Browser-Use Openai-o3.<n>This highlights Open CaptchaWorld as a vital benchmark for diagnosing the limits of current multimodal agents and guiding the development of more robust multimodal reasoning systems.
- Score: 23.715342148854006
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: CAPTCHAs have been a critical bottleneck for deploying web agents in real-world applications, often blocking them from completing end-to-end automation tasks. While modern multimodal LLM agents have demonstrated impressive performance in static perception tasks, their ability to handle interactive, multi-step reasoning challenges like CAPTCHAs is largely untested. To address this gap, we introduce Open CaptchaWorld, the first web-based benchmark and platform specifically designed to evaluate the visual reasoning and interaction capabilities of MLLM-powered agents through diverse and dynamic CAPTCHA puzzles. Our benchmark spans 20 modern CAPTCHA types, totaling 225 CAPTCHAs, annotated with a new metric we propose: CAPTCHA Reasoning Depth, which quantifies the number of cognitive and motor steps required to solve each puzzle. Experimental results show that humans consistently achieve near-perfect scores, state-of-the-art MLLM agents struggle significantly, with success rates at most 40.0% by Browser-Use Openai-o3, far below human-level performance, 93.3%. This highlights Open CaptchaWorld as a vital benchmark for diagnosing the limits of current multimodal agents and guiding the development of more robust multimodal reasoning systems. Code and Data are available at this https URL.
Related papers
- WebUIBench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code [57.45181837786448]
Multimodal Large Language Models (MLLMs) have the potential to act as AI software engineers capable of executing complex web application development.<n>Existing benchmarks usually fail to provide an assessment of sub-capabilities and focus solely on webpage generation outcomes.<n>We propose WebUIBench, a benchmark systematically designed to evaluate MLLMs in four key areas: WebUI Perception, HTML Programming,WebUI-HTML Understanding, and WebUI-to-Code.
arXiv Detail & Related papers (2025-06-09T14:46:02Z) - MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks [7.293357145882387]
MCA-Bench is a comprehensive and reproducible benchmarking suite.<n>It integrates heterogeneous CAPTCHA types into a single evaluation protocol.<n>Extensive experiments reveal that MCA-Bench effectively maps the vulnerability spectrum of modern CAPTCHA designs.
arXiv Detail & Related papers (2025-06-06T11:02:01Z) - VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information.<n>We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning.<n>We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z) - EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents [63.43699771428243]
EmbodiedBench is an extensive benchmark designed to evaluate vision-driven embodied agents.<n>We evaluated 19 leading proprietary and open-source MLLMs within EmbodiedBench.<n> MLLMs excel at high-level tasks but struggle with low-level manipulation.
arXiv Detail & Related papers (2025-02-13T18:11:34Z) - IllusionCAPTCHA: A CAPTCHA based on Visual Illusion [14.043017273813227]
We present IllusionCAPTCHA, a novel security mechanism employing the "Human-Easy but AI-Hard" paradigm.<n>Results from our user study indicate that 86.95% of participants successfully passed the CAPTCHA on their first attempt, outperforming other CAPTCHA systems.
arXiv Detail & Related papers (2025-02-08T06:03:03Z) - Interaction2Code: Benchmarking MLLM-based Interactive Webpage Code Generation from Interactive Prototyping [57.024913536420264]
Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance on the design-to-code task.<n>We present the first systematic investigation of MLLMs in generating interactive webpages.
arXiv Detail & Related papers (2024-11-05T17:40:03Z) - Oedipus: LLM-enchanced Reasoning CAPTCHA Solver [17.074422329618212]
Oedipus is an innovative end-to-end framework for automated reasoning CAPTCHA solving.
Central to this framework is a novel strategy that dissects the complex and human-easy-AI-hard tasks into a sequence of simpler and AI-easy steps.
Our evaluation shows that Oedipus effectively resolves the studied CAPTCHAs, achieving an average success rate of 63.5%.
arXiv Detail & Related papers (2024-05-13T06:32:57Z) - VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? [115.60866817774641]
Multimodal Large Language models (MLLMs) have shown promise in web-related tasks.
evaluating their performance in the web domain remains a challenge due to the lack of comprehensive benchmarks.
bench is a multimodal benchmark designed to assess the capabilities of MLLMs across a variety of web tasks.
arXiv Detail & Related papers (2024-04-09T02:29:39Z) - VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks [93.85005277463802]
VisualWebArena is a benchmark designed to assess the performance of multimodal web agents on realistic tasks.
To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives.
arXiv Detail & Related papers (2024-01-24T18:35:21Z) - EnSolver: Uncertainty-Aware Ensemble CAPTCHA Solvers with Theoretical Guarantees [1.9649272351760065]
We propose Enr, a family of solvers that use deep ensemble uncertainty to detect and skip out-of-distribution CAPTCHAs.
We prove novel theoretical bounds on the effectiveness of our solvers and demonstrate their use with state-of-the-art CAPTCHA solvers.
arXiv Detail & Related papers (2023-07-27T20:19:11Z) - Robust Text CAPTCHAs Using Adversarial Examples [129.29523847765952]
We propose a user-friendly text-based CAPTCHA generation method named Robust Text CAPTCHA (RTC)
At the first stage, the foregrounds and backgrounds are constructed with randomly sampled font and background images.
At the second stage, we apply a highly transferable adversarial attack for text CAPTCHAs to better obstruct CAPTCHA solvers.
arXiv Detail & Related papers (2021-01-07T11:03:07Z) - Deep-CAPTCHA: a deep learning based CAPTCHA solver for vulnerability
assessment [1.027974860479791]
This research investigates the weaknesses and vulnerabilities of the CAPTCHA generator systems.
We develop a Convolutional Neural Network called Deep-CAPTCHA to achieve this goal.
Our network's cracking accuracy leads to a high rate of 98.94% and 98.31% for the numerical and the alpha-numerical test datasets.
arXiv Detail & Related papers (2020-06-15T11:44:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.