CAPTURE: A Benchmark and Evaluation for LVLMs in CAPTCHA Resolving
- URL: http://arxiv.org/abs/2512.11323v1
- Date: Fri, 12 Dec 2025 06:50:27 GMT
- Title: CAPTURE: A Benchmark and Evaluation for LVLMs in CAPTCHA Resolving
- Authors: Jianyi Zhang, Ziyin Zhou, Xu Ji, Shizhao Liu, Zhangchi Zhao,
- Abstract summary: We introduce CAPTURE, a CAPTCHA benchmark for Large Visual Language Models (LVLMs). Our benchmark encompasses 4 main CAPTCHA types and 25 sub-types from 31 vendors. When evaluated on this benchmark, current LVLMs demonstrate poor performance in solving CAPTCHAs.
- Score: 10.62647293259843
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Benefiting from strong and efficient multi-modal alignment strategies, Large Visual Language Models (LVLMs) are able to simulate human visual and reasoning capabilities, such as solving CAPTCHAs. However, existing benchmarks based on visual CAPTCHAs still face limitations. Previous studies customized their benchmarks and datasets to their own research objectives, so these benchmarks cannot comprehensively cover all CAPTCHA types. Notably, there is a dearth of dedicated benchmarks for LVLMs. To address this problem, we introduce the first CAPTCHA benchmark designed specifically for LVLMs, named CAPTURE (CAPTCHA for Testing Under Real-world Experiments). Our benchmark encompasses 4 main CAPTCHA types and 25 sub-types from 31 vendors. This diversity enables a multi-dimensional and thorough evaluation of LVLM performance. CAPTURE features extensive class variety, large-scale data, and unique LVLM-tailored labels, filling the gaps in previous research in terms of data comprehensiveness and labeling pertinence. When evaluated on this benchmark, current LVLMs demonstrate poor performance in solving CAPTCHAs.
Related papers
- COGNITION: From Evaluation to Defense against Multimodal LLM CAPTCHA Solvers [17.70082722524941]
Multimodal large language models (MLLMs) undermine the security guarantees of visual CAPTCHA. We evaluate 7 leading commercial and open-source MLLMs across 18 real-world CAPTCHA task types. We reveal that MLLMs can reliably solve recognition-oriented and low-interaction CAPTCHA tasks at human-like cost and latency.
arXiv Detail & Related papers (2025-12-02T01:23:10Z)
- Reasoning under Vision: Understanding Visual-Spatial Cognition in Vision-Language Models for CAPTCHA [21.107646541203387]
We show that step-by-step reasoning is crucial for vision-language models to solve CAPTCHAs. We introduce CAPTCHA-X, the first real-world benchmark with reasoning. Our method achieves state-of-the-art performance across five high-difficulty CAPTCHA types, with an average solving accuracy of 83.9 percent.
arXiv Detail & Related papers (2025-10-07T15:56:21Z)
- CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning [90.19455861166745]
We introduce Captioning Reinforcement Learning (CapRL), a training framework that redefines caption quality through its utility. As the first study to apply RLVR to the subjective image captioning task, we demonstrate that CapRL significantly enhances multiple settings. CapRL achieves performance comparable to Qwen2.5-VL-72B, while exceeding the baseline by an average margin of 8.4%.
arXiv Detail & Related papers (2025-09-26T17:59:55Z)
- ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models [67.75439511654078]
Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image input through textual responses. They face the persistent challenge of hallucination, which introduces practical weaknesses and raises concerns about their reliable deployment in real-world applications. We propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment.
arXiv Detail & Related papers (2025-07-01T16:01:08Z)
- MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks [13.493337474908316]
MCA-Bench is a comprehensive and reproducible benchmarking suite. It integrates heterogeneous CAPTCHA types into a single evaluation protocol. Extensive experiments reveal that MCA-Bench effectively maps the vulnerability spectrum of modern CAPTCHA designs.
arXiv Detail & Related papers (2025-06-06T11:02:01Z)
- Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents [23.715342148854006]
Open CaptchaWorld is the first web-based benchmark and platform specifically designed to evaluate the visual reasoning and interaction capabilities of MLLM-powered agents. Results show that while humans consistently achieve near-perfect scores, state-of-the-art MLLM agents struggle significantly, with a success rate of at most 40.0%, achieved by Browser-Use Openai-o3. This highlights Open CaptchaWorld as a vital benchmark for diagnosing the limits of current multimodal agents and guiding the development of more robust multimodal reasoning systems.
arXiv Detail & Related papers (2025-05-30T17:59:55Z)
- IllusionCAPTCHA: A CAPTCHA based on Visual Illusion [14.043017273813227]
We present IllusionCAPTCHA, a novel security mechanism employing the "Human-Easy but AI-Hard" paradigm. Results from our user study indicate that 86.95% of participants successfully passed the CAPTCHA on their first attempt, outperforming other CAPTCHA systems.
arXiv Detail & Related papers (2025-02-08T06:03:03Z)
- AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? [65.92331309449015]
We introduce AutoBench-V, an automated framework for serving evaluation on demand, i.e., benchmarking LVLMs based on specific aspects of model capability. Through an extensive evaluation of nine popular LVLMs across five demanded user inputs, the framework shows effectiveness and reliability.
arXiv Detail & Related papers (2024-10-28T17:55:08Z)
- Dysca: A Dynamic and Scalable Benchmark for Evaluating Perception Ability of LVLMs [61.01278660925202]
Dysca is a dynamic and scalable benchmark for evaluating LVLMs by leveraging synthesized images. We consider 51 kinds of image styles and evaluate the perception capability in 20 subtasks. Dysca serves as a scalable benchmark for easily adding new subtasks and scenarios.
arXiv Detail & Related papers (2024-06-27T02:40:35Z)
- ReForm-Eval: Evaluating Large Vision Language Models via Unified Re-Formulation of Task-Oriented Benchmarks [76.25209974199274]
Large vision-language models (LVLMs) exhibit surprising capabilities to perceive visual signals and perform visually grounded reasoning.
Our benchmark and evaluation framework will be open-sourced as a cornerstone for advancing the development of LVLMs.
arXiv Detail & Related papers (2023-10-04T04:07:37Z)
- Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models [76.410400238974]
We propose TTA with feedback to rectify the model output and prevent the model from becoming blindly confident.
A CLIP model is adopted as the reward model during TTA and provides feedback for the VLM.
The proposed reinforcement learning with CLIP feedback (RLCF) framework is highly flexible and universal.
arXiv Detail & Related papers (2021-01-07T11:03:07Z)
- Robust Text CAPTCHAs Using Adversarial Examples [129.29523847765952]
We propose a user-friendly text-based CAPTCHA generation method named Robust Text CAPTCHA (RTC).
In the first stage, the foregrounds and backgrounds are constructed with randomly sampled font and background images.
In the second stage, we apply a highly transferable adversarial attack for text CAPTCHAs to better obstruct CAPTCHA solvers.
arXiv Detail & Related papers (2021-01-07T11:03:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.