Reasoning under Vision: Understanding Visual-Spatial Cognition in Vision-Language Models for CAPTCHA
- URL: http://arxiv.org/abs/2510.06067v1
- Date: Tue, 07 Oct 2025 15:56:21 GMT
- Title: Reasoning under Vision: Understanding Visual-Spatial Cognition in Vision-Language Models for CAPTCHA
- Authors: Python Song, Luke Tenyi Chang, Yun-Yun Tsai, Penghui Li, Junfeng Yang
- Abstract summary: We show that step-by-step reasoning is crucial for vision-language models to solve CAPTCHAs. We introduce CAPTCHA-X, the first real-world benchmark with reasoning. Our method achieves state-of-the-art performance across five high-difficulty CAPTCHA types, with an average solving accuracy of 83.9 percent.
- Score: 21.107646541203387
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: CAPTCHA, originally designed to distinguish humans from robots, has evolved into a real-world benchmark for assessing the spatial reasoning capabilities of vision-language models. In this work, we first show that step-by-step reasoning is crucial for vision-language models (VLMs) to solve CAPTCHAs, which represent high-difficulty spatial reasoning tasks, and that current commercial vision-language models still struggle with such reasoning. In particular, we observe that most commercial VLMs (e.g., Gemini, Claude, GPT, etc.) fail to effectively solve CAPTCHAs and thus achieve low accuracy (around 21.9 percent). However, our findings indicate that requiring the model to perform step-by-step reasoning before generating the final coordinates can significantly enhance its solving accuracy, underscoring the severity of the gap. To systematically study this issue, we introduce CAPTCHA-X, the first real-world CAPTCHA benchmark with reasoning, covering seven categories of CAPTCHAs (such as Gobang, hCaptcha, etc.) with step-by-step action solutions and grounding annotations. We further define five reasoning-oriented metrics that enable a comprehensive evaluation of models' reasoning capabilities. To validate the effectiveness of reasoning, we also propose a general agentic VLM-based framework that incorporates the model's inherent reasoning abilities. Our method achieves state-of-the-art performance across five high-difficulty CAPTCHA types, with an average solving accuracy of 83.9 percent, substantially surpassing existing baselines. These results reveal the limitations of current models and highlight the importance of reasoning in advancing visual-spatial challenges in the future.
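The abstract refers to models emitting final coordinates that are checked against grounding annotations, and reports results as solving accuracy. A minimal sketch of one way such a score could be computed is given here. The containment criterion, the `Box` annotation format, and the function names `is_solved` and `solving_accuracy` are all assumptions made for illustration; the abstract does not specify how the paper scores a solution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned target region in pixels (hypothetical annotation format)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, x: float, y: float) -> bool:
        # Boundary points count as inside the region.
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


def is_solved(clicks, targets):
    """Assumed criterion: every predicted click, in order, must land inside
    the corresponding annotated target region."""
    if len(clicks) != len(targets):
        return False
    return all(box.contains(x, y) for (x, y), box in zip(clicks, targets))


def solving_accuracy(predictions, annotations):
    """Fraction of puzzles solved; 0.0 for an empty evaluation set."""
    if not annotations:
        return 0.0
    solved = sum(is_solved(c, t) for c, t in zip(predictions, annotations))
    return solved / len(annotations)


if __name__ == "__main__":
    # Two toy puzzles: the first is solved, the second misses its second target.
    annotations = [[Box(0, 0, 10, 10)], [Box(20, 20, 30, 30), Box(40, 40, 50, 50)]]
    predictions = [[(5, 5)], [(25, 25), (60, 60)]]
    print(solving_accuracy(predictions, annotations))
```

Under this assumed criterion a multi-click puzzle is all-or-nothing, so a single misplaced click fails the whole puzzle.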
Related papers
- LVLM-Aided Alignment of Task-Specific Vision Models [49.96265491629163]
Small task-specific vision models are crucial in high-stakes domains. We introduce a novel and efficient method for aligning small task-specific vision models with human domain knowledge. Our method demonstrates substantial improvement in aligning model behavior with human specifications.
arXiv Detail & Related papers (2025-12-26T11:11:25Z)
- COGNITION: From Evaluation to Defense against Multimodal LLM CAPTCHA Solvers [17.70082722524941]
Multimodal large language models (MLLMs) undermine the security guarantees of visual CAPTCHA. We evaluate 7 leading commercial and open-source MLLMs across 18 real-world CAPTCHA task types. We reveal that MLLMs can reliably solve recognition-oriented and low-interaction CAPTCHA tasks at human-like cost and latency.
arXiv Detail & Related papers (2025-12-02T01:23:10Z)
- CounterVQA: Evaluating and Improving Counterfactual Reasoning in Vision-Language Models for Video Understanding [13.628041236679229]
Vision Language Models (VLMs) have recently shown significant advancements in video understanding, but their capability for counterfactual reasoning remains underexplored. We introduce CounterVQA, a video-based benchmark featuring three progressive difficulty levels that assess different aspects of counterfactual reasoning. We develop a post-training method, CFGPT, that enhances a model's visual counterfactual reasoning ability by distilling its counterfactual reasoning capability from the language modality.
arXiv Detail & Related papers (2025-11-25T04:59:55Z)
- Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing [70.35701681177655]
Self-improvement has emerged as a mainstream paradigm for advancing the reasoning capabilities of large vision-language models. We introduce four efficient strategies to achieve head-tail re-balancing during the exploration-and-learning self-improvement process. Our methods consistently improve visual reasoning capabilities, outperforming vanilla self-improvement by 3.86 points on average.
arXiv Detail & Related papers (2025-10-30T13:26:58Z)
- Spatial CAPTCHA: Generatively Benchmarking Spatial Reasoning for Human-Machine Differentiation [15.668734718800065]
We present a novel human-verification framework that leverages fundamental differences in spatial reasoning between humans and MLLMs. Unlike existing CAPTCHAs which rely on low-level perception tasks that are vulnerable to modern AI, Spatial CAPTCHA generates dynamic questions requiring geometric reasoning, perspective-taking, and mental rotation. Evaluation on a corresponding benchmark, Spatial-CAPTCHA-Bench, demonstrates that humans vastly outperform 10 state-of-the-art MLLMs, with the best model achieving only 31.0% Pass@1 accuracy.
arXiv Detail & Related papers (2025-10-04T16:19:21Z)
- Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models [63.69856480318313]
AGILE formulates jigsaw solving as an interactive process, enabling the model to progressively engage with the environment. We show that AGILE substantially boosts performance on jigsaw tasks of varying complexity. We also demonstrate strong generalization across 9 general vision tasks, achieving an average improvement of 3.1%.
arXiv Detail & Related papers (2025-10-01T17:58:05Z)
- Perception-Consistency Multimodal Large Language Models Reasoning via Caption-Regularized Policy Optimization [72.30168853571216]
Multimodal large language models excel at tasks that integrate visual perception with symbolic reasoning. CapPO integrates two key mechanisms: (1) a caption-based consistency regularization, which minimizes the divergence between responses conditioned on raw images and those conditioned on captions, and (2) a KL-weighted advantage estimation scheme, which adaptively scales reinforcement signals to strengthen perceptually consistent trajectories.
arXiv Detail & Related papers (2025-09-26T04:32:26Z)
- Unveiling Chain of Step Reasoning for Vision-Language Models with Fine-grained Rewards [48.55501117313608]
We present chain of step reasoning for vision-language models, enabling accurate assessment of reasoning step quality. We present a simple, effective, and fully transparent framework, including the step-level reasoning data, process reward model (PRM), and reinforcement learning training. We believe this paper serves as a baseline for vision-language models and offers insights into more complex multimodal reasoning.
arXiv Detail & Related papers (2025-09-23T13:47:32Z)
- Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning [22.32435186013626]
We propose a causal representation learning framework wherein observed benchmark performance is modeled as a linear transformation of a few latent capability factors. Applying this approach to a comprehensive dataset encompassing over 1500 models evaluated across six benchmarks, we identify a concise three-node linear causal structure that reliably explains the observed performance variations.
arXiv Detail & Related papers (2025-06-12T06:07:42Z)
- Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability [53.51560766150442]
Critical tokens are elements within reasoning trajectories that significantly influence incorrect outcomes. We present a novel framework for identifying these tokens through rollout sampling. We show that identifying and replacing critical tokens significantly improves model accuracy.
arXiv Detail & Related papers (2024-11-29T18:58:22Z)
- Oedipus: LLM-enchanced Reasoning CAPTCHA Solver [17.074422329618212]
Oedipus is an innovative end-to-end framework for automated reasoning CAPTCHA solving.
Central to this framework is a novel strategy that dissects the complex and human-easy-AI-hard tasks into a sequence of simpler and AI-easy steps.
Our evaluation shows that Oedipus effectively resolves the studied CAPTCHAs, achieving an average success rate of 63.5%.
arXiv Detail & Related papers (2024-05-13T06:32:57Z)
- Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation and Enhancement Using the StepGame Benchmark [4.970614891967042]
We analyze GPT's spatial reasoning performance on the StepGame benchmark.
We identify proficiency in mapping natural language text to spatial relations but limitations in multi-hop reasoning.
We deploy Chain-of-thought and Tree-of-thoughts prompting strategies, offering insights into GPT's cognitive process.
arXiv Detail & Related papers (2024-01-08T16:13:08Z)
- MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation [60.65820977963331]
We introduce a novel evaluation paradigm for Large Language Models (LLMs).
This paradigm shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation.
By applying this paradigm in the GSM8K dataset, we have developed the MR-GSM8K benchmark.
arXiv Detail & Related papers (2023-12-28T15:49:43Z)
- Deep-CAPTCHA: a deep learning based CAPTCHA solver for vulnerability assessment [1.027974860479791]
This research investigates the weaknesses and vulnerabilities of the CAPTCHA generator systems.
We develop a Convolutional Neural Network called Deep-CAPTCHA to achieve this goal.
Our network achieves cracking accuracies of 98.94% and 98.31% on the numerical and the alpha-numerical test datasets, respectively.
arXiv Detail & Related papers (2020-06-15T11:44:43Z)