A Study of Rule Omission in Raven's Progressive Matrices
- URL: http://arxiv.org/abs/2510.03127v1
- Date: Fri, 03 Oct 2025 15:53:28 GMT
- Title: A Study of Rule Omission in Raven's Progressive Matrices
- Authors: Binze Li
- Abstract summary: Analogical reasoning lies at the core of human cognition and remains a fundamental challenge for artificial intelligence. This study investigates the generalization capacity of modern AI systems under conditions of incomplete training. Experiments reveal that although transformers demonstrate strong performance on familiar rules, their accuracy declines sharply when faced with novel or omitted rules.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Analogical reasoning lies at the core of human cognition and remains a fundamental challenge for artificial intelligence. Raven's Progressive Matrices (RPM) serve as a widely used benchmark to assess abstract reasoning by requiring the inference of underlying structural rules. While many vision-based and language-based models have achieved success on RPM tasks, it remains unclear whether their performance reflects genuine reasoning ability or reliance on statistical shortcuts. This study investigates the generalization capacity of modern AI systems under conditions of incomplete training by deliberately omitting several structural rules during training. Both sequence-to-sequence transformer models and vision-based architectures such as CoPINet and the Dual-Contrast Network are evaluated on the Impartial-RAVEN (I-RAVEN) dataset. Experiments reveal that although transformers demonstrate strong performance on familiar rules, their accuracy declines sharply when faced with novel or omitted rules. Moreover, the gap between token-level accuracy and complete answer accuracy highlights fundamental limitations in current approaches. These findings provide new insights into the reasoning mechanisms underlying deep learning models and underscore the need for architectures that move beyond pattern recognition toward robust abstract reasoning.
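To make the evaluation protocol concrete, below is a minimal Python sketch of the rule-omission setup the abstract describes: matrices governed by a held-out rule are removed from the training split, and a sequence-to-sequence model's outputs are scored both per token and per complete answer, the two accuracies whose gap the paper highlights. All identifiers here (Example, OMITTED_RULES, the particular rule names) are illustrative assumptions, not the paper's released code.

```python
# Minimal sketch of the rule-omission protocol; all names are hypothetical.
from dataclasses import dataclass

@dataclass
class Example:
    panels: list              # context panels plus answer candidates
    rules: set                # structural rules in the matrix, e.g. {"Progression"}
    target_tokens: list       # tokenized ground-truth answer for a seq2seq model

# Example held-out rules; the paper omits several structural rules at training time.
OMITTED_RULES = {"Arithmetic", "Distribute_Three"}

def split_by_rule(dataset):
    """Drop every training example that uses an omitted rule, so those
    rules are only encountered at test time."""
    train = [ex for ex in dataset if not (ex.rules & OMITTED_RULES)]
    test_novel = [ex for ex in dataset if ex.rules & OMITTED_RULES]
    return train, test_novel

def token_and_sequence_accuracy(predictions, targets):
    """Token-level accuracy rewards partially correct answers; complete-answer
    accuracy requires every token to match. The gap between the two is the
    limitation the abstract points to."""
    token_hits = total_tokens = seq_hits = 0
    for pred, tgt in zip(predictions, targets):
        token_hits += sum(p == t for p, t in zip(pred, tgt))
        total_tokens += len(tgt)
        seq_hits += int(pred == tgt)
    return token_hits / total_tokens, seq_hits / len(targets)
```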
Related papers
- Native Reasoning Models: Training Language Models to Reason on Unverifiable Data [16.065264121785294]
We introduce NRT (Native Reasoning Training), a novel framework that cultivates complex reasoning. NRT reframes the training problem by treating the reasoning process as a latent variable. NRT achieves state-of-the-art performance among verifier-free methods.
arXiv Detail & Related papers (2026-02-12T04:15:46Z) - How and Why LLMs Generalize: A Fine-Grained Analysis of LLM Reasoning from Cognitive Behaviors to Low-Level Patterns [51.02752099869218]
Large Language Models (LLMs) display strikingly different generalization behaviors. We introduce a novel benchmark that decomposes reasoning into atomic core skills. We show that RL-tuned models maintain more stable behavioral profiles and resist collapse in reasoning skills, whereas SFT models exhibit sharper drift and overfit to surface patterns.
arXiv Detail & Related papers (2025-12-30T08:16:20Z) - Optimizing Generative Ranking Relevance via Reinforcement Learning in Xiaohongshu Search [32.56725829132154]
We investigate whether explicit reasoning can enhance both interpretability and performance in relevance modeling. In this work, we formulate relevance modeling in Xiaohongshu search as a reasoning task. We introduce a Reinforcement Learning (RL)-based training framework to enhance the grounded reasoning capabilities of GRMs.
arXiv Detail & Related papers (2025-11-30T16:31:16Z) - STaR: Towards Cognitive Table Reasoning via Slow-Thinking Large Language Models [12.745473719032026]
We present STaR (slow-thinking for table reasoning), a new framework achieving cognitive table reasoning. STaR explicitly models step-by-step thinking and uncertainty-aware inference. Experiments on benchmarks demonstrate that STaR achieves superior performance and enhanced reasoning stability.
arXiv Detail & Related papers (2025-11-14T12:34:17Z) - LTD-Bench: Evaluating Large Language Models by Letting Them Draw [57.237152905238084]
LTD-Bench is a breakthrough benchmark for large language models (LLMs). It transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigate model similarity.
arXiv Detail & Related papers (2025-11-04T08:11:23Z) - VAR: Visual Attention Reasoning via Structured Search and Backtracking [49.427842994857635]
We introduce Visual Attention Reasoning (VAR), a framework that recasts grounded reasoning as a structured search. VAR decomposes the reasoning process into two key stages: traceable evidence grounding and search-based chain-of-thought. We show that our 7B model, VAR-7B, sets a new state-of-the-art on a comprehensive suite of hallucination and safety benchmarks.
arXiv Detail & Related papers (2025-10-21T13:18:44Z) - RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark [71.3555284685426]
We introduce RealUnify, a benchmark designed to evaluate bidirectional capability synergy. RealUnify comprises 1,000 meticulously human-annotated instances spanning 10 categories and 32 subtasks. We find that current unified models still struggle to achieve effective synergy, indicating that architectural unification alone is insufficient.
arXiv Detail & Related papers (2025-09-29T15:07:28Z) - How LLMs Learn to Reason: A Complex Network Perspective [14.638878448692493]
Training large language models with Reinforcement Learning from Verifiable Rewards exhibits a set of puzzling behaviors. We propose that these seemingly disparate phenomena can be explained using a single unifying theory. Our work provides a new physical intuition for engineering the emergent reasoning capabilities of future AI systems.
arXiv Detail & Related papers (2025-09-28T04:10:37Z) - Characteristic Root Analysis and Regularization for Linear Time Series Forecasting [9.254995889539716]
Time series forecasting remains a critical challenge across numerous domains. Recent studies highlight the surprising competitiveness of simple linear models. This paper focuses on the role of characteristic roots in temporal dynamics.
arXiv Detail & Related papers (2025-09-28T03:06:30Z) - Sycophancy Mitigation Through Reinforcement Learning with Uncertainty-Aware Adaptive Reasoning Trajectories [58.988535279557546]
We introduce SMART (Sycophancy Mitigation through Adaptive Reasoning Trajectories). We show that SMART significantly reduces sycophantic behavior while preserving strong performance on out-of-distribution inputs.
arXiv Detail & Related papers (2025-09-20T17:09:14Z) - Reasoning Model Unlearning: Forgetting Traces, Not Just Answers, While Preserving Reasoning Skills [32.96074934023323]
Large reasoning models (LRMs) have enabled strong chain-of-thought (CoT) generation through test-time computation. We show that conventional unlearning algorithms, originally designed for non-reasoning models, are inadequate for LRMs. We propose Reasoning-aware Representation Misdirection for Unlearning (R2MU), a novel method that effectively suppresses sensitive reasoning traces.
arXiv Detail & Related papers (2025-06-15T20:54:23Z) - Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective [59.7140089198992]
We develop a mathematical framework that defines abstract reasoning as the ability to extract essential patterns. We introduce two novel complementary metrics: Γ measures basic reasoning accuracy, while Δ quantifies a model's reliance on specific symbols.
arXiv Detail & Related papers (2025-05-28T09:02:45Z) - Two Experts Are All You Need for Steering Thinking: Reinforcing Cognitive Effort in MoE Reasoning Models Without Additional Training [86.70255651945602]
We introduce a novel inference-time steering methodology called Reinforcing Cognitive Experts (RICE). RICE aims to improve reasoning performance without additional training or complex heuristics. Empirical evaluations with leading MoE-based LRMs demonstrate noticeable and consistent improvements in reasoning accuracy, cognitive efficiency, and cross-domain generalization.
arXiv Detail & Related papers (2025-05-20T17:59:16Z) - Causality can systematically address the monsters under the bench(marks) [64.36592889550431]
Benchmarks are plagued by various biases, artifacts, or leakage. Models may behave unreliably due to poorly explored failure modes. Causality offers an ideal framework to systematically address these challenges.
arXiv Detail & Related papers (2025-02-07T17:01:37Z) - On the Reasoning Capacity of AI Models and How to Quantify It [0.0]
Large Language Models (LLMs) have intensified the debate surrounding the fundamental nature of their reasoning capabilities. While achieving high performance on benchmarks such as GPQA and MMLU, these models exhibit limitations in more complex reasoning tasks. We propose a novel phenomenological approach that goes beyond traditional accuracy metrics to probe the underlying mechanisms of model behavior.
arXiv Detail & Related papers (2025-01-23T16:58:18Z) - Is it the model or the metric -- On robustness measures of deep learning models [2.8169948004297565]
We revisit robustness, investigating the sufficiency of robust accuracy (RA) in the context of deepfake detection. We present a comparison of RA and RR and demonstrate that, despite similar RA between models, the models show varying RR under different tolerance (perturbation) levels.
arXiv Detail & Related papers (2024-12-13T02:26:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.