Can Large Reasoning Models do Analogical Reasoning under Perceptual Uncertainty?
- URL: http://arxiv.org/abs/2503.11207v1
- Date: Fri, 14 Mar 2025 08:52:25 GMT
- Title: Can Large Reasoning Models do Analogical Reasoning under Perceptual Uncertainty?
- Authors: Giacomo Camposampiero, Michael Hersche, Roger Wattenhofer, Abu Sebastian, Abbas Rahimi
- Abstract summary: We evaluate OpenAI's o3-mini and DeepSeek R1 on analogical reasoning. We benchmark with the I-RAVEN dataset and its more difficult extension, I-RAVEN-X. We observe a sharp decline in OpenAI's o3-mini task accuracy, dropping from 86.6% on the original I-RAVEN to just 17.0% -- approaching random chance -- on the more challenging I-RAVEN-X.
- Score: 20.72570252804897
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work presents a first evaluation of two state-of-the-art Large Reasoning Models (LRMs), OpenAI's o3-mini and DeepSeek R1, on analogical reasoning, focusing on well-established nonverbal human IQ tests based on Raven's progressive matrices. We benchmark with the I-RAVEN dataset and its more difficult extension, I-RAVEN-X, which tests the ability to generalize to longer reasoning rules and larger ranges of attribute values. To assess the influence of visual uncertainties on these nonverbal analogical reasoning tests, we extend the I-RAVEN-X dataset, which otherwise assumes an oracle perception. We adopt a two-fold strategy to simulate this imperfect visual perception: 1) we introduce confounding attributes which, being sampled at random, do not contribute to the prediction of the puzzles' correct answer, and 2) we smooth the distributions of the input attributes' values. We observe a sharp decline in OpenAI's o3-mini task accuracy, dropping from 86.6% on the original I-RAVEN to just 17.0% -- approaching random chance -- on the more challenging I-RAVEN-X, which increases input length and range and emulates perceptual uncertainty. This drop occurred despite spending 3.4x more reasoning tokens. A similar trend is also observed for DeepSeek R1: from 80.6% to 23.2%. On the other hand, a neuro-symbolic probabilistic abductive model, ARLC, which achieves state-of-the-art performance on I-RAVEN, can robustly reason under all these out-of-distribution tests, maintaining strong accuracy with only a modest reduction from 98.6% to 88.0%. Our code is available at https://github.com/IBM/raven-large-language-models.
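The two-fold uncertainty-injection strategy lends itself to a compact illustration. The sketch below is a minimal, hypothetical Python example, not the paper's actual I-RAVEN-X pipeline: the attribute names (`size`, `color`), the 100-value ranges, the number of confounders, and the `certainty` parameter are all illustrative assumptions; the real implementation is in the linked repository.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical attribute vocabulary for a single puzzle panel; names and value
# ranges are placeholders, not the paper's actual I-RAVEN-X attributes.
PREDICTIVE_ATTRIBUTES = {"size": 100, "color": 100}  # attribute -> number of discrete values
NUM_CONFOUNDERS = 3                                  # extra attributes with no predictive value


def add_confounders(panel: dict) -> dict:
    """Strategy 1: append confounding attributes sampled uniformly at random,
    so they carry no information about the puzzle's answer."""
    noisy = dict(panel)
    for i in range(NUM_CONFOUNDERS):
        noisy[f"confounder_{i}"] = int(rng.integers(0, 100))
    return noisy


def smooth_attribute(value: int, num_values: int, certainty: float = 0.8) -> np.ndarray:
    """Strategy 2: replace an oracle (exact) attribute reading with a smoothed
    distribution: `certainty` mass on the true value, the remainder spread
    uniformly over all other values."""
    dist = np.full(num_values, (1.0 - certainty) / (num_values - 1))
    dist[value] = certainty
    return dist


# Example: one panel whose oracle perception would be size=42, color=7.
panel = add_confounders({"size": 42, "color": 7})
smoothed = {k: smooth_attribute(panel[k], n) for k, n in PREDICTIVE_ATTRIBUTES.items()}
print(panel)                                                 # now includes random confounders
print({k: float(d[panel[k]]) for k, d in smoothed.items()})  # probability mass on true values
```

Describing panels with such smoothed value distributions and extra confounders, instead of exact oracle attributes, is what lengthens the prompts and introduces the perceptual uncertainty under which the reported LRM accuracies collapse.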
Related papers
- FLIP Reasoning Challenge [20.706469085872516]
This paper introduces the FLIP dataset, a benchmark for evaluating AI reasoning capabilities based on human verification tasks.
FLIP challenges present users with two orderings of 4 images, requiring them to identify the coherent one.
Our experiments evaluate state-of-the-art models, leveraging both vision-language models (VLMs) and large language models (LLMs)
arXiv Detail & Related papers (2025-04-16T17:07:16Z)
- SEAL: Steerable Reasoning Calibration of Large Language Models for Free [58.190800043449336]
Large Language Models (LLMs) have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism.
Recent studies reveal substantial redundancy in the CoT reasoning traces, which negatively impacts model performance.
We introduce SEAL, a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains.
arXiv Detail & Related papers (2025-04-07T02:42:07Z)
- R-PRM: Reasoning-Driven Process Reward Modeling [53.06844294668382]
Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step.
Existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy.
We propose Reasoning-Driven Process Reward Modeling (R-PRM).
R-PRM generates seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities.
arXiv Detail & Related papers (2025-03-27T09:23:08Z)
- R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model [70.77691645678804]
We present the first successful replication of emergent characteristics for multimodal reasoning on only a non-SFT 2B model. Our model achieves 59.47% accuracy on CVBench, outperforming the base model by approximately 30% and exceeding both SFT settings by 2%. In addition, we share our failed attempts and insights in attempting to achieve R1-like reasoning using RL with instruct models.
arXiv Detail & Related papers (2025-03-07T04:21:47Z)
- START: Self-taught Reasoner with Tools [51.38785489790888]
We introduce START (Self-Taught Reasoner with Tools), a tool-integrated long chain-of-thought (CoT) reasoning LLM. START is capable of performing complex computations, self-checking, exploring diverse methods, and self-debugging. It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B.
arXiv Detail & Related papers (2025-03-06T17:11:51Z)
- Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
This paper introduces a novel benchmark, termed Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps. We find significant performance degradation on novel or incomplete data. These findings highlight the reliance on recall over rigorous logical inference.
arXiv Detail & Related papers (2025-03-06T15:36:06Z)
- Visual Reasoning Evaluation of Grok, Deepseek Janus, Gemini, Qwen, Mistral, and ChatGPT [0.0]
This study introduces a novel benchmark that integrates multi-image reasoning tasks with rejection-based evaluation and positional bias detection. We applied this benchmark to assess Grok 3, ChatGPT-4o, ChatGPT-o1, Gemini 2.0 Flash Experimental, DeepSeek Janus models, Qwen2.5-VL-72B-Instruct, QVQ-72B-Preview, and Pixtral 12B.
arXiv Detail & Related papers (2025-02-23T04:01:43Z)
- Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning [65.2421542320293]
Reasoning abilities are crucial components of general intelligence. Recent advances by proprietary companies, such as the o-series models of OpenAI, have made remarkable progress on reasoning tasks. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through Outcome REwArd-based reinforcement Learning for mathematical reasoning tasks.
arXiv Detail & Related papers (2025-02-10T18:57:29Z)
- Beyond Slow Signs in High-fidelity Model Extraction [18.330719989672442]
Deep neural networks, costly to train and rich in intellectual property value, are increasingly threatened by model extraction attacks.
Previous attacks have succeeded in reverse-engineering model parameters up to a precision of float64 for models trained on random data with at most three hidden layers.
We introduce a unified optimisation that integrates previous methods and reveal that computational tools can significantly influence performance.
arXiv Detail & Related papers (2024-06-14T13:24:07Z)