Related papers: InductionBench: LLMs Fail in the Simplest Complexity Class

InductionBench: LLMs Fail in the Simplest Complexity Class

URL: http://arxiv.org/abs/2502.15823v3
Date: Mon, 03 Mar 2025 16:38:10 GMT
Title: InductionBench: LLMs Fail in the Simplest Complexity Class
Authors: Wenyue Hua, Tyler Wong, Sun Fei, Liangming Pan, Adam Jardine, William Yang Wang,
Abstract summary: Large language models (LLMs) have shown remarkable improvements in reasoning.<n>Inductive reasoning, where one infers the underlying rules from observed data, remains less explored.<n>We introduce InductionBench, a new benchmark designed to evaluate the inductive reasoning ability of LLMs.
Score: 53.70978746199222
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have shown remarkable improvements in reasoning and many existing benchmarks have been addressed by models such as o1 and o3 either fully or partially. However, a majority of these benchmarks emphasize deductive reasoning, including mathematical and coding tasks in which rules such as mathematical axioms or programming syntax are clearly defined, based on which LLMs can plan and apply these rules to arrive at a solution. In contrast, inductive reasoning, where one infers the underlying rules from observed data, remains less explored. Such inductive processes lie at the heart of scientific discovery, as they enable researchers to extract general principles from empirical observations. To assess whether LLMs possess this capacity, we introduce InductionBench, a new benchmark designed to evaluate the inductive reasoning ability of LLMs. Our experimental findings reveal that even the most advanced models available struggle to master the simplest complexity classes within the subregular hierarchy of functions, highlighting a notable deficiency in current LLMs' inductive reasoning capabilities. Coda and data are available https://github.com/Wenyueh/inductive_reasoning_benchmark.

Related papers

On LLM-Based Scientific Inductive Reasoning Beyond Equations [51.61971971921903]
We propose the task of LLM-Based Scientific Inductive Reasoning Beyond Equations.<n>We introduce a new benchmark, SIRBench-V1, to evaluate the inductive reasoning abilities of LLMs in scientific settings.
arXiv Detail & Related papers (2025-09-12T10:11:52Z)
Revisiting LLM Reasoning via Information Bottleneck [57.519119962528166]
Large language models (LLMs) have recently demonstrated remarkable progress in reasoning capabilities through reinforcement learning with verifiable rewards (RLVR)<n>We present a theoretical characterization of LLM reasoning grounded in information bottleneck (IB) principle.<n>We propose IB-aware reasoning optimization (IBRO), a framework that encourages reasoning trajectories to be both informative about the final correct answer and generalizable.
arXiv Detail & Related papers (2025-07-24T13:14:25Z)
Patterns Over Principles: The Fragility of Inductive Reasoning in LLMs under Noisy Observations [43.491353243991284]
We introduce Robust Rule Induction, a task that evaluates large language models' capability in inferring rules from data fused with noisy examples.<n>We also propose Sample-steered Rule Refinement (SRR), a method enhancing reasoning stability via observation diversification and execution-guided feedback.<n>Our findings challenge LLMs' reasoning, revealing susceptibility to hypothesis drift and pattern overfitting, while providing empirical evidence critical for developing human-like inductive systems.
arXiv Detail & Related papers (2025-02-22T10:03:19Z)
JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models [51.99046112135311]
We introduce JustLogic, a synthetically generated deductive reasoning benchmark for rigorous evaluation of Large Language Models.<n>JustLogic is highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures.<n>Our experimental results reveal that most state-of-the-art (SOTA) LLMs perform significantly worse than the human average.
arXiv Detail & Related papers (2025-01-24T15:49:10Z)
Argumentation Computation with Large Language Models : A Benchmark Study [6.0682923348298194]
Large language models (LLMs) have made significant advancements in neuro-symbolic computing. We aim to investigate the capability of LLMs in determining the extensions of various abstract argumentation semantics.
arXiv Detail & Related papers (2024-12-21T18:23:06Z)
RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios [58.90106984375913]
RuleArena is a novel and challenging benchmark designed to evaluate the ability of large language models (LLMs) to follow complex, real-world rules in reasoning.<n> Covering three practical domains -- airline baggage fees, NBA transactions, and tax regulations -- RuleArena assesses LLMs' proficiency in handling intricate natural language instructions.
arXiv Detail & Related papers (2024-12-12T06:08:46Z)
MIRAGE: Evaluating and Explaining Inductive Reasoning Process in Language Models [19.81485079689837]
We evaluate large language models' capabilities in inductive and deductive stages. We find that the models tend to consistently conduct correct deduction without correct inductive rules. In the inductive reasoning process, the model tends to focus on observed facts that are close to the current test example in feature space.
arXiv Detail & Related papers (2024-10-12T14:12:36Z)
Inductive Learning of Logical Theories with LLMs: An Expressivity-Graded Analysis [9.865771016218549]
This work presents a novel systematic methodology to analyse the capabilities and limitations of Large Language Models (LLMs)<n>The analysis is complexity-graded w.r.t. rule dependency structure, allowing quantification of specific inference challenges on LLM performance.
arXiv Detail & Related papers (2024-08-15T16:41:00Z)
Inductive or Deductive? Rethinking the Fundamental Reasoning Abilities of LLMs [99.76347807139615]
Reasoning encompasses two typical types: deductive reasoning and inductive reasoning. Despite extensive research into the reasoning capabilities of Large Language Models (LLMs), most studies have failed to rigorously differentiate between inductive and deductive reasoning. This raises an essential question: In LLM reasoning, which poses a greater challenge - deductive or inductive reasoning?
arXiv Detail & Related papers (2024-07-31T18:47:11Z)
Case2Code: Scalable Synthetic Data for Code Generation [105.89741089673575]
Large Language Models (LLMs) have shown outstanding breakthroughs in code generation.<n>Recent work improves code LLMs by training on synthetic data generated by some powerful LLMs.<n>We propose a textbfCase2Code task by exploiting the expressiveness and correctness of programs.
arXiv Detail & Related papers (2024-07-17T11:35:00Z)
Can LLMs Reason with Rules? Logic Scaffolding for Stress-Testing and Improving LLMs [87.34281749422756]
Large language models (LLMs) have achieved impressive human-like performance across various reasoning tasks. However, their mastery of underlying inferential rules still falls short of human capabilities. We propose a logic scaffolding inferential rule generation framework, to construct an inferential rule base, ULogic.
arXiv Detail & Related papers (2024-02-18T03:38:51Z)
LLMs for Relational Reasoning: How Far are We? [8.840750655261251]
Large language models (LLMs) have revolutionized many areas by achieving state-of-the-art performance on downstream tasks. Recent efforts have demonstrated that the LLMs are poor at solving sequential decision-making problems.
arXiv Detail & Related papers (2024-01-17T08:22:52Z)
Phenomenal Yet Puzzling: Testing Inductive Reasoning Capabilities of Language Models with Hypothesis Refinement [92.61557711360652]
Language models (LMs) often fall short on inductive reasoning, despite achieving impressive success on research benchmarks. We conduct a systematic study of the inductive reasoning capabilities of LMs through iterative hypothesis refinement. We reveal several discrepancies between the inductive reasoning processes of LMs and humans, shedding light on both the potentials and limitations of using LMs in inductive reasoning tasks.
arXiv Detail & Related papers (2023-10-12T17:51:10Z)
Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models [56.34029644009297]
Large language models (LLMs) have demonstrated the ability to overcome various limitations of formal Knowledge Representation (KR) systems. LLMs excel most in abductive reasoning, followed by deductive reasoning, while they are least effective at inductive reasoning. We study single-task training, multi-task training, and "chain-of-thought" knowledge distillation fine-tuning technique to assess the performance of model.
arXiv Detail & Related papers (2023-10-02T01:00:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.