Related papers: Large Language Models Imitate Logical Reasoning, but at what Cost?

URL: http://arxiv.org/abs/2509.12645v1
Date: Tue, 16 Sep 2025 04:03:42 GMT
Title: Large Language Models Imitate Logical Reasoning, but at what Cost?
Authors: Lachlan McGinness, Peter Baumgartner
Abstract summary: We present a study which evaluates the reasoning capability of frontier Large Language Models over an eighteen-month period. We measured the accuracy of three leading models from December 2023, September 2024 and June 2025 on true or false questions. The improvement in performance from 2023 to 2024 can be attributed to hidden Chain of Thought prompting.
Score: 0.42970700836450487
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present a longitudinal study which evaluates the reasoning capability of frontier Large Language Models over an eighteen-month period. We measured the accuracy of three leading models from December 2023, September 2024 and June 2025 on true or false questions from the PrOntoQA dataset and their faithfulness to reasoning strategies provided through in-context learning. The improvement in performance from 2023 to 2024 can be attributed to hidden Chain of Thought prompting. The introduction of thinking models allowed for significant improvement in model performance between 2024 and 2025. We then present a neuro-symbolic architecture which uses LLMs of fewer than 15 billion parameters to translate the problems into a standardised form. We then parse the standardised forms of the problems into a program to be solved by Z3, an SMT solver, to determine the satisfiability of the query. We report the number of prompt and completion tokens as well as the computational cost in FLOPs for open-source models. The neuro-symbolic approach significantly reduces the computational cost while maintaining near-perfect performance. The common approximation that the number of inference FLOPs is double the product of the active parameters and total tokens was accurate to within 10% for all experiments.
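
The FLOPs approximation mentioned at the end of the abstract (inference FLOPs are roughly twice the product of active parameters and total tokens) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the 14-billion-parameter model size, the token counts and the "measured" figure are all hypothetical values chosen for the example.

```python
import math


def approx_inference_flops(active_params: float,
                           prompt_tokens: int,
                           completion_tokens: int) -> float:
    """Estimate inference FLOPs as 2 * active parameters * total tokens.

    Total tokens is taken here as prompt plus completion tokens, which is
    an assumption about how the abstract counts tokens.
    """
    total_tokens = prompt_tokens + completion_tokens
    return 2.0 * active_params * total_tokens


def within_tolerance(estimate: float, measured: float, tol: float = 0.10) -> bool:
    """Return True if the estimate is within `tol` (relative) of `measured`."""
    return abs(estimate - measured) <= tol * measured


if __name__ == "__main__":
    # Hypothetical example: a 14B-parameter dense model, 500 prompt tokens
    # and 100 completion tokens. None of these numbers come from the paper.
    estimate = approx_inference_flops(14e9, 500, 100)
    print(f"Estimated FLOPs: {estimate:.3e}")

    # Hypothetical measured value, used only to show the 10% check.
    measured = 1.8e13
    print("Within 10%:", within_tolerance(estimate, measured))
```

For a mixture-of-experts model, `active_params` would be the parameters activated per token rather than the total parameter count, which is why the approximation is stated in terms of active parameters.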

Related papers

Prescriptive Scaling Reveals the Evolution of Language Model Capabilities [22.14002750185524]
We estimate capability boundaries, high conditional quantiles of benchmark scores as a function of log pre-training FLOPs. We validate the temporal reliability by fitting on earlier model generations and evaluating on later releases. We introduce an efficient algorithm that recovers near-full data frontiers using roughly 20% of the evaluation budget.
arXiv Detail & Related papers (2026-02-17T03:13:51Z)
Evaluating Large Language Models on the 2026 Korean CSAT Mathematics Exam: Measuring Mathematical Ability in a Zero-Data-Leakage Setting [5.313647446600863]
This study systematically evaluated the mathematical reasoning capabilities of Large Language Models (LLMs) using the 2026 Korean College Scholastic Ability Test (CSAT) Mathematics section. To address data leakage issues in existing benchmarks, we digitized all 46 questions (22 common and 24 elective) within two hours of the exam's public release.
arXiv Detail & Related papers (2025-11-23T23:09:33Z)
Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model [100.86587937568832]
Ring-1T is the first open-source, state-of-the-art thinking model with trillion-scale parameters. It features 1 trillion total parameters and activates approximately 50 billion per token.
arXiv Detail & Related papers (2025-10-21T17:46:14Z)
Prompting Test-Time Scaling Is A Strong LLM Reasoning Data Augmentation [43.29267000439331]
Large language models (LLMs) have demonstrated impressive reasoning capabilities when provided with chain-of-thought exemplars. In this work, we introduce Prompting Test-Time Scaling (P-TTS), a simple yet effective inference-time data augmentation strategy.
arXiv Detail & Related papers (2025-10-10T17:57:04Z)
MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes [60.57770396565211]
We show that strong reasoning abilities can emerge with far less data. MobileLLM-R50M achieves an AIME score of 15.5, compared to just 0.6 for OLMo-2-1.48B and 0.3 for SmolLM-2-1.7B.
arXiv Detail & Related papers (2025-09-29T15:43:59Z)
Datarus-R1: An Adaptive Multi-Step Reasoning LLM for Automated Data Analysis [0.0]
We present Datarus-R1-14B, a language model fine-tuned from Qwen 2.5-14B-Instruct to act as a virtual data analyst and graduate-level problem solver. Datarus is trained not on isolated question-answer pairs but on full analytical trajectories including reasoning steps, code execution, error traces, self-corrections, and final conclusions.
arXiv Detail & Related papers (2025-08-18T21:58:18Z)
Logit Arithmetic Elicits Long Reasoning Capabilities Without Training [14.015546463427732]
Large reasoning models (LRMs) can do complex reasoning via long chain-of-thought (CoT) involving cognitive strategies such as backtracking and self-correction. Recent studies suggest that some models inherently possess these long reasoning abilities, which may be unlocked via extra training. We propose a decoding-time approach, ThinkLogit, to tune a target large LM for long reasoning using a substantially smaller model as a guider.
arXiv Detail & Related papers (2025-07-17T03:31:36Z)
Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code [76.80306464249217]
We propose TeaR, which aims at teaching LLMs to reason better. TeaR leverages careful data curation and reinforcement learning to guide models in discovering optimal reasoning paths through code-related tasks. We conduct extensive experiments using two base models and three long-CoT distillation models, with model sizes ranging from 1.5 billion to 32 billion parameters, and across 17 benchmarks spanning Math, Knowledge, Code, and Logical Reasoning.
arXiv Detail & Related papers (2025-07-10T07:34:05Z)
ASTRO: Teaching Language Models to Reason by Reflecting and Backtracking In-Context [66.15505423059234]
We introduce ASTRO, a framework for training language models to reason like search algorithms. We apply ASTRO to the Llama 3 family of models and achieve absolute performance gains of 16.4% on MATH-500, 26.9% on AMC 2023, and 20.0% on AIME 2024.
arXiv Detail & Related papers (2025-07-01T04:10:15Z)
Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning [65.2421542320293]
Reasoning abilities are crucial components of general intelligence. Recent advances by proprietary companies, such as o-series models of OpenAI, have made remarkable progress on reasoning tasks. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through Outcome REwArd-based reinforcement Learning for mathematical reasoning tasks.
arXiv Detail & Related papers (2025-02-10T18:57:29Z)
Beyond Scaling: Measuring and Predicting the Upper Bound of Knowledge Retention in Language Model Pre-Training [51.41246396610475]
This paper aims to predict performance in closed-book question answering (QA) without the help of external tools. We conduct large-scale retrieval and semantic analysis across the pre-training corpora of 21 publicly available and 3 custom-trained large language models. Building on these foundations, we propose Size-dependent Mutual Information (SMI), an information-theoretic metric that linearly correlates pre-training data characteristics.
arXiv Detail & Related papers (2025-02-06T13:23:53Z)
Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination [35.88131356701857]
This dataset consists of 1003 multiple-choice questions from university entrance-level exams in Spanish and English. A selection of current open-source and proprietary models is evaluated in a uniform zero-shot experimental setting.
arXiv Detail & Related papers (2024-09-19T13:13:07Z)
Tight Guarantees for Interactive Decision Making with the Decision-Estimation Coefficient [51.37720227675476]
We introduce a new variant of the Decision-Estimation Coefficient, and use it to derive new lower bounds that improve upon prior work on three fronts. We provide upper bounds on regret that scale with the same quantity, thereby closing all but one of the gaps between upper and lower bounds in Foster et al. Our results apply to both the regret framework and PAC framework, and make use of several new analysis and algorithm design techniques that we anticipate will find broader use.
arXiv Detail & Related papers (2023-01-19T18:24:08Z)
