Token-Hungry, Yet Precise: DeepSeek R1 Highlights the Need for Multi-Step Reasoning Over Speed in MATH
- URL: http://arxiv.org/abs/2501.18576v1
- Date: Thu, 30 Jan 2025 18:45:51 GMT
- Title: Token-Hungry, Yet Precise: DeepSeek R1 Highlights the Need for Multi-Step Reasoning Over Speed in MATH
- Authors: Evgenii Evstafev
- Abstract summary: This study investigates the performance of the DeepSeek R1 language model on 30 challenging mathematical problems. DeepSeek R1 achieves superior accuracy on these complex problems but generates significantly more tokens than other models. The findings highlight a trade-off between accuracy and efficiency in mathematical problem-solving with large language models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study investigates the performance of the DeepSeek R1 language model on 30 challenging mathematical problems derived from the MATH dataset, problems that previously proved unsolvable by other models under time constraints. Unlike prior work, this research removes time limitations to explore whether DeepSeek R1's architecture, known for its reliance on token-based reasoning, can achieve accurate solutions through a multi-step process. The study compares DeepSeek R1 with four other models (gemini-1.5-flash-8b, gpt-4o-mini-2024-07-18, llama3.1:8b, and mistral-8b-latest) across 11 temperature settings. Results demonstrate that DeepSeek R1 achieves superior accuracy on these complex problems but generates significantly more tokens than other models, confirming its token-intensive approach. The findings highlight a trade-off between accuracy and efficiency in mathematical problem-solving with large language models: while DeepSeek R1 excels in accuracy, its reliance on extensive token generation may not be optimal for applications requiring rapid responses. The study underscores the importance of considering task-specific requirements when selecting an LLM and emphasizes the role of temperature settings in optimizing performance.
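The evaluation protocol described in the abstract is straightforward to prototype: sweep each model over a temperature grid and record accuracy and token usage per run. Below is a minimal Python sketch of such a harness; `query_model`, the problem-record fields, and the assumed [0.0, 1.0] temperature range are illustrative placeholders, not the paper's actual code.

```python
# Minimal sketch of a temperature-sweep evaluation harness. `query_model`,
# the problem-record fields, and the [0.0, 1.0] grid are assumptions.
from dataclasses import dataclass

@dataclass
class RunResult:
    correct: int
    total: int
    tokens: int

def query_model(model: str, question: str, temperature: float) -> tuple[str, int]:
    """Placeholder: call the model's API, return (final_answer, tokens_used)."""
    raise NotImplementedError

def sweep(models: list[str], problems: list[dict], temps: list[float]) -> dict:
    results: dict[tuple[str, float], RunResult] = {}
    for model in models:
        for temp in temps:
            correct = tokens = 0
            for item in problems:
                answer, used = query_model(model, item["question"], temp)
                tokens += used
                correct += answer.strip() == item["answer"].strip()
            results[(model, temp)] = RunResult(correct, len(problems), tokens)
    return results

# 11 temperature settings, assumed evenly spaced over [0.0, 1.0].
temps = [round(i / 10, 1) for i in range(11)]
```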
Related papers
- Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math
Chain-of-Thought (CoT) significantly enhances formal reasoning capabilities in Large Language Models (LLMs).
However, improving reasoning in Small Language Models (SLMs) remains challenging due to their limited model capacity.
We present a systematic training recipe for SLMs that consists of four steps: (1) large-scale mid-training on diverse distilled long-CoT data, (2) supervised fine-tuning on high-quality long-CoT data, (3) Rollout DPO leveraging a carefully curated preference dataset, and (4) Reinforcement Learning (RL) with Verifiable Reward.
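As a rough illustration, the four stages can be laid out as a simple pipeline. The stage names below follow the summary; the `train` stub and data descriptions are hypothetical placeholders, not the paper's actual configuration.

```python
# Illustrative outline of the four-stage SLM training recipe summarized above.
# Stage names follow the summary; everything else is a placeholder.
STAGES = [
    ("mid-training",          "large-scale distilled long-CoT data"),
    ("supervised-ft",         "high-quality long-CoT data"),
    ("rollout-dpo",           "carefully curated preference pairs"),
    ("rl-verifiable-reward",  "problems with automatically checkable answers"),
]

def train(checkpoint: str, stage: str, data: str) -> str:
    """Placeholder for one training stage; returns the updated checkpoint name."""
    return f"{checkpoint}+{stage}"

checkpoint = "slm-base"
for stage, data in STAGES:
    checkpoint = train(checkpoint, stage, data)
    print(f"after {stage:<22} ({data}): {checkpoint}")
```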
arXiv Detail & Related papers (2025-04-30T00:04:35Z) - R-PRM: Reasoning-Driven Process Reward Modeling
Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step.
Existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy.
We propose Reasoning-Driven Process Reward Modeling (R-PRM).
R-PRM generates seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities.
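The core idea lends itself to a short sketch: rather than regressing a bare score, the judge model writes an analysis of each step and ends with a parseable verdict. `judge`, the prompt template, and the verdict parsing below are assumptions for illustration, not the paper's recipe.

```python
# Sketch of reasoning-driven step evaluation: the judge analyzes the step
# before giving a verdict, instead of emitting a bare score.
def judge(prompt: str) -> str:
    """Placeholder: call the reward model, return its full text output."""
    raise NotImplementedError

def score_step(question: str, steps: list[str], idx: int) -> float:
    prompt = (
        f"Problem: {question}\n"
        f"Solution so far: {' '.join(steps[: idx + 1])}\n"
        "Analyze whether the last step is logically valid, then end with "
        "'Verdict: correct' or 'Verdict: incorrect'."
    )
    return 1.0 if "verdict: correct" in judge(prompt).lower() else 0.0
```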
arXiv Detail & Related papers (2025-03-27T09:23:08Z) - Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking
We propose a simple yet effective test-time scaling approach, Multi-round Thinking.
This method iteratively refines model reasoning by leveraging previous answers as prompts for subsequent rounds.
Experiments across multiple models, including QwQ-32B and DeepSeek-R1, consistently show performance improvements.
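The round-to-round loop is simple to sketch. In the snippet below, `generate` stands in for a single model call, and the re-examination prompt is an illustrative paraphrase of the idea rather than the paper's exact template.

```python
# Sketch of the Multi-round Thinking loop: each round feeds the previous
# answer back so the model can reconsider it. `generate` is a hypothetical
# single-call wrapper.
def generate(prompt: str) -> str:
    """Placeholder: one call to the underlying reasoning model."""
    raise NotImplementedError

def multi_round_thinking(question: str, rounds: int = 2) -> str:
    answer = generate(question)
    for _ in range(rounds - 1):
        answer = generate(
            f"{question}\n\n"
            f"A previous attempt answered: {answer}\n"
            "Re-examine that answer independently and give your final answer."
        )
    return answer
```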
arXiv Detail & Related papers (2025-03-25T17:19:38Z) - 1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training
AM-DeepSeek-R1-Distilled is a large-scale dataset of thinking traces for general reasoning tasks.
The AM-Distill-Qwen-32B model, trained with simple Supervised Fine-Tuning (SFT) alone, outperformed the DeepSeek-R1-Distill-Qwen-32B model on four benchmarks.
arXiv Detail & Related papers (2025-03-25T13:19:46Z) - START: Self-taught Reasoner with Tools
We introduce START (Self-Taught Reasoner with Tools), a tool-integrated long Chain-of-thought (CoT) reasoning LLM.
START is capable of performing complex computations, self-checking, exploring diverse methods, and self-debugging.
It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B.
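A tool-integrated reasoning loop of this flavor can be sketched in a few lines: the model emits fenced code, an executor runs it, and the output is appended to the transcript so the model can check its own work. `generate`, the fence convention, and the turn limit are assumptions, not START's actual interface.

```python
# Sketch of a tool-integrated reasoning loop in the spirit of START.
import re
import subprocess

FENCE = chr(96) * 3  # triple-backtick marker, built indirectly so this sketch nests cleanly

def generate(transcript: str) -> str:
    """Placeholder: one LLM call that may emit a fenced python block."""
    raise NotImplementedError

def run_python(code: str) -> str:
    proc = subprocess.run(["python", "-c", code],
                          capture_output=True, text=True, timeout=30)
    return proc.stdout + proc.stderr

def tool_integrated_answer(question: str, max_turns: int = 4) -> str:
    transcript = question
    for _ in range(max_turns):
        reply = generate(transcript)
        transcript += "\n" + reply
        match = re.search(FENCE + r"python\n(.*?)" + FENCE, reply, re.DOTALL)
        if match is None:            # no tool call: treat the reply as final
            return reply
        transcript += "\n[Execution output]\n" + run_python(match.group(1))
    return transcript
```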
arXiv Detail & Related papers (2025-03-06T17:11:51Z) - Bag of Tricks for Inference-time Computation of LLM Reasoning
We investigate and benchmark diverse inference-time computation strategies across reasoning tasks of varying complexity.
Our ablation studies reveal that previously overlooked strategies can significantly enhance performance.
We establish a standardized benchmark for inference-time computation by systematically evaluating six representative methods across eight reasoning tasks.
arXiv Detail & Related papers (2025-02-11T02:31:11Z) - DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero is trained via large-scale reinforcement learning. DeepSeek-R1 incorporates multi-stage training and cold-start data before RL.
arXiv Detail & Related papers (2025-01-22T15:19:35Z) - Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
o1-like models can emulate human-like long-time thinking during inference. This paper presents the first comprehensive study on the prevalent issue of overthinking in these models. We propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy.
arXiv Detail & Related papers (2024-12-30T18:55:12Z) - Large Language Models for Scholarly Ontology Generation: An Extensive Analysis in the Engineering Field
This paper offers an analysis of the ability of large language models to identify semantic relationships between different research topics. We developed a gold standard based on the IEEE Thesaurus to evaluate the task. Several models have achieved outstanding results, including Mixtral-8x7B, Dolphin-Mistral, and Claude 3-7B.
arXiv Detail & Related papers (2024-12-11T10:11:41Z) - MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time
MindStar is a purely inference-based searching method for large language models.
It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
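One of those search ideas, beam-style expansion of partial reasoning paths under a reward model, can be sketched as follows; `propose_steps` and `score_path` are hypothetical stand-ins for the step generator and the scoring model, not MindStar's actual interface.

```python
# Beam-style sketch of searching over reasoning paths: expand partial chains
# of thought step by step and keep the highest-scoring candidates.
import heapq

def propose_steps(question: str, path: list[str], k: int) -> list[str]:
    """Placeholder: sample k candidate next reasoning steps from an LLM."""
    raise NotImplementedError

def score_path(question: str, path: list[str]) -> float:
    """Placeholder: reward-model score for a partial reasoning path."""
    raise NotImplementedError

def search_reasoning(question: str, beam: int = 4, depth: int = 8) -> list[str]:
    frontier: list[tuple[list[str], float]] = [([], 0.0)]
    for _ in range(depth):
        candidates = []
        for path, _ in frontier:
            for step in propose_steps(question, path, k=beam):
                new_path = path + [step]
                candidates.append((new_path, score_path(question, new_path)))
        # keep the `beam` best partial paths by reward score
        frontier = heapq.nlargest(beam, candidates, key=lambda c: c[1])
    return max(frontier, key=lambda c: c[1])[0]
```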
arXiv Detail & Related papers (2024-05-25T15:07:33Z) - Advancing LLM Reasoning Generalists with Preference Trees
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z) - Towards Sustainable Learning: Coresets for Data-efficient Deep Learning
CREST is the first scalable subset-selection framework for deep networks with rigorous theoretical guarantees, supported by extensive experiments on several datasets.
CREST identifies the most valuable examples of a non-convex loss function.
arXiv Detail & Related papers (2023-06-02T02:51:08Z) - Representation Learning with Multi-Step Inverse Kinematics: An Efficient and Optimal Approach to Rich-Observation RL
Existing reinforcement learning algorithms suffer from computational intractability, strong statistical assumptions, and suboptimal sample complexity.
We provide the first computationally efficient algorithm that attains rate-optimal sample complexity with respect to the desired accuracy level.
Our algorithm, MusIK, combines systematic exploration with representation learning based on multi-step inverse kinematics.
arXiv Detail & Related papers (2023-04-12T14:51:47Z) - An Experimental Review on Deep Learning Architectures for Time Series Forecasting
We provide the most extensive deep learning study for time series forecasting.
Among all studied models, the results show that long short-term memory (LSTM) and convolutional neural networks (CNN) are the best alternatives.
CNNs achieve comparable performance with less variability of results under different parameter configurations, while also being more efficient.
arXiv Detail & Related papers (2021-03-22T17:58:36Z)