Related papers: When Reasoning Meets Compression: Benchmarking Compressed Large Reasoning Models on Complex Reasoning Tasks

URL: http://arxiv.org/abs/2504.02010v1
Date: Wed, 02 Apr 2025 05:17:46 GMT
Title: When Reasoning Meets Compression: Benchmarking Compressed Large Reasoning Models on Complex Reasoning Tasks
Authors: Nan Zhang, Yusen Zhang, Prasenjit Mitra, Rui Zhang
Abstract summary: Compression of large language models (LLMs) offers an effective solution to reduce the cost of computational resources. We benchmark compressed DeepSeek-R1 models on four different reasoning datasets. We find that parameter count has a much greater impact on LRMs' knowledge memorization than on their reasoning capability.
Score: 11.656636716718175
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent open-source large reasoning models (LRMs) exhibit strong performance on complex reasoning tasks, but their large parameter count makes them prohibitively expensive for individuals. The compression of large language models (LLMs) offers an effective solution to reduce cost of computational resources. However, systematic studies on the performance of compressed LLMs in complex reasoning tasks, especially for LRMs, are lacking. Most works on quantization and pruning focus on preserving language modeling performance, while existing distillation works do not comprehensively benchmark student models based on reasoning difficulty or compression impact on knowledge and reasoning. In this paper, we benchmark compressed DeepSeek-R1 models on four different reasoning datasets (AIME 2024, FOLIO, Temporal Sequences of BIG-Bench Hard, and MuSiQue), ranging from mathematical to multihop reasoning, using quantization, distillation, and pruning methods. We benchmark 2.51-, 1.73-, and 1.58-bit R1 models that adopt dynamic quantization. We also benchmark distilled R1 models that are based on LLaMA or Qwen and run SparseGPT on them to obtain various sparsity levels. Studying the performance and behavior of compressed LRMs, we report their performance scores and test-time compute (number of tokens spent on each question). Notably, using MuSiQue, we find that parameter count has a much greater impact on LRMs' knowledge memorization than on their reasoning capability, which can inform the choice of compression techniques. Through our empirical analysis of test-time compute, we find that shorter model outputs generally achieve better performance than longer ones across several benchmarks for both R1 and its compressed variants, highlighting the need for more concise reasoning chains.

Related papers

SplitReason: Learning To Offload Reasoning [7.016347390223799]
Reasoning in large language models (LLMs) tends to produce substantially longer token generation sequences than simpler language modeling tasks. We leverage this by offloading only the most challenging parts of the reasoning process to a larger, more capable model. This approach improves AIME24 reasoning accuracy by 24% and 28.3% while offloading 1.35% and 5% of the generated tokens respectively.
arXiv Detail & Related papers (2025-04-23T03:00:02Z)
M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models [72.75501495786297]
We introduce a novel hybrid linear RNN reasoning model, M1, built on the Mamba architecture. Experimental results show that M1 not only outperforms previous linear RNN models but also matches the performance of state-of-the-art DeepSeek R1 distilled reasoning models.
arXiv Detail & Related papers (2025-04-14T17:38:25Z)
SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning [14.020244011380063]
SpecReason is a system that accelerates LRM inference by using a lightweight model to (speculatively) carry out simpler intermediate reasoning steps. It achieves 1.5-2.5$\times$ speedup over vanilla LRM inference while improving accuracy by 1.0-9.9%. Compared to speculative decoding without SpecReason, their combination yields an additional 19.4-44.2% latency reduction.
arXiv Detail & Related papers (2025-04-10T16:05:19Z)
Revisiting Prompt Optimization with Large Reasoning Models-A Case Study on Event Extraction [8.88001387249786]
Large Reasoning Models (LRMs) such as DeepSeek-R1 and OpenAI o1 have demonstrated remarkable capabilities in various reasoning tasks. Their strong capability to generate and reason over intermediate thoughts has led to arguments that they may no longer require extensive prompt engineering or optimization to interpret human instructions. In this work, we aim to systematically study this open question, using the structured task of event extraction for a case study.
arXiv Detail & Related papers (2025-04-10T00:53:59Z)
START: Self-taught Reasoner with Tools [51.38785489790888]
We introduce START (Self-Taught Reasoner with Tools), a tool-integrated long Chain-of-thought (CoT) reasoning LLM. START is capable of performing complex computations, self-checking, exploring diverse methods, and self-debugging. It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B.
arXiv Detail & Related papers (2025-03-06T17:11:51Z)
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs [76.43407125275202]
o1-like models can emulate human-like long-time thinking during inference. This paper presents the first comprehensive study on the prevalent issue of overthinking in these models. We propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy.
arXiv Detail & Related papers (2024-12-30T18:55:12Z)
LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment [36.958867918858296]
Large language models (LLMs) have demonstrated their strong intelligence ability, but high demand for computation and storage hinders their practical application. We present the Large Language Model Compression Benchmark (LLMCBench), a rigorously designed benchmark with an in-depth analysis for LLM compression algorithms.
arXiv Detail & Related papers (2024-10-28T14:45:01Z)
Mixture of Parrots: Experts improve memorization more than reasoning [72.445819694797]
We show that as we increase the number of experts, the memorization performance consistently increases while the reasoning capabilities saturate. We find that increasing the number of experts helps solve knowledge-intensive tasks, but fails to yield the same benefits for reasoning tasks.
arXiv Detail & Related papers (2024-10-24T17:54:41Z)
A Comparative Study on Reasoning Patterns of OpenAI's o1 Model [69.08287909042421]
We show that OpenAI's o1 model has achieved the best performance on most datasets. We also provide a detailed analysis on several reasoning benchmarks.
arXiv Detail & Related papers (2024-10-17T15:09:03Z)
MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present a process-based benchmark MR-Ben that demands a meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate the computational and memory demands of large language models by compressing and accelerating them. We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization. Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
arXiv Detail & Related papers (2024-05-09T11:49:05Z)
The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models [11.156816338995503]
Compressing large language models (LLMs) provides faster inference and smaller memory footprints, and enables local deployment. Two standard compression techniques are pruning and quantization, with the former eliminating redundant connections in model layers and the latter representing model parameters with fewer bits. Existing research on LLM compression primarily focuses on performance in terms of general metrics like perplexity or downstream task accuracy. More fine-grained metrics, such as those measuring parametric knowledge, remain significantly underexplored.
arXiv Detail & Related papers (2023-12-01T22:27:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.