Related papers: Scalable LLM Math Reasoning Acceleration with Low-rank Distillation

URL: http://arxiv.org/abs/2505.07861v2
Date: Tue, 30 Sep 2025 14:59:16 GMT
Title: Scalable LLM Math Reasoning Acceleration with Low-rank Distillation
Authors: Harry Dong, Bilge Acun, Beidi Chen, Yuejie Chi
Abstract summary: We propose a resource-efficient distillation method to recover lost capabilities from deploying efficient inference methods. With original weights unperturbed, roughly 1% of additional parameters, and only 20K synthetic training samples, we are able to recover much if not all of the math capabilities lost.
Score: 57.922185576872444
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Due to long generations, large language model (LLM) math reasoning demands significant computational resources and time. While many existing efficient inference methods have been developed with excellent performance preservation on language tasks, they often severely degrade math performance. In this paper, we propose Caprese, a resource-efficient distillation method to recover lost capabilities from deploying efficient inference methods, focused primarily in feedforward blocks. With original weights unperturbed, roughly 1% of additional parameters, and only 20K synthetic training samples, we are able to recover much if not all of the math capabilities lost from efficient inference for thinking LLMs and without harm to language tasks for instruct LLMs. Moreover, Caprese slashes the number of active parameters (~2B cut for Gemma 2 9B and Llama 3.1 8B) and integrates cleanly into existing model layers to reduce latency (>16% time-to-next-token reduction) while encouraging response brevity (up to 8.5% fewer tokens).
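The "roughly 1% of additional parameters" figure can be put in perspective with a little arithmetic. The sketch below is not the Caprese architecture: it assumes, purely for illustration, one factorized (low-rank) linear branch per layer mapping the model dimension to itself, and every number in it (base parameter count, layer count, model dimension, rank) is a hypothetical placeholder rather than a value taken from the paper.

```python
def low_rank_branch_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameter count of a factorized map d_in -> rank -> d_out (no biases)."""
    return rank * (d_in + d_out)


def added_fraction(base_params: int, n_layers: int, d_model: int, rank: int) -> float:
    """Fraction of extra parameters if every layer gets one low-rank branch.

    Assumption (not from the source): one d_model -> d_model branch per layer.
    """
    added = n_layers * low_rank_branch_params(d_model, d_model, rank)
    return added / base_params


# Hypothetical configuration, chosen only to show the order of magnitude.
frac = added_fraction(base_params=8_000_000_000, n_layers=32, d_model=4096, rank=256)
print(f"added parameters: {frac:.2%} of the base model")
```

Under these assumed dimensions the overhead stays below one percent, which suggests how a low-rank add-on can remain small relative to the frozen base weights.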

Related papers

SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity [52.88892280536302]
We introduce SparseLoRA, a method that accelerates fine-tuning through contextual sparsity. We show that SparseLoRA reduces computational cost by up to 2.2 times and delivers a measured speedup of up to 1.6 times.
arXiv Detail & Related papers (2025-06-19T17:53:34Z)
Skipping Computations in Multimodal LLMs [63.29737699997859]
This study investigates redundancy in Multimodal Large Language Models (MLLMs) during inference. We propose different methods to skip computations, such as skipping entire blocks, FFN or self-attention layers. Our findings validate that a significant amount of computation can be avoided at inference time.
arXiv Detail & Related papers (2024-10-12T09:21:45Z)
Q-Sparse: All Large Language Models can be Fully Sparsely-Activated [93.45300714803429]
We introduce Q-Sparse, a simple yet effective approach to training sparsely-activated large language models (LLMs). Q-Sparse enables full sparsity of activations in LLMs, which can bring significant efficiency gains in inference. We also introduce Block Q-Sparse for batch training and inference.
arXiv Detail & Related papers (2024-07-15T17:59:29Z)
ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization [13.622268474310918]
ShiftAddLLM is an efficient multiplication-free model for large language models. It achieves perplexity improvements of 5.6 and 22.7 points at comparable or lower latency. Experiments on five LLM families and eight tasks consistently validate the effectiveness of ShiftAddLLM.
arXiv Detail & Related papers (2024-06-10T02:47:55Z)
VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections [35.133698935322634]
Large language models (LLMs) have recently emerged as powerful tools for tackling many language-processing tasks. We identify and characterise the important components needed for effective model convergence using gradient descent. This result leads us to a cheap and memory-efficient algorithm for both fine-tuning and pre-training LLMs.
arXiv Detail & Related papers (2024-05-28T09:23:14Z)
Characterizing the Accuracy -- Efficiency Trade-off of Low-rank Decomposition in Language Models [1.401463252785724]
Low-rank decomposition can be a promising direction for LLM-based applications that require real-time service at scale. We formalize the low-rank decomposition design space and show that the decomposition design space is enormous. Our results show that we can achieve a 9% model size reduction with minimal accuracy drops.
arXiv Detail & Related papers (2024-05-10T17:40:02Z)
Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks. We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs. We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z)
FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent, achieving remarkable success in language understanding and generation. To mitigate overload incurred during generation, several early-exit and layer-dropping strategies have been proposed. We propose FFN-SkipLLM, which is an input-adaptive feed-forward skipping strategy.
arXiv Detail & Related papers (2024-04-05T02:35:43Z)
BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models. It achieves for the first time high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models [42.95555008229016]
We propose a method based on Hessian sensitivity-aware mixed sparsity pruning to prune LLMs to at least 50% sparsity without the need for any retraining. The advantages of the proposed method are even more pronounced when the sparsity is extremely high.
arXiv Detail & Related papers (2023-10-14T05:43:09Z)

This list is automatically generated from the titles and abstracts of the papers on this site.