Related papers: Not All Bits Are Equal: Scale-Dependent Memory Optimization Strategies for Reasoning Models

Not All Bits Are Equal: Scale-Dependent Memory Optimization Strategies for Reasoning Models

URL: http://arxiv.org/abs/2510.10964v1
Date: Mon, 13 Oct 2025 03:14:28 GMT
Title: Not All Bits Are Equal: Scale-Dependent Memory Optimization Strategies for Reasoning Models
Authors: Junhyuck Kim, Ethan Ewer, Taehong Moon, Jongho Park, Dimitris Papailiopoulos,
Abstract summary: 4-bit quantization has emerged as a memory-optimal choice for non-reasoning models and zero-shot tasks across scales.<n>We show that this universal prescription fails for reasoning models, where the KV cache rather than model size can dominate memory.<n>We find a scale-dependent trade-off: models with an effective size below 8-bit 4B parameters achieve better accuracy by allocating memory to more weights rather than longer generation.
Score: 10.604862875916103
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While 4-bit quantization has emerged as a memory-optimal choice for non-reasoning models and zero-shot tasks across scales, we show that this universal prescription fails for reasoning models, where the KV cache rather than model size can dominate memory. Through systematic experiments across 1,700 inference scenarios on AIME25 and GPQA-Diamond, we find a scale-dependent trade-off: models with an effective size below 8-bit 4B parameters achieve better accuracy by allocating memory to more weights rather than longer generation, while larger models achieve better accuracy by allocating memory to longer generations. This scale threshold also determines when parallel scaling becomes memory-efficient and whether KV cache eviction outperforms KV quantization. Our findings show that memory optimization for LLMs cannot be scale-agnostic, while providing principled guidelines: for small reasoning models, prioritize model capacity over test-time compute, while for larger ones, maximize test-time compute. Our results suggest that optimizing reasoning models for deployment requires fundamentally different strategies from those established for non-reasoning models.

Related papers

Compute-Optimal Scaling for Value-Based Deep RL [99.680827753493]
We investigate compute scaling for online, value-based deep RL.<n>Our analysis reveals a nuanced interplay between model size, batch size, and UTD.<n>We provide a mental model for understanding this phenomenon and build guidelines for choosing batch size and UTD.
arXiv Detail & Related papers (2025-08-20T17:54:21Z)
Low-Resolution Neural Networks [0.552480439325792]
This study examines the impact of parameter bit precision on model performance compared to standard 32-bit models.<n>Models analyzed include those with fully connected layers, convolutional layers, and transformer blocks.<n>Findings suggest a potential new era for optimized neural network models with reduced memory requirements and improved computational efficiency.
arXiv Detail & Related papers (2025-02-12T21:19:28Z)
Memory Layers at Scale [67.00854080570979]
This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale.<n>On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the budget, as well as mixture-of-expert models when matched for both compute and parameters.<n>We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained to 1 trillion tokens, comparing to base models with up to 8B parameters.
arXiv Detail & Related papers (2024-12-12T23:56:57Z)
MOFHEI: Model Optimizing Framework for Fast and Efficient Homomorphically Encrypted Neural Network Inference [0.8388591755871735]
Homomorphic Encryption (HE) enables us to perform machine learning tasks over encrypted data.<n>We propose MOFHEI, a framework that optimize the model to make HE-based neural network inference, fast and efficient.<n>Our framework achieves up to 98% pruning ratio on LeNet, eliminating up to 93% of the required HE operations for performing PI.
arXiv Detail & Related papers (2024-12-10T22:44:54Z)
Quantifying the Capabilities of LLMs across Scale and Precision [12.879551933541345]
This study investigates the effect of model scale and quantization on the performance of instruct models. We show that larger models show exceptional resilience to precision reduction and can maintain high accuracy even at 4-bit quantization.
arXiv Detail & Related papers (2024-05-06T03:42:34Z)
Pruning Large Language Models via Accuracy Predictor [0.0]
Large language models (LLMs) containing tens of billions of parameters (or even more) have demonstrated impressive capabilities in various NLP tasks. We propose a novel pruning approach: firstly, a training set of a certain number of architecture-accuracy pairs is established, and then a non-neural model is trained as an accuracy predictor.
arXiv Detail & Related papers (2023-09-18T06:38:24Z)
Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix production with reduced variance. Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
QLoRA: Efficient Finetuning of Quantized LLMs [66.58009990713134]
We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU. QLoRA backpropagates through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters(LoRA) Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark.
arXiv Detail & Related papers (2023-05-23T17:50:33Z)
The case for 4-bit precision: k-bit Inference Scaling Laws [75.4335600212427]
Quantization methods reduce the number of bits required to represent each parameter in a model. The final model size depends on both the number of parameters of the original model and the rate of compression. We run more than 35,000 zero-shot experiments with 16-bit inputs and k-bit parameters to examine which quantization methods improve scaling for 3 to 8-bit precision.
arXiv Detail & Related papers (2022-12-19T18:48:33Z)
A Model or 603 Exemplars: Towards Memory-Efficient Class-Incremental Learning [56.450090618578]
Class-Incremental Learning (CIL) aims to train a model with limited memory size to meet this requirement. We show that when counting the model size into the total budget and comparing methods with aligned memory size, saving models do not consistently work. We propose a simple yet effective baseline, denoted as MEMO for Memory-efficient Expandable MOdel.
arXiv Detail & Related papers (2022-05-26T08:24:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.