Related papers: You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations

You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations

URL: http://arxiv.org/abs/2511.06516v1
Date: Sun, 09 Nov 2025 19:58:24 GMT
Title: You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations
Authors: Amit LeVi, Raz Lapid, Rom Himelstein, Yaniv Nemcovsky, Ravid Shwartz Ziv, Avi Mendelson,
Abstract summary: We propose to use hidden representations that encode task-salient signals as a guideline for quantization.<n>This paper compares two new task-aware PTQ methods: Task-Aware Quantization (TAQ), which allocates bitwidths using task-conditioned statistics from hidden activations, and TAQO, which allocates precision based on direct layer sensitivity tests.<n>Across models, TAQ and TAQO outperform the baselines; TAQ leads on Phi-4, while TAQO leads on Llama-3.1, Qwen3, and Qwen2.5.
Score: 7.404908118728373
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) excel across diverse tasks, yet many applications require only limited capabilities, making large variants inefficient in memory and latency. Existing approaches often combine distillation and quantization, but most post-training quantization (PTQ) methods are task-agnostic, ignoring how task-specific signals are distributed across layers. In this work, we propose to use hidden representations that encode task-salient signals as a guideline for quantization. In order to fully utilize our innovative idea, this paper compares two new task-aware PTQ methods: Task-Aware Quantization (TAQ), which allocates bitwidths using task-conditioned statistics from hidden activations, and TAQO, which allocates precision based on direct layer sensitivity tests. From a small calibration set, these approaches identify task-relevant layers, preserving their precision while aggressively quantizing the rest. This yields stable task sensitivity profiles and efficient task-specialized models. Across models, TAQ and TAQO outperform the baselines; TAQ leads on Phi-4, while TAQO leads on Llama-3.1, Qwen3, and Qwen2.5. For instances, on Phi-4 it achieves 42.33 EM / 50.81 F1, far surpassing Activation-aware Weight Quantization (AWQ) (2.25 / 7.07), while remaining within < 1.0% of the original accuracy at lower average precision.

Related papers

What Makes Low-Bit Quantization-Aware Training Work for Reasoning LLMs? A Systematic Study [59.44848132298657]
Post-training quantization (PTQ) usually comes with the cost of large accuracy drops, especially for reasoning tasks under low-bit settings.<n>In this study, we present a systematic empirical study of quantization-aware training (QAT) for reasoning models.
arXiv Detail & Related papers (2026-01-21T11:22:29Z)
GPLQ: A General, Practical, and Lightning QAT Method for Vision Transformers [11.452135395287119]
Vision Transformers (ViTs) are essential in computer vision but are computationally intensive, too.<n>Model quantization aims to alleviate this difficulty, yet existing Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) methods exhibit significant limitations.<n>This paper introduces General, Practical, and Quantization (GPLQ), a novel framework for efficient ViT quantization.
arXiv Detail & Related papers (2025-06-13T13:45:17Z)
How Many Parameters Does Your Task Really Need? Task Specific Pruning with LLM-Sieve [2.33361323991006]
Large Language Models (LLMs) are increasingly deployed for narrow tasks in resource-constrained settings.<n>We present LLM-Sieve, a framework that prunes LLMs down to the minimal parameter subset needed to preserve task performance.
arXiv Detail & Related papers (2025-05-23T20:17:20Z)
Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression [55.323397702682506]
Post-training quantization (PTQ) reduces a model's memory footprint by mapping full precision weights into low bit weights without costly retraining.<n>We develop a new mixed-precision PTQ approach, Task-Circuit Quantization (TaCQ), that draws parallels to automated circuit discovery.
arXiv Detail & Related papers (2025-04-10T02:19:03Z)
RSQ: Learning from Important Tokens Leads to Better Quantized LLMs [65.5558181902098]
Layer-wise quantization is a key technique for efficiently compressing large models without expensive retraining.<n>We propose RSQ (Rotate, Scale, then Quantize), which applies rotations to the model to mitigate outliers.<n>We demonstrate that RSQ consistently outperforms baseline methods across multiple downstream tasks and three model families.
arXiv Detail & Related papers (2025-03-03T18:46:33Z)
Localizing Task Information for Improved Model Merging and Compression [61.16012721460561]
We show that the information required to solve each task is still preserved after merging as different tasks mostly use non-overlapping sets of weights. We propose Consensus Merging, an algorithm that eliminates such weights and improves the general performance of existing model merging approaches.
arXiv Detail & Related papers (2024-05-13T14:54:37Z)
Task Adaptive Parameter Sharing for Multi-Task Learning [114.80350786535952]
Adaptive Task Adapting Sharing (TAPS) is a method for tuning a base model to a new task by adaptively modifying a small, task-specific subset of layers. Compared to other methods, TAPS retains high accuracy on downstream tasks while introducing few task-specific parameters. We evaluate our method on a suite of fine-tuning tasks and architectures (ResNet, DenseNet, ViT) and show that it achieves state-of-the-art performance while being simple to implement.
arXiv Detail & Related papers (2022-03-30T23:16:07Z)
TOOD: Task-aligned One-stage Object Detection [41.43371563426291]
One-stage object detection is commonly implemented by optimizing two sub-tasks: object classification and localization. We propose a Task-aligned One-stage Object Detection (TOOD) that explicitly aligns the two tasks in a learning-based manner. Experiments are conducted on MS-COCO, where TOOD achieves a 51.1 AP at single-model single-scale testing.
arXiv Detail & Related papers (2021-08-17T17:00:01Z)
Meta-Generating Deep Attentive Metric for Few-shot Classification [53.07108067253006]
We present a novel deep metric meta-generation method to generate a specific metric for a new few-shot learning task. In this study, we structure the metric using a three-layer deep attentive network that is flexible enough to produce a discriminative metric for each task. We gain surprisingly obvious performance improvement over state-of-the-art competitors, especially in the challenging cases.
arXiv Detail & Related papers (2020-12-03T02:07:43Z)
Conditional Channel Gated Networks for Task-Aware Continual Learning [44.894710899300435]
Convolutional Neural Networks experience catastrophic forgetting when optimized on a sequence of learning problems. We introduce a novel framework to tackle this problem with conditional computation. We validate our proposal on four continual learning datasets.
arXiv Detail & Related papers (2020-03-31T19:35:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.