Related papers: Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2

URL: http://arxiv.org/abs/2512.22671v1
Date: Sat, 27 Dec 2025 18:09:57 GMT
Title: Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2
Authors: Pere Martra
Abstract summary: We show that MAW-guided width pruning acts as a selective filter, reducing parametric knowledge while preserving or enhancing behavioral alignment. We quantify context-dependent efficiency trade-offs: pruned configurations achieve up to 23% reduction in energy consumption (J/token) but incur penalties in single-request latency.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Structured width pruning of GLU-MLP layers, guided by the Maximum Absolute Weight (MAW) criterion, reveals a systematic dichotomy in how reducing the expansion ratio affects different model capabilities. While performance on tasks relying on parametric knowledge (e.g., MMLU, GSM8K) and perplexity metrics degrades predictably, instruction-following capabilities improve substantially (+46% to +75% in IFEval for Llama-3.2-1B and 3B models), and multi-step reasoning remains robust (MUSR). This pattern challenges the prevailing assumption that pruning induces uniform degradation. We evaluated seven expansion ratio configurations using comprehensive benchmarks assessing factual knowledge, mathematical reasoning, language comprehension, instruction-following, and truthfulness. Our analysis identifies the expansion ratio as a critical architectural parameter that selectively modulates cognitive capabilities, rather than merely serving as a compression metric. We provide the first systematic characterization of this selective preservation phenomenon. Notably, we document a robust inverse correlation (r = -0.864, p = 0.012 in Llama-3B) between factual knowledge capacity (MMLU) and truthfulness metrics (TruthfulQA-MC2): as knowledge degrades, the model's ability to discriminate misconceptions improves consistently. This connects two previously distinct research areas, demonstrating that MAW-guided width pruning acts as a selective filter, reducing parametric knowledge while preserving or enhancing behavioral alignment. Additionally, we quantify context-dependent efficiency trade-offs: pruned configurations achieve up to 23% reduction in energy consumption (J/token) but incur penalties in single-request latency, whereas batch processing workloads benefit uniformly.
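
To make the idea of width pruning in a GLU-MLP block concrete, here is a minimal sketch, not the paper's implementation. It assumes that each intermediate neuron is scored by the largest absolute weight found in its gate-projection and up-projection rows, and that the expansion ratio is the intermediate width divided by the hidden width; the abstract does not spell out either detail. The function names (`maw_scores`, `prune_glu_mlp`) and the weight layout are hypothetical and chosen for illustration only.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def maw_scores(gate: Matrix, up: Matrix) -> List[float]:
    """Score each intermediate neuron by its largest absolute weight.

    Assumption: the score looks at the neuron's rows in both the gate and
    up projections. The exact MAW formula is not given in the abstract.
    """
    return [
        max(max(abs(w) for w in g_row), max(abs(w) for w in u_row))
        for g_row, u_row in zip(gate, up)
    ]


def prune_glu_mlp(
    gate: Matrix, up: Matrix, down: Matrix, expansion_ratio: float
) -> Tuple[Matrix, Matrix, Matrix, List[int]]:
    """Keep the highest-scoring neurons for a target expansion ratio.

    Assumed layout: gate and up are (intermediate x hidden) and down is
    (hidden x intermediate), so neuron i is row i of gate/up and
    column i of down.
    """
    hidden = len(gate[0])
    intermediate = len(gate)
    # Target width, clamped to the range [1, intermediate].
    keep_n = min(intermediate, max(1, int(round(expansion_ratio * hidden))))

    scores = maw_scores(gate, up)
    ranked = sorted(range(intermediate), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:keep_n])  # restore original neuron order

    new_gate = [gate[i] for i in keep]
    new_up = [up[i] for i in keep]
    new_down = [[row[i] for i in keep] for row in down]
    return new_gate, new_up, new_down, keep
```

Structured pruning of this kind removes whole neurons, so the three matrices stay dense and simply shrink. For a toy layer with hidden width 2 and intermediate width 4, calling `prune_glu_mlp(gate, up, down, 1.0)` would keep two neurons and drop the matching columns of `down`.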

Related papers

ECO: Quantized Training without Full-Precision Master Weights [58.97082407934466]
Error-Compensating (ECO) eliminates master weights by applying updates directly to quantized parameters. We show that ECO converges to a constant-radius neighborhood of the optimum, while naive master-weight removal can incur an error that is inversely proportional to the learning rate.
arXiv Detail & Related papers (2026-01-29T18:35:01Z)
Representation-Aware Unlearning via Activation Signatures: From Suppression to Knowledge-Signature Erasure [2.0017902634527194]
We introduce Knowledge Immunization Framework (KIF), a representation-aware architecture that distinguishes genuine erasure from obfuscation. Our approach combines dynamic suppression of subject-specific representations with parameter-efficient adaptation, enabling durable unlearning without full model retraining.
arXiv Detail & Related papers (2026-01-15T16:28:14Z)
LLMs can Compress LLMs: Adaptive Pruning by Agents [0.0]
Post-training pruning has emerged as a promising approach to reduce computational costs while preserving performance. We introduce agent-guided pruning, where a foundation model acts as an adaptive pruning agent. We evaluate our approach on Q3 models (4B and 8B parameters) at approximately 45% sparsity, demonstrating substantial improvements over structured pruning baselines.
arXiv Detail & Related papers (2026-01-14T18:45:36Z)
Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models [31.422773877490613]
Reasoning LLMs (RLMs) deliver strong multi-step reasoning through chain-of-thought generation. RLMs' large model sizes and lengthy decode-time outputs make them costly to deploy and unsuitable for resource-constrained settings. We introduce RESP, a structured pruning framework that aligns pruning decisions with the model's reasoning dynamics.
arXiv Detail & Related papers (2025-12-01T20:27:05Z)
ExplicitLM: Decoupling Knowledge from Parameters via Explicit Memory Banks [4.099810580680816]
Large language models suffer from knowledge staleness and lack of interpretability due to implicit knowledge storage. We propose ExplicitLM, a novel architecture featuring a million-scale external memory bank storing human-readable knowledge as token sequences.
arXiv Detail & Related papers (2025-11-03T13:53:19Z)
Capability Ceilings in Autoregressive Language Models: Empirical Evidence from Knowledge-Intensive Tasks [0.2538209532048866]
We document capability ceilings in decoder-only autoregressive language models across knowledge-intensive tasks. We quantify capability-specific scaling failures in OPT and Pythia model families to inform resource allocation decisions.
arXiv Detail & Related papers (2025-10-23T11:09:31Z)
The Dual-use Dilemma in LLMs: Do Empowering Ethical Capacities Make a Degraded Utility? [54.18519360412294]
Large Language Models (LLMs) must balance between rejecting harmful requests for safety and accommodating legitimate ones for utility. This paper presents a Direct Preference Optimization (DPO) based alignment framework that achieves better overall performance. We analyze experimental results obtained from testing DeepSeek-R1 on our benchmark and reveal the critical ethical concerns raised by this highly acclaimed model.
arXiv Detail & Related papers (2025-01-20T06:35:01Z)
ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by Kronecker product to Aggregate Low Rank Experts. Thanks to the artful design, ALoRE maintains negligible extra parameters and can be effortlessly merged into the frozen backbone.
arXiv Detail & Related papers (2024-12-11T12:31:30Z)
LLM-BIP: Structured Pruning for Large Language Models with Block-Wise Forward Importance Propagation [0.0]
We propose a more accurate pruning metric based on block-wise importance score propagation. We evaluate the proposed method using LLaMA-7B, Vicuna-7B, and LLaMA-13B across common zero-shot tasks.
arXiv Detail & Related papers (2024-12-09T11:57:16Z)
A deeper look at depth pruning of LLMs [49.30061112976263]
Large Language Models (LLMs) are resource-intensive to train but more costly to deploy in production. Recent work has attempted to prune blocks of LLMs based on cheap proxies for estimating block importance. We show that adaptive metrics exhibit a trade-off in performance between tasks.
arXiv Detail & Related papers (2024-07-23T08:40:27Z)
Quantifying Semantic Emergence in Language Models [31.608080868988825]
Large language models (LLMs) are widely recognized for their exceptional capacity to capture semantic meaning. In this work, we introduce a quantitative metric, Information Emergence (IE), designed to measure LLMs' ability to extract semantics from input tokens.
arXiv Detail & Related papers (2024-05-21T09:12:20Z)
Mutual Wasserstein Discrepancy Minimization for Sequential Recommendation [82.0801585843835]
We propose a novel self-supervised learning framework for sequential recommendation based on Mutual Wasserstein discrepancy minimization (MStein). We also propose a novel contrastive learning loss based on Wasserstein Discrepancy Measurement.
arXiv Detail & Related papers (2023-01-28T13:38:48Z)
Monotonicity and Double Descent in Uncertainty Estimation with Gaussian Processes [52.92110730286403]
It is commonly believed that the marginal likelihood should be reminiscent of cross-validation metrics and that both should deteriorate with larger input dimensions. We prove that by tuning hyperparameters, the performance, as measured by the marginal likelihood, improves monotonically with the input dimension. We also prove that cross-validation metrics exhibit qualitatively different behavior that is characteristic of double descent.
arXiv Detail & Related papers (2022-10-14T08:09:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.