Dispersion Loss Counteracts Embedding Condensation and Improves Generalization in Small Language Models
- URL: http://arxiv.org/abs/2602.00217v1
- Date: Fri, 30 Jan 2026 16:07:03 GMT
- Title: Dispersion Loss Counteracts Embedding Condensation and Improves Generalization in Small Language Models
- Authors: Chen Liu, Xingzhi Sun, Xi Xiao, Alexandre Van Tassel, Ke Xu, Kristof Reimann, Danqi Liao, Mark Gerstein, Tianyang Wang, Xiao Wang, Smita Krishnaswamy
- Abstract summary: Large language models (LLMs) achieve remarkable performance through ever-increasing parameter counts, but scaling incurs steep computational costs. We study representational differences between LLMs and their smaller counterparts, with the goal of replicating the representational qualities of larger models in the smaller models. We show that small models such as $\texttt{GPT2}$ and $\texttt{Qwen3-0.6B}$ exhibit severe condensation, whereas larger models such as $\texttt{GPT2-xl}$ and $\texttt{Qwen3-32B}$ are more resistant to this phenomenon.
- Score: 55.908141398092646
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) achieve remarkable performance through ever-increasing parameter counts, but scaling incurs steep computational costs. To better understand LLM scaling, we study representational differences between LLMs and their smaller counterparts, with the goal of replicating the representational qualities of larger models in the smaller models. We observe a geometric phenomenon which we term $\textbf{embedding condensation}$, where token embeddings collapse into a narrow cone-like subspace in some language models. Through systematic analyses across multiple Transformer families, we show that small models such as $\texttt{GPT2}$ and $\texttt{Qwen3-0.6B}$ exhibit severe condensation, whereas the larger models such as $\texttt{GPT2-xl}$ and $\texttt{Qwen3-32B}$ are more resistant to this phenomenon. Additional observations show that embedding condensation is not reliably mitigated by knowledge distillation from larger models. To fight against it, we formulate a dispersion loss that explicitly encourages embedding dispersion during training. Experiments demonstrate that it mitigates condensation, recovers dispersion patterns seen in larger models, and yields performance gains across 10 benchmarks. We believe this work offers a principled path toward improving smaller Transformers without additional parameters.
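The abstract does not give the exact formulation of the dispersion loss, so the sketch below is only a minimal, illustrative PyTorch version: a uniformity-style penalty that pushes normalized token embeddings apart on the unit sphere, paired with a simple condensation metric (mean pairwise cosine similarity). The function names, the temperature `tau`, the sampling of a token subset, and the weight `lambda_disp` are assumptions for illustration, not the authors' definitions.

```python
import torch
import torch.nn.functional as F


def condensation_score(embeddings: torch.Tensor, sample_size: int = 2048) -> torch.Tensor:
    """Mean pairwise cosine similarity over a random subset of token embeddings.

    Values near 1 indicate a narrow cone (severe condensation); values near 0
    indicate well-dispersed embeddings.
    """
    idx = torch.randperm(embeddings.size(0))[:sample_size]
    e = F.normalize(embeddings[idx], dim=-1)
    sim = e @ e.T
    # Exclude the diagonal (self-similarity is always 1).
    n = sim.size(0)
    off_diag = sim - torch.eye(n, device=sim.device)
    return off_diag.sum() / (n * (n - 1))


def dispersion_loss(embeddings: torch.Tensor, sample_size: int = 2048, tau: float = 0.5) -> torch.Tensor:
    """Uniformity-style dispersion penalty (an assumed form, not the paper's).

    Log of the mean Gaussian-potential kernel over distinct pairs of sampled,
    normalized embeddings; minimizing it spreads embeddings over the sphere.
    """
    idx = torch.randperm(embeddings.size(0))[:sample_size]
    e = F.normalize(embeddings[idx], dim=-1)
    sq_dists = torch.cdist(e, e).pow(2)          # pairwise squared Euclidean distances
    n = e.size(0)
    mask = ~torch.eye(n, dtype=torch.bool, device=e.device)
    return torch.log(torch.exp(-sq_dists[mask] / tau).mean())


# Hypothetical usage inside a language-model training step (lambda_disp is an
# assumed hyperparameter, not taken from the paper):
# total_loss = lm_loss + lambda_disp * dispersion_loss(model.get_input_embeddings().weight)
```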
Related papers
- Distributional Clarity: The Hidden Driver of RL-Friendliness in Large Language Models [50.99097734404912]
We show that RL-friendly models exhibit intra-class compactness and inter-class separation in their probability assignments to correct vs. incorrect responses. Experiments across six mathematical benchmarks show consistent improvements across all model families, with gains of up to 5.9 points on AIME24.
arXiv Detail & Related papers (2026-01-11T13:34:44Z) - Double Descent as a Lens for Sample Efficiency in Autoregressive vs. Discrete Diffusion Models [0.0]
In this work, we use the double descent phenomenon to holistically compare the sample efficiency of discrete diffusion and autoregressive models. Our results indicate that autoregressive models are more sample-efficient on small-scale datasets, while discrete diffusion models only become competitive when given sufficient capacity and compute.
arXiv Detail & Related papers (2025-09-29T16:03:12Z) - LLM Probability Concentration: How Alignment Shrinks the Generative Horizon [13.184240238106016]
We show that alignment tuning substantially sharpens the model's output distribution from the outset. Building on this insight, we find this consistency has surprising implications for complex reasoning.
arXiv Detail & Related papers (2025-06-22T02:00:37Z) - A Convergence Theory for Diffusion Language Models: An Information-Theoretic Perspective [8.15094483029656]
Diffusion models enable parallel token sampling, leading to faster generation and eliminating left-to-right generation constraints. We develop convergence guarantees for diffusion language models from an information-theoretic perspective. These results offer novel theoretical insights into the practical effectiveness of diffusion language models.
arXiv Detail & Related papers (2025-05-27T16:24:20Z) - Why Do More Experts Fail? A Theoretical Analysis of Model Merging [51.18155031364046]
Model merging dramatically reduces storage and computational resources by combining multiple expert models into a single multi-task model. Recent model merging methods have shown promising results, but struggle to maintain performance gains as the number of merged models increases. We show that the limited effective parameter space imposes a strict constraint on the number of models that can be successfully merged.
arXiv Detail & Related papers (2025-05-27T14:10:46Z) - LOTOS: Layer-wise Orthogonalization for Training Robust Ensembles [13.776549741449557]
We study the effect of Lipschitz continuity on transferability rates.
We introduce LOTOS, a new training paradigm for ensembles, which counteracts this adverse effect.
arXiv Detail & Related papers (2024-10-07T15:43:28Z) - Unleashing the Power of One-Step Diffusion based Image Super-Resolution via a Large-Scale Diffusion Discriminator [81.81748032199813]
Diffusion models have demonstrated excellent performance for real-world image super-resolution (Real-ISR). We propose a new One-Step Diffusion model with a larger-scale Discriminator for SR. Our discriminator is able to distill noisy features from any time step of diffusion models in the latent space.
arXiv Detail & Related papers (2024-10-05T16:41:36Z) - Reducing Spatial Fitting Error in Distillation of Denoising Diffusion Models [13.364271265023953]
Knowledge distillation for diffusion models is an effective method to address this limitation with a shortened sampling process.
We attribute the degradation to the spatial fitting error occurring in the training of both the teacher and student model.
SFERD utilizes attention guidance from the teacher model and a designed semantic gradient predictor to reduce the student's fitting error.
We achieve an FID of 5.31 on CIFAR-10 and 9.39 on ImageNet 64$\times$64 with only one step, outperforming existing diffusion methods.
arXiv Detail & Related papers (2023-11-07T09:19:28Z) - Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution [67.9215891673174]
We propose score entropy as a novel loss that naturally extends score matching to discrete spaces.
We test our Score Entropy Discrete Diffusion models on standard language modeling tasks.
arXiv Detail & Related papers (2023-10-25T17:59:12Z) - Soft Mixture Denoising: Beyond the Expressive Bottleneck of Diffusion Models [76.46246743508651]
We show that current diffusion models actually have an expressive bottleneck in backward denoising.
We introduce soft mixture denoising (SMD), an expressive and efficient model for backward denoising.
arXiv Detail & Related papers (2023-09-25T12:03:32Z)