Related papers: TWEO: Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies

URL: http://arxiv.org/abs/2511.23225v1
Date: Fri, 28 Nov 2025 14:33:21 GMT
Title: TWEO: Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies
Authors: Guang Liang, Jie Shao, Ningyuan Tang, Xinyao Liu, Jianxin Wu
Abstract summary: This paper challenges the conventional wisdom that outliers are data-driven. We propose TWEO (Transformers Without Extreme Outliers), a novel, non-invasive loss function. TWEO effectively prevents extreme outliers via a very simple loss term, which reduces outliers from 10000+ to less than 20.
Score: 15.045348948724884
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Native FP8 support in modern hardware is essential for training large Transformers, but is severely hindered by extreme activation outliers. Existing solutions either rely on complex mixed-precision engineering or invasive architectural modifications. This paper fundamentally challenges the conventional wisdom that outliers are data-driven. We demonstrate that extreme outliers are a data-independent, mechanically-produced artifact of training, originating from specific structural properties of the weight matrices (i.e., colinearity). Based on this insight, we propose TWEO (Transformers Without Extreme Outliers), a novel, non-invasive loss function. TWEO effectively prevents extreme outliers via a very simple loss term, which reduces outliers from 10000+ to less than 20. TWEO then enables full-model FP8 pre-training with neither engineering tricks nor architectural changes for both LLM and ViT. When standard FP8 training catastrophically collapses, TWEO achieves performance comparable to the BF16 baseline while delivering a 36% increase in training throughput. Also, TWEO enables a new quantization paradigm. Hardware-friendly W8A8 per-tensor static quantization of LLMs, previously considered completely unusable due to outliers, achieves SOTA performance for the first time on TWEO-trained models.
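To illustrate why extreme outliers are a problem for per-tensor quantization, here is a minimal sketch in plain Python. It is not the TWEO method and not code from the paper. It assumes symmetric INT8 quantization with a single scale derived from the tensor's largest magnitude (standing in for a calibrated static scale), and it uses synthetic Gaussian activations. The magnitudes 10000 and 20 are borrowed from the abstract's "10000+" and "less than 20" figures purely as illustrative inputs, and all function names are hypothetical.

```python
import random


def quantize_per_tensor_int8(values):
    """Symmetric per-tensor INT8: one scale shared by every element."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map INT8 codes back to real values."""
    return [x * scale for x in q]


def mean_abs_error(values):
    """Average round-trip error of quantize followed by dequantize."""
    q, scale = quantize_per_tensor_int8(values)
    restored = dequantize(q, scale)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


random.seed(0)
# Synthetic "typical" activations (assumption: unit Gaussian).
typical = [random.gauss(0.0, 1.0) for _ in range(1024)]

# One extreme outlier stretches the shared scale across the whole tensor.
with_outlier = typical + [10000.0]
# The same tensor when the largest magnitude is only 20.
with_small_max = typical + [20.0]

err_outlier = mean_abs_error(with_outlier)
err_small = mean_abs_error(with_small_max)

print(f"mean abs error, max magnitude 10000: {err_outlier:.4f}")
print(f"mean abs error, max magnitude 20:    {err_small:.4f}")
```

With the 10000 outlier present, the quantization step is about 78.7, so every typical activation rounds to code 0 and its information is lost. With a maximum of 20 the step is about 0.16, and the typical values survive the round trip with a much smaller error.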

Related papers

MOSS: Efficient and Accurate FP8 LLM Training with Microscaling and Automatic Scaling [29.545879706181974]
Current frameworks preserve training performance using mixed-granularity quantization, i.e., applying per-group quantization for activations and per-tensor/block quantization for weights. We propose MOSS, a novel FP8 training framework that ensures both efficiency and numerical stability.
arXiv Detail & Related papers (2025-11-08T02:51:26Z)
PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. We propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
arXiv Detail & Related papers (2025-09-27T03:01:48Z)
Born a Transformer -- Always a Transformer? On the Effect of Pretraining on Architectural Abilities [58.742178800799614]
We study a family of $\textit{retrieval}$ and $\textit{copying}$ tasks inspired by Liu et al. We observe an $\textit{induction-versus-anti-induction}$ asymmetry, where pretrained models are better at retrieving tokens to the right (induction) than the left (anti-induction) of a query token. Mechanistic analysis reveals that this asymmetry is connected to the differences in the strength of induction versus anti-induction circuits within pretrained transformers.
arXiv Detail & Related papers (2025-05-27T21:36:50Z)
Towards Fully FP8 GEMM LLM Training at Scale [77.97607456493257]
Existing approaches often rely on suboptimal fine-grained FP8 kernels or fall back to higher-precision matrix multiplications. We introduce a new class of LLM architectures that, for the first time, support FP8 computation for all GEMMs within transformer blocks during both forward and backward passes. This enables unprecedented throughput gains, particularly at scale, while matching the downstream performance of standard BF16 training.
arXiv Detail & Related papers (2025-05-26T21:04:14Z)
Accurate INT8 Training Through Dynamic Block-Level Fallback [21.808835887740543]
Transformer models have achieved remarkable success across various AI applications but face significant training costs. We propose Fallback Quantization, implementing mixed-precision GEMM that dynamically falls back from 8-bit to 16-bit for activation blocks containing outliers. Experiments show that our approach is robustly competent in both fine-tuning and pretraining settings.
arXiv Detail & Related papers (2025-03-11T04:45:48Z)
Optimizing Large Language Model Training Using FP4 Quantization [73.55459961002371]
Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce costs. This work introduces the first FP4 training framework for large language models (LLMs).
arXiv Detail & Related papers (2025-01-28T18:04:50Z)
Scaling FP8 training to trillion-token LLMs [26.195547788434908]
We train large language models using FP8 precision on datasets up to 2 trillion tokens. We uncover critical instabilities in FP8 training that were not observable in earlier works with shorter durations. We introduce Smooth-SwiGLU, a novel modification that ensures stable FP8 training without altering function.
arXiv Detail & Related papers (2024-09-19T07:15:58Z)
To FP8 and Back Again: Quantifying Reduced Precision Effects on LLM Training Stability [7.115739465137031]
BrainFloat16 (BF16) precision has become the de facto standard for large language model pretraining. However, prior experience with FP16, which was found to be less stable than BF16, raises concerns as to whether FP8 can be a cost-effective option for LLM training. We propose new evaluation techniques and a new metric for quantifying loss landscape sharpness in autoregressive language models.
arXiv Detail & Related papers (2024-05-29T02:42:23Z)
How Do Nonlinear Transformers Learn and Generalize in In-Context Learning? [82.51626700527837]
Transformer-based large language models have displayed impressive in-context learning (ICL) capabilities, where a pre-trained model can handle new tasks without fine-tuning. We analyze how the mechanics by which Transformers achieve ICL contribute to the technical challenges of the training problems in Transformers.
arXiv Detail & Related papers (2024-02-23T21:07:20Z)
Understanding the Difficulty of Training Transformers [120.99980924577787]
We show that unbalanced gradients are not the root cause of the instability of training. We propose Admin to stabilize the early stage's training and unleash its full potential in the late stage.
arXiv Detail & Related papers (2020-04-17T13:59:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.