MoR: Mixture Of Representations For Mixed-Precision Training
- URL: http://arxiv.org/abs/2512.22804v1
- Date: Sun, 28 Dec 2025 06:28:50 GMT
- Title: MoR: Mixture Of Representations For Mixed-Precision Training
- Authors: Bor-Yiing Su, Peter Dykas, Mike Chrzanowski, Jatin Chhugani,
- Abstract summary: Mixture-of-Representations (MoR) is a novel, per-tensor and sub-tensor level quantization framework. MoR dynamically analyzes a tensor's numerical properties to select between a variety of different representations. Our initial findings show that this approach can achieve state-of-the-art results with 98.38% of tensors quantized to the FP8 format.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixed-precision training is a crucial technique for scaling deep learning models, but successful mixed-precision training requires identifying and applying the right combination of training methods. This paper presents our preliminary study on Mixture-of-Representations (MoR), a novel per-tensor and sub-tensor level quantization framework that dynamically analyzes a tensor's numerical properties to select between a variety of different representations. Based on this framework, we have proposed and experimented with concrete algorithms that choose dynamically between FP8 and BF16 representations at both per-tensor and sub-tensor granularities. Our universal approach is designed to preserve model quality across various quantization partition strategies and datasets. Our initial findings show that this approach can achieve state-of-the-art results with 98.38% of tensors quantized to the FP8 format. This work highlights the potential of dynamic, property-aware quantization for preserving model quality. We believe this approach can generally improve the robustness of low-precision training, as demonstrated by achieving FP8 accuracies on par with existing approaches without the need for fine-grained partitioning, and it can be combined with other training methods to better leverage even lower-precision number formats such as NVFP4.
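The property-aware format selection described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the FP8 E4M3 bounds, the dynamic-range criterion, and the `choose_representation` helper are all assumptions for illustration.

```python
import numpy as np

# FP8 E4M3 representable magnitude range (max normal = 448, smallest
# subnormal = 2**-9). Using these exact bounds is an assumption here.
FP8_E4M3_MAX = 448.0
FP8_E4M3_MIN_SUBNORMAL = 2.0 ** -9

def choose_representation(tensor: np.ndarray) -> str:
    """Pick FP8 or BF16 for a tensor based on its numerical properties.

    A hypothetical selection rule in the spirit of MoR: after per-tensor
    scaling to the FP8 range, if the tensor's smallest nonzero magnitude
    would underflow FP8's subnormal range, fall back to BF16.
    """
    absmax = np.abs(tensor).max()
    if absmax == 0.0:
        return "fp8"  # an all-zero tensor quantizes losslessly
    scale = FP8_E4M3_MAX / absmax
    # smallest nonzero magnitude after applying the per-tensor scale
    nonzero = np.abs(tensor[tensor != 0.0])
    smallest_scaled = nonzero.min() * scale
    # if it underflows FP8's subnormals, the dynamic range is too wide
    # to cover with a single FP8 scale
    return "fp8" if smallest_scaled >= FP8_E4M3_MIN_SUBNORMAL else "bf16"

narrow = np.linspace(-3.0, 3.0, 1024)    # modest dynamic range
wide = np.array([1e9, 1.0, -2.0, 1e-9])  # extreme outliers at both ends
print(choose_representation(narrow))  # "fp8"
print(choose_representation(wide))    # "bf16"
```

The sub-tensor variant mentioned in the abstract would apply the same test per partition rather than per tensor.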
Related papers
- Robust Variational Model Based Tailored UNet: Leveraging Edge Detector and Mean Curvature for Improved Image Segmentation [7.638424494500011]
This paper presents a robust version of Variational Model Based UNet (VM_TUNet). VM_TUNet is a hybrid framework that integrates variational methods with deep learning. Experiments on three benchmark datasets indicate that the proposed method achieves a balanced trade-off between performance and computational efficiency.
arXiv Detail & Related papers (2025-12-08T14:33:52Z) - Mixed-Precision Quantization for Language Models: Techniques and Prospects [10.345914140081925]
Quantization has emerged as an essential compression technique to reduce model size, alleviate memory bottlenecks, and accelerate inference. Mixed-precision quantization offers a promising alternative by selectively allocating precision across layers or within tensors to balance efficiency and accuracy.
arXiv Detail & Related papers (2025-10-19T12:16:40Z) - Pretraining Large Language Models with NVFP4 [53.235038214986865]
We introduce a novel approach for stable and accurate training of large language models (LLMs) using the NVFP4 format. Our method integrates a two-dimensional quantization scheme for consistent representations across both the forward and backward passes. Our results show that a model trained with our NVFP4-based pretraining technique achieves training loss and downstream task accuracies comparable to an FP8 baseline.
arXiv Detail & Related papers (2025-09-29T17:53:17Z) - Elucidating the Design Space of FP4 training [6.963061311516306]
This paper aims to provide a unified view of the design space of FP4 training. We introduce a comprehensive, gradient-based framework for microscaling quantization. By systematically evaluating thousands of combinations of techniques, we identify which configurations offer the most favourable performance-to-overhead trade-off.
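Microscaling quantization, the target of the framework above, shares one scale per small block of elements rather than per tensor. The sketch below illustrates the block-scaling idea only; the block size, the FP4 E2M1 maximum of 6.0, and the clip-in-place-of-true-FP4-rounding are assumptions, and `microscale_quantize` is a hypothetical helper, not this paper's implementation.

```python
import numpy as np

def microscale_quantize(x: np.ndarray, block: int = 32, qmax: float = 6.0):
    """Block-wise (microscaling) quantization sketch.

    Each block of `block` consecutive elements shares one scale chosen so
    the block's absmax maps to qmax (6.0 = FP4 E2M1 max normal). Rounding
    to the real FP4 grid is replaced by a clip for simplicity.
    """
    x = x.reshape(-1, block)
    scales = np.abs(x).max(axis=1, keepdims=True) / qmax
    scales[scales == 0.0] = 1.0          # avoid dividing by zero blocks
    q = np.clip(x / scales, -qmax, qmax)  # values now fit FP4's range
    return q, scales

x = np.arange(64, dtype=np.float64)   # two blocks of 32 elements
q, s = microscale_quantize(x)
print(s.ravel())  # per-block scales: [31/6, 63/6]
```

Because each block gets its own scale, an outlier only degrades the precision of its own block rather than the whole tensor, which is the usual motivation for microscaling formats.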
arXiv Detail & Related papers (2025-09-22T13:50:40Z) - Transferable Post-training via Inverse Value Learning [83.75002867411263]
We propose modeling changes at the logits level during post-training using a separate neural network (i.e., the value network). After training this network on a small base model using demonstrations, it can be seamlessly integrated with other pre-trained models during inference. We demonstrate that the resulting value network has broad transferability across pre-trained models of different parameter sizes.
arXiv Detail & Related papers (2024-10-28T13:48:43Z) - DaWin: Training-free Dynamic Weight Interpolation for Robust Adaptation [57.11544252399801]
We propose DaWin, a training-free dynamic weight interpolation method that leverages the entropy of individual models over each unlabeled test sample. We show that DaWin achieves significant performance gains in the considered settings, with minimal computational overhead.
arXiv Detail & Related papers (2024-10-03T16:25:35Z) - High-Performance Few-Shot Segmentation with Foundation Models: An Empirical Study [64.06777376676513]
We develop a few-shot segmentation (FSS) framework based on foundation models.
To be specific, we propose a simple approach to extract implicit knowledge from foundation models to construct coarse correspondence.
Experiments on two widely used datasets demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2024-09-10T08:04:11Z) - Training and inference of large language models using 8-bit floating point [3.689110902209004]
This paper presents a methodology to select the scalings for FP8 linear layers, based on dynamically updating per-tensor scales for the weights, gradients and activations.
We apply this methodology to train and validate large language models of the type of GPT and Llama 2 using FP8, for model sizes ranging from 111M to 70B.
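The dynamically updated per-tensor scales this paper describes can be sketched as a rolling absmax history per tensor. This is an illustrative approximation of delayed scaling, not the paper's method: the history length, the max-over-history rule, and the `DynamicTensorScale` class are assumptions.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # assumed max representable magnitude of FP8 E4M3

class DynamicTensorScale:
    """Per-tensor FP8 scale updated from a rolling history of absmax values.

    A sketch of delayed scaling: the scale for the current step is derived
    from the largest absmax seen over the last `history_len` steps, so a
    transient spike keeps the scale conservative for a while.
    """

    def __init__(self, history_len: int = 16):
        self.history = []
        self.history_len = history_len

    def update(self, tensor: np.ndarray) -> float:
        """Record this step's absmax and return the scale to use."""
        self.history.append(float(np.abs(tensor).max()))
        self.history = self.history[-self.history_len:]
        amax = max(self.history)
        # scale maps the largest recently seen value onto FP8's max
        return FP8_E4M3_MAX / amax if amax > 0.0 else 1.0

scaler = DynamicTensorScale()
print(scaler.update(np.array([0.5, -2.0, 1.0])))  # 448.0 / 2.0 = 224.0
```

Separate `DynamicTensorScale` instances would be kept for weights, gradients, and activations, since their ranges evolve differently during training.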
arXiv Detail & Related papers (2023-09-29T13:24:33Z) - Kernel Density Matrices for Probabilistic Deep Learning [8.486487001779416]
In quantum mechanics, a density matrix is the most general way to describe the state of a quantum system.
This paper introduces a novel approach to probabilistic deep learning, kernel density matrices.
It provides a simpler yet effective mechanism for representing joint probability distributions of both continuous and discrete random variables.
arXiv Detail & Related papers (2023-05-26T12:59:58Z) - Modular Quantization-Aware Training for 6D Object Pose Estimation [52.9436648014338]
Edge applications demand efficient 6D object pose estimation on resource-constrained embedded platforms.
We introduce Modular Quantization-Aware Training (MQAT), an adaptive and mixed-precision quantization-aware training strategy.
MQAT guides a systematic gradated modular quantization sequence and determines module-specific bit precisions, leading to quantized models that outperform those produced by state-of-the-art uniform and mixed-precision quantization techniques.
arXiv Detail & Related papers (2023-03-12T21:01:54Z) - Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
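The Quantization Aware Training baseline this entry refers to can be sketched in a few lines: the forward pass uses fake-quantized weights, and the Straight-Through Estimator (STE) passes gradients through the non-differentiable rounding unchanged. A minimal numpy sketch, not any particular library's implementation; the symmetric int8 scheme and helper names are assumptions.

```python
import numpy as np

def fake_quantize(w: np.ndarray, num_bits: int = 8) -> np.ndarray:
    """Uniform symmetric fake-quantization used in QAT's forward pass."""
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 127 for int8
    absmax = np.abs(w).max()
    scale = absmax / qmax if absmax > 0 else 1.0
    # quantize to the integer grid, then dequantize back to float
    return np.round(w / scale).clip(-qmax, qmax) * scale

def ste_grad(grad_out: np.ndarray) -> np.ndarray:
    """Straight-Through Estimator: rounding has zero gradient almost
    everywhere, so the backward pass simply passes the gradient through."""
    return grad_out

w = np.array([0.1, -0.5, 1.27])
wq = fake_quantize(w)                     # forward uses quantized weights
g = ste_grad(np.array([1.0, 1.0, 1.0]))   # backward ignores the rounding
print(wq)
```

The paper above generalizes this recipe by applying the quantization noise to only a random subset of weights each step, which is what lets it reach compression regimes where full-tensor STE training breaks down.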
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.