TwIST: Rigging the Lottery in Transformers with Independent Subnetwork Training
- URL: http://arxiv.org/abs/2511.03983v1
- Date: Thu, 06 Nov 2025 02:13:24 GMT
- Title: TwIST: Rigging the Lottery in Transformers with Independent Subnetwork Training
- Authors: Michael Menezes, Barbara Su, Xinze Feng, Yehya Farhat, Hamza Shili, Anastasios Kyrillidis
- Abstract summary: TwIST is a distributed training framework for efficient large language model sparsification. It trains multiple subnetworks in parallel, periodically aggregates their parameters, and resamples new subnetworks during training. It identifies high-quality subnetworks ("golden tickets") without requiring post-training procedures such as calibration or Hessian-based recovery.
- Score: 6.7228358095570995
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce TwIST, a distributed training framework for efficient large language model (LLM) sparsification. TwIST trains multiple subnetworks in parallel, periodically aggregates their parameters, and resamples new subnetworks during training. This process identifies high-quality subnetworks ("golden tickets") without requiring post-training procedures such as calibration or Hessian-based recovery. As a result, TwIST enables zero-cost pruning at deployment time while achieving perplexity competitive with state-of-the-art post-training sparsification methods. The benefits are most pronounced under aggressive sparsity (e.g., 50%+), where TwIST significantly outperforms baseline methods; for example, reaching 23.14 PPL compared to 31.64 for the closest prior approach. Unlike unstructured pruning, TwIST produces structured, dense matrices that offer practical inference speedups and memory reductions on commodity hardware (e.g., CPUs) that do not support efficient sparse computation. TwIST provides an efficient training-time path to deployable sparse LLMs without additional fine-tuning or recovery overhead.
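The abstract outlines the TwIST loop: several workers train independently sampled subnetworks in parallel, their parameters are periodically averaged back into the shared dense model, and fresh subnetworks are resampled for the next round. The sketch below illustrates that loop on a toy linear layer with row-wise (structured) masks; the mask sampler, aggregation rule, and hyperparameters are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of a TwIST-style training loop, assuming a toy linear model,
# random row-wise (structured) masks, and simple mask-weighted averaging;
# these choices are illustrative assumptions, not the paper's exact procedure.
import copy
import torch
import torch.nn as nn

def sample_structured_mask(layer: nn.Linear, sparsity: float) -> torch.Tensor:
    """Randomly keep a fraction of output rows (structured sparsity)."""
    keep = max(1, int(layer.out_features * (1.0 - sparsity)))
    mask = torch.zeros(layer.out_features)
    mask[torch.randperm(layer.out_features)[:keep]] = 1.0
    return mask

def train_subnetwork(model: nn.Linear, mask: torch.Tensor, data, steps: int) -> None:
    """Train only the rows kept by the mask (masked rows receive zero gradient)."""
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()
    x, y = data
    for _ in range(steps):
        loss = loss_fn(model(x) * mask, y * mask)
        opt.zero_grad()
        loss.backward()
        opt.step()

torch.manual_seed(0)
dense = nn.Linear(16, 8)                          # shared "dense" model
data = (torch.randn(64, 16), torch.randn(64, 8))
sparsity, num_workers, rounds = 0.5, 4, 3

for _ in range(rounds):
    workers, masks = [], []
    for _ in range(num_workers):
        # Resample a new subnetwork and give the worker a copy of the dense weights.
        mask = sample_structured_mask(dense, sparsity)
        local = copy.deepcopy(dense)
        train_subnetwork(local, mask, data, steps=20)
        workers.append(local)
        masks.append(mask)
    # Periodic aggregation: average each row over the workers that trained it.
    with torch.no_grad():
        counts = torch.stack(masks).sum(dim=0)    # how many workers kept each row
        covered = counts > 0
        denom = counts.clamp(min=1.0)
        avg_w = sum(m.unsqueeze(1) * w.weight for m, w in zip(masks, workers)) / denom.unsqueeze(1)
        avg_b = sum(m * w.bias for m, w in zip(masks, workers)) / denom
        dense.weight[covered] = avg_w[covered]
        dense.bias[covered] = avg_b[covered]
```

At deployment, keeping only the rows selected by a final mask yields a smaller dense matrix directly, which is consistent with the abstract's claim of zero-cost pruning without calibration or recovery.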
Related papers
- Winning the Lottery by Preserving Network Training Dynamics with Concrete Ticket Search [0.5156484100374058]
The Lottery Ticket Hypothesis asserts the existence of highly sparse, trainable subnetworks ("winning tickets") within dense, randomly initialized neural networks. State-of-the-art methods for drawing these tickets, like Lottery Ticket Rewinding, are computationally prohibitive. In this work, we argue that the reliance of pruning-at-initialization (PaI) methods on first-order saliency metrics, which ignore inter-weight dependencies, contributes substantially to this performance gap.
arXiv Detail & Related papers (2025-12-08T03:48:51Z)
- PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. We propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
arXiv Detail & Related papers (2025-09-27T03:01:48Z)
- LoSiA: Efficient High-Rank Fine-Tuning via Subnet Localization and Optimization [6.641493851051085]
LoSiA (Low-Resources Subnet Integration Adaptation) is an innovative method that dynamically localizes and optimizes critical parameters during the training process. We present LoSiA-Pro, a faster implementation of LoSiA, which reduces training latency by about 27% compared to LoRA.
arXiv Detail & Related papers (2025-07-06T17:51:57Z)
- UniPTS: A Unified Framework for Proficient Post-Training Sparsity [67.16547529992928]
Post-training Sparsity (PTS) is a newly emerged avenue that pursues efficient network sparsity using only limited data.
In this paper, we attempt to reconcile this disparity by transposing three cardinal factors that profoundly alter the performance of conventional sparsity into the context of PTS.
Our framework, termed UniPTS, is validated to be much superior to existing PTS methods across extensive benchmarks.
arXiv Detail & Related papers (2024-05-29T06:53:18Z)
- Trainability Preserving Neural Structured Pruning [64.65659982877891]
We present trainability preserving pruning (TPP), a regularization-based structured pruning method that can effectively maintain trainability during sparsification.
TPP can compete with the ground-truth dynamical isometry recovery method on linear networks.
It delivers encouraging performance in comparison to many top-performing filter pruning methods.
arXiv Detail & Related papers (2022-07-25T21:15:47Z)
- Distributed Adversarial Training to Robustify Deep Neural Networks at Scale [100.19539096465101]
Current deep neural networks (DNNs) are vulnerable to adversarial attacks, where adversarial perturbations to the inputs can change or manipulate classification.
To defend against such attacks, an effective approach, known as adversarial training (AT), has proven effective at improving model robustness.
We propose a large-batch adversarial training framework implemented over multiple machines.
arXiv Detail & Related papers (2022-06-13T15:39:43Z)
- Superposing Many Tickets into One: A Performance Booster for Sparse Neural Network Training [32.30355584300427]
We present a novel sparse training approach, termed Sup-tickets, which can satisfy two desiderata concurrently in a single sparse-to-sparse training process.
Across various modern architectures on CIFAR-10/100 and ImageNet, we show that Sup-tickets integrates seamlessly with the existing sparse training methods.
arXiv Detail & Related papers (2022-05-30T16:01:32Z)
- Sparsity Winning Twice: Better Robust Generalization from More Efficient Training [94.92954973680914]
We introduce two alternatives for sparse adversarial training: (i) static sparsity and (ii) dynamic sparsity.
We find that both methods yield a win-win: substantially shrinking the robust generalization gap and alleviating robust overfitting.
Our approaches can be combined with existing regularizers, establishing new state-of-the-art results in adversarial training.
arXiv Detail & Related papers (2022-02-20T15:52:08Z)
- FreeTickets: Accurate, Robust and Efficient Deep Ensemble by Training with Dynamic Sparsity [74.58777701536668]
We introduce the FreeTickets concept, which can boost the performance of sparse convolutional neural networks over their dense network equivalents by a large margin.
We propose two novel efficient ensemble methods with dynamic sparsity, which, in one shot, yield many diverse and accurate tickets "for free" during the sparse training process.
arXiv Detail & Related papers (2021-06-28T10:48:20Z)
- [Reproducibility Report] Rigging the Lottery: Making All Tickets Winners [1.6884611234933766]
RigL, a sparse training algorithm, claims to directly train sparse networks that match or exceed the performance of existing dense-to-sparse training techniques.
We implement RigL from scratch in PyTorch and reproduce its performance on CIFAR-10 within 0.1% of the reported value.
arXiv Detail & Related papers (2021-03-29T17:01:11Z)
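For context on the "Rigging the Lottery" baseline reproduced above, the sketch below illustrates a RigL-style connectivity update as described in the original RigL paper: periodically drop the smallest-magnitude active weights and regrow the same number of inactive connections with the largest gradient magnitudes. The function name, shapes, and example values are illustrative assumptions, not code from the reproducibility report.

```python
# Illustrative RigL-style connectivity update (magnitude-based drop,
# gradient-based grow); an assumption-level sketch, not the report's code.
import torch

def rigl_update(weight: torch.Tensor, grad: torch.Tensor,
                mask: torch.Tensor, drop_fraction: float) -> torch.Tensor:
    """Drop the smallest-magnitude active weights, then regrow the same number
    of connections outside the current mask with the largest gradient magnitude."""
    active = mask.bool()
    n_drop = int(drop_fraction * active.sum().item())
    if n_drop == 0:
        return mask

    # Drop: among active weights, remove those with the smallest magnitude.
    drop_scores = torch.where(active, weight.abs(),
                              torch.full_like(weight, float("inf")))
    drop_idx = torch.topk(drop_scores.flatten(), n_drop, largest=False).indices

    new_mask = mask.clone().flatten()
    new_mask[drop_idx] = 0.0

    # Grow: among connections outside the current mask, enable those with the
    # largest (dense) gradient magnitude.
    grow_scores = torch.where(~active, grad.abs(),
                              torch.full_like(grad, float("-inf")))
    grow_idx = torch.topk(grow_scores.flatten(), n_drop, largest=True).indices
    new_mask[grow_idx] = 1.0
    return new_mask.view_as(mask)

# Example: one update step on a 4x4 weight matrix at roughly 50% sparsity.
w = torch.randn(4, 4)
g = torch.randn(4, 4)           # dense gradient, computed periodically in RigL
m = (torch.rand(4, 4) > 0.5).float()
m = rigl_update(w, g, m, drop_fraction=0.3)
```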