Related papers: Mixture-of-Channels: Exploiting Sparse FFNs for Efficient LLMs Pre-Training and Inference

URL: http://arxiv.org/abs/2511.09323v1
Date: Thu, 13 Nov 2025 01:46:27 GMT
Title: Mixture-of-Channels: Exploiting Sparse FFNs for Efficient LLMs Pre-Training and Inference
Authors: Tong Wu, Yutong He, Bin Wang, Kun Yuan
Abstract summary: Large language models (LLMs) have demonstrated remarkable success across diverse artificial intelligence tasks. MoC substantially reduces activation memory during pre-training. MoC delivers significant memory savings and throughput gains while maintaining competitive model performance.
Score: 16.71963410333802
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have demonstrated remarkable success across diverse artificial intelligence tasks, driven by scaling laws that correlate model size and training data with performance improvements. However, this scaling paradigm incurs substantial memory overhead, creating significant challenges for both training and inference. While existing research has primarily addressed parameter and optimizer state memory reduction, activation memory, particularly from feed-forward networks (FFNs), has become the critical bottleneck, especially when FlashAttention is implemented. In this work, we conduct a detailed memory profiling of LLMs and identify FFN activations as the predominant source of activation memory overhead. Motivated by this, we introduce Mixture-of-Channels (MoC), a novel FFN architecture that selectively activates only the Top-K most relevant channels per token, as determined by SwiGLU's native gating mechanism. MoC substantially reduces activation memory during pre-training and improves inference efficiency by reducing memory access through partial weight loading into GPU SRAM. Extensive experiments validate that MoC delivers significant memory savings and throughput gains while maintaining competitive model performance.
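Illustrative sketch (not from the paper): the abstract describes selecting the Top-K channels per token using SwiGLU's gate. The pure-Python snippet below shows one plausible reading of that idea for a single token. The function name `moc_ffn`, the row-per-channel weight layout, and the choice to rank channels by their SiLU gate value are all assumptions made for illustration; the paper's actual selection rule and kernels may differ.

```python
import math

def silu(z):
    """SiLU (swish) activation used by the SwiGLU gate."""
    return z / (1.0 + math.exp(-z))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def moc_ffn(x, W_gate, W_up, W_down, k):
    """Top-K gated FFN for one token (hypothetical layout).

    W_gate, W_up, W_down each hold one row of length d_model per
    FFN channel. Only the k channels with the largest SiLU gate
    value (an assumed ranking rule) contribute to the output.
    """
    gate = [silu(g) for g in matvec(W_gate, x)]
    # Indices of the k channels with the largest gate activation.
    top = sorted(range(len(gate)), key=lambda c: gate[c], reverse=True)[:k]
    out = [0.0] * len(x)
    for c in top:
        # Up projection is computed only for the selected channels.
        up_c = sum(w * xi for w, xi in zip(W_up[c], x))
        h = gate[c] * up_c
        # Down projection touches only the selected rows of W_down.
        for j, w in enumerate(W_down[c]):
            out[j] += h * w
    return out, sorted(top)

# Toy example: d_model = 2, three FFN channels, keep one channel.
x = [1.0, 2.0]
W_gate = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
W_up = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
W_down = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y, active = moc_ffn(x, W_gate, W_up, W_down, k=1)
```

With `k` equal to the number of channels, this sketch reduces to an ordinary dense SwiGLU FFN. With smaller `k`, only the selected rows of `W_up` and `W_down` are read, which is the kind of partial weight access the abstract associates with the inference savings.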

Related papers

MemSifter: Offloading LLM Memory Retrieval via Outcome-Driven Proxy Reasoning [78.46301394559903]
Large Language Models (LLMs) are increasingly used for long-duration tasks. Current methods face a trade-off between cost and accuracy. MemSifter is a novel framework that offloads the memory retrieval process to a small-scale proxy model.
arXiv Detail & Related papers (2026-03-03T02:57:38Z)
MSN: A Memory-based Sparse Activation Scaling Framework for Large-scale Industrial Recommendation [19.132874291460936]
We propose MSN, a memory-based sparse activation scaling framework for recommendation models. MSN retrieves personalized representations from a large parameterized memory and integrates them into downstream feature interaction modules. MSN consistently improves recommendation performance while maintaining high efficiency.
arXiv Detail & Related papers (2026-02-07T12:43:51Z)
TokenSeek: Memory Efficient Fine Tuning via Instance-Aware Token Ditching [82.13572707265513]
Fine-tuning has been regarded as a de facto approach for adapting large language models to downstream tasks. We propose TokenSeek, a universal plugin solution for various transformer-based models through instance-aware token seeking and ditching.
arXiv Detail & Related papers (2026-01-27T15:58:36Z)
RevFFN: Memory-Efficient Full-Parameter Fine-Tuning of Mixture-of-Experts LLMs with Reversible Blocks [12.966077380225856]
RevFFN is a memory-efficient fine-tuning paradigm for mixture-of-experts (MoE) LLMs. RevFFN employs carefully designed reversible Transformer blocks that allow reconstruction of layer input activations from outputs during backpropagation.
arXiv Detail & Related papers (2025-12-24T03:56:58Z)
Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction [72.27673320976933]
Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding. Current caching techniques accelerate decoding by storing full-layer states, yet impose substantial memory usage. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention.
arXiv Detail & Related papers (2025-08-04T16:14:03Z)
DAF: An Efficient End-to-End Dynamic Activation Framework for on-Device DNN Training [41.09085549544767]
We introduce a Dynamic Activation Framework (DAF) that enables scalable and efficient on-device training through system-level optimizations. DAF achieves both memory- and time-efficient dynamic quantization training by addressing key system bottlenecks. Evaluations on various deep learning models across embedded and mobile platforms demonstrate up to a $22.9\times$ reduction in memory usage and a $3.2\times$ speedup.
arXiv Detail & Related papers (2025-07-09T08:59:30Z)
MNN-LLM: A Generic Inference Engine for Fast Large Language Model Deployment on Mobile Devices [4.385815629175844]
MNN-LLM is a framework specifically designed to accelerate the deployment of large language models on mobile devices. It addresses the runtime characteristics of LLMs through model quantization and DRAM-Flash hybrid storage. Notably, MNN-LLM achieves up to an 8.6x speed increase compared to current mainstream LLM-specific frameworks.
arXiv Detail & Related papers (2025-06-12T07:45:29Z)
Quantifying Memory Utilization with Effective State-Size [73.52115209375343]
We develop a measure of memory utilization. This metric is tailored to the fundamental class of systems with input-invariant and input-varying linear operators.
arXiv Detail & Related papers (2025-04-28T08:12:30Z)
MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training [24.066283519769968]
Large Language Models (LLMs) have been trained using extended context lengths to foster more creative applications. We propose MEMO, a novel framework for fine-grained activation memory management. MEMO achieves an average of 1.97x and 1.80x MFU compared to Megatron-LM and DeepSpeed, respectively.
arXiv Detail & Related papers (2024-07-16T18:59:49Z)
SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios. In the early route, intermediate outputs are consolidated via an anti-redundancy operation. In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
MemLLM: Finetuning LLMs to Use An Explicit Read-Write Memory [49.96019697955383]
We introduce MemLLM, a novel method of enhancing large language models (LLMs) by integrating a structured and explicit read-and-write memory module. Our experiments indicate that MemLLM enhances the LLM's performance and interpretability, in language modeling in general and knowledge-intensive tasks in particular.
arXiv Detail & Related papers (2024-04-17T18:13:16Z)
FedMef: Towards Memory-efficient Federated Dynamic Pruning [42.07105095641134]
Federated learning (FL) promotes decentralized training while prioritizing data confidentiality. Its application on resource-constrained devices is challenging due to the high demand for computation and memory resources to train deep learning models. We propose FedMef, a novel and memory-efficient federated dynamic pruning framework.
arXiv Detail & Related papers (2024-03-21T13:54:36Z)
Online Adaptation of Language Models with a Memory of Amortized Contexts [82.02369596879817]
Memory of Amortized Contexts (MAC) is an efficient and effective online adaptation framework for large language models. We show how MAC can be combined with and improve the performance of popular alternatives such as retrieval augmented generations.
arXiv Detail & Related papers (2024-03-07T08:34:57Z)
Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference [23.207326766883405]
Mixture-of-Experts (MoE) is able to scale its model size without proportionally scaling up its computational requirements. Pre-gated MoE employs our novel pre-gating function, which alleviates the dynamic nature of sparse expert activation. We demonstrate that Pre-gated MoE is able to improve performance and reduce GPU memory consumption while maintaining the same level of model quality.
arXiv Detail & Related papers (2023-08-23T11:25:37Z)

This list is automatically generated from the titles and abstracts of the papers on this site.