Flash Multi-Head Feed-Forward Network
- URL: http://arxiv.org/abs/2512.06989v1
- Date: Sun, 07 Dec 2025 20:50:20 GMT
- Title: Flash Multi-Head Feed-Forward Network
- Authors: Minshen Zhang, Xiang Hu, Jianguo Li, Wei Wu, Kewei Tu
- Abstract summary: Multi-Head FFN (MH-FFN) is motivated by the structural similarity between single-head attention and FFN. MH-FFN faces two challenges: memory consumption scaling with the head count, and an imbalanced ratio between the growing intermediate size and the fixed head dimension. We propose Flash Multi-Head FFN (FlashMHF), with two key innovations: an I/O-aware fused kernel computing outputs online in SRAM, akin to FlashAttention, and a design using dynamically weighted parallel sub-networks.
- Score: 51.82159978122374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We explore Multi-Head FFN (MH-FFN) as a replacement for the FFN in the Transformer architecture, motivated by the structural similarity between single-head attention and FFN. While multi-head mechanisms enhance expressivity in attention, naively applying them to FFNs faces two challenges: memory consumption scaling with the head count, and an imbalanced ratio between the growing intermediate size and the fixed head dimension as models scale, which degrades scalability and expressive power. To address these challenges, we propose Flash Multi-Head FFN (FlashMHF), with two key innovations: an I/O-aware fused kernel computing outputs online in SRAM, akin to FlashAttention, and a design using dynamically weighted parallel sub-networks to maintain a balanced ratio between intermediate and head dimensions. Validated on models from 128M to 1.3B parameters, FlashMHF consistently improves perplexity and downstream task accuracy over SwiGLU FFNs, while reducing peak memory usage by 3-5x and accelerating inference by up to 1.08x. Our work establishes the multi-head design as a superior architectural principle for FFNs, presenting FlashMHF as a powerful, efficient, and scalable alternative to FFNs in Transformers.
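To make the multi-head FFN idea concrete, here is a minimal NumPy sketch of the basic structure the abstract describes: the model dimension is split into H heads, each head runs its own small two-layer FFN, and the head outputs are concatenated. The shapes, the ReLU activation, and the per-head expansion factor are illustrative assumptions, not the paper's exact formulation (FlashMHF additionally fuses this computation into an I/O-aware kernel and weights the sub-networks dynamically, which this sketch does not attempt).

```python
import numpy as np

def mh_ffn(x, w1, w2):
    """Hypothetical multi-head FFN sketch.

    x:  (tokens, d_model) input activations
    w1: (H, d_head, d_ff) per-head up-projection, with d_model = H * d_head
    w2: (H, d_ff, d_head) per-head down-projection
    """
    H, d_head, d_ff = w1.shape
    tokens, d_model = x.shape
    assert d_model == H * d_head, "model dim must split evenly across heads"
    # Split the model dimension into H heads: (H, tokens, d_head)
    heads = x.reshape(tokens, H, d_head).transpose(1, 0, 2)
    # Each head applies its own small two-layer FFN (ReLU assumed here)
    hidden = np.maximum(heads @ w1, 0.0)   # (H, tokens, d_ff)
    out = hidden @ w2                      # (H, tokens, d_head)
    # Concatenate head outputs back into the model dimension
    return out.transpose(1, 0, 2).reshape(tokens, d_model)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))            # 4 tokens, d_model = 8
w1 = 0.1 * rng.standard_normal((2, 4, 16)) # H = 2, d_head = 4, d_ff = 16
w2 = 0.1 * rng.standard_normal((2, 16, 4))
y = mh_ffn(x, w1, w2)
print(y.shape)  # (4, 8)
```

Note how the naive version materializes a (H, tokens, d_ff) intermediate tensor; the memory scaling with head count that motivates the fused kernel is visible directly in that shape.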
Related papers
- Explicit Multi-head Attention for Inter-head Interaction in Large Language Models [70.96854312026319]
Multi-head Explicit Attention (MEA) is a simple yet effective attention variant that explicitly models cross-head interaction. MEA shows strong robustness in pretraining, which allows the use of larger learning rates that lead to faster convergence. This enables a practical key-value cache compression strategy that reduces KV-cache memory usage by 50% with negligible performance loss.
arXiv Detail & Related papers (2026-01-27T13:45:03Z) - Mosaic: Unlocking Long-Context Inference for Diffusion LLMs via Global Memory Planning and Dynamic Peak Taming [34.16016695663811]
Diffusion-based large language models (dLLMs) have emerged as a promising paradigm, utilizing simultaneous denoising to enable global planning and iterative refinement. Existing inference systems are ill-suited for this paradigm due to severe system inefficiencies. We propose Mosaic, a memory-efficient inference system that shifts from local, static management to a global, dynamic paradigm.
arXiv Detail & Related papers (2026-01-10T13:17:08Z) - MIDUS: Memory-Infused Depth Up-Scaling [20.802982614533615]
Depth Up-Scaling (DUS) has emerged as a promising strategy by duplicating layers and applying Continual Pre-training (CPT). We introduce Memory-Infused Depth Up-Scaling (MIDUS), which replaces FFNs in duplicated blocks with a head-wise memory layer. Our findings establish MIDUS as a compelling and resource-efficient alternative to conventional FFN replication for depth up-scaling.
arXiv Detail & Related papers (2025-12-15T05:50:45Z) - Attention Is Not All You Need: The Importance of Feedforward Networks in Transformer Models [0.0]
State-of-the-art models can have over a hundred transformer blocks, containing billions of trainable parameters, and are trained on trillions of tokens of text. We show that models using a transformer block configuration with three-layer FFNs and fewer such blocks outperform the standard two-layer configuration, delivering lower training loss with fewer total parameters in less time.
arXiv Detail & Related papers (2025-05-10T12:54:21Z) - Adapter-X: A Novel General Parameter-Efficient Fine-Tuning Framework for Vision [52.80792724919329]
We introduce a novel framework named Adapter-X to improve fine-tuning in 2D image and 3D point cloud modalities.
It is the first to outperform full fine-tuning in both 2D image and 3D point cloud modalities with significantly fewer parameters, i.e., only 0.20% and 1.88% of original trainable parameters for 2D and 3D classification tasks.
arXiv Detail & Related papers (2024-06-05T08:26:44Z) - A Lightweight Attention-based Deep Network via Multi-Scale Feature Fusion for Multi-View Facial Expression Recognition [2.9581436761331017]
We introduce a lightweight attentional network incorporating multi-scale feature fusion (LANMSFF) to tackle these issues. We present two novel components, namely mass attention (MassAtt) and point-wise feature selection (PWFS) blocks. Our proposed approach achieved results comparable to state-of-the-art methods in terms of parameter count and robustness to pose variation.
arXiv Detail & Related papers (2024-03-21T11:40:51Z) - PartialFormer: Modeling Part Instead of Whole for Machine Translation [40.67489864907433]
We introduce PartialFormer, a parameter-efficient Transformer architecture utilizing multiple smaller FFNs.
These smaller FFNs are integrated into a multi-head attention mechanism for effective collaboration.
Experiments on 9 translation tasks and 1 abstractive summarization task validate the effectiveness of our PartialFormer approach.
arXiv Detail & Related papers (2023-10-23T13:25:54Z) - One Wide Feedforward is All You Need [3.043080042012617]
The Transformer architecture has two main non-embedding components: Attention and the Feed Forward Network (FFN).
In this work we explore the role of the FFN, and find that despite taking up a significant fraction of the model's parameters, it is highly redundant.
We are able to substantially reduce the number of parameters with only a modest drop in accuracy by removing the FFN on the decoder layers and sharing a single FFN across the encoder.
arXiv Detail & Related papers (2023-09-04T21:30:21Z) - MF-NeRF: Memory Efficient NeRF with Mixed-Feature Hash Table [62.164549651134465]
We propose MF-NeRF, a memory-efficient NeRF framework that employs a Mixed-Feature hash table to improve memory efficiency and reduce training time while maintaining reconstruction quality.
Our experiments with state-of-the-art Instant-NGP, TensoRF, and DVGO, indicate our MF-NeRF could achieve the fastest training time on the same GPU hardware with similar or even higher reconstruction quality.
arXiv Detail & Related papers (2023-04-25T05:44:50Z) - Inception Transformer [151.939077819196]
Inception Transformer, or iFormer, learns comprehensive features with both high- and low-frequency information in visual data.
We benchmark the iFormer on a series of vision tasks, and showcase that it achieves impressive performance on image classification, COCO detection and ADE20K segmentation.
arXiv Detail & Related papers (2022-05-25T17:59:54Z) - MicroNet: Towards Image Recognition with Extremely Low FLOPs [117.96848315180407]
MicroNet is an efficient convolutional neural network using extremely low computational cost.
A family of MicroNets achieve a significant performance gain over the state-of-the-art in the low FLOP regime.
For instance, MicroNet-M1 achieves 61.1% top-1 accuracy on ImageNet classification with 12 MFLOPs, outperforming MobileNetV3 by 11.3%.
arXiv Detail & Related papers (2020-11-24T18:59:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.