Memory-Efficient Training with In-Place FFT Implementation
- URL: http://arxiv.org/abs/2511.01385v1
- Date: Mon, 03 Nov 2025 09:36:11 GMT
- Title: Memory-Efficient Training with In-Place FFT Implementation
- Authors: Xinyu Ding, Bangtian Liu, Siyu Liao, Zhongfeng Wang
- Abstract summary: Existing implementations, including standard FFT and real FFT, cannot achieve true in-place computation. We propose the first real-domain, fully in-place FFT framework (rdFFT) that preserves input-output memory space consistency.
- Score: 5.474695910716561
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fast Fourier Transforms (FFT) are widely used to reduce memory and computational costs in deep learning. However, existing implementations, including standard FFT and real FFT (rFFT), cannot achieve true in-place computation. In particular, rFFT maps an input of size n to a complex output of size n/2+1, causing dimensional mismatch and requiring additional memory allocation. We propose the first real-domain, fully in-place FFT framework (rdFFT) that preserves input-output memory space consistency. By leveraging butterfly operation symmetry and conjugate properties in the frequency domain, we design an implicit complex encoding scheme that eliminates intermediate cache usage entirely. Experiments on multiple natural language understanding tasks demonstrate the method effectiveness in reducing training memory cost, offering a promising direction for frequency-domain lightweight adaptation.
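The dimensional mismatch described in the abstract, and the conjugate-symmetry idea behind the fix, can be sketched with NumPy. This is an illustration only, not the paper's rdFFT: it shows that `rfft` of a length-n real signal produces n/2+1 complex values (too large for the input buffer), yet conjugate symmetry means only n independent real numbers exist, so the spectrum can in principle be packed back into the original memory footprint (the "halfcomplex" layout used, e.g., by FFTW's real-to-real API).

```python
import numpy as np

# Sketch (not the paper's rdFFT): why rFFT cannot be in-place as-is,
# and how conjugate symmetry lets the spectrum fit the input buffer.
n = 8
x = np.random.default_rng(0).standard_normal(n)

X = np.fft.rfft(x)                  # n/2 + 1 complex outputs
assert X.shape[0] == n // 2 + 1     # 5 complex values for n = 8
assert X.nbytes > x.nbytes          # 80 bytes vs 64: does not fit in place

# Pack into exactly n reals: for real input, X[0] and X[n/2] are purely
# real, so only n independent real numbers exist in the spectrum.
packed = np.empty(n)
packed[0] = X[0].real
packed[1:n // 2] = X[1:n // 2].real
packed[n // 2] = X[n // 2].real
packed[n // 2 + 1:] = X[1:n // 2].imag[::-1]

# Unpack and invert to confirm no information was lost.
Y = np.empty(n // 2 + 1, dtype=complex)
Y[0] = packed[0]
Y[n // 2] = packed[n // 2]
Y[1:n // 2] = packed[1:n // 2] + 1j * packed[:n // 2:-1]
x_rec = np.fft.irfft(Y, n)
assert np.allclose(x_rec, x)
```

The packing step here is done after a full out-of-place `rfft`; the paper's contribution is performing the butterfly stages themselves in place so no intermediate buffer is ever allocated.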
Related papers
- Feature-Modulated UFNO for Improved Prediction of Multiphase Flow in Porous Media [0.39146761527401425]
We introduce UFNO-FiLM, an enhanced architecture that incorporates two key innovations. First, we decouple scalar inputs from spatial features using a Feature-wise Linear Modulation layer. Second, we employ a spatially weighted loss function that prioritizes learning in critical regions.
arXiv Detail & Related papers (2025-11-25T17:44:28Z) - TNT: Improving Chunkwise Training for Test-Time Memorization [62.78875147721906]
Recurrent neural networks (RNNs) with deep test-time memorization modules, such as Titans and TTT, represent a promising, linearly-scaling paradigm distinct from Transformers. We introduce TNT, a novel training paradigm that decouples training efficiency from inference performance through a two-stage process. TNT achieves a substantial acceleration in training speed, up to 17 times faster than the most accurate baseline configuration.
arXiv Detail & Related papers (2025-11-10T17:45:09Z) - Sprint: Sparse-Dense Residual Fusion for Efficient Diffusion Transformers [91.02299679350834]
Diffusion Transformers (DiTs) deliver state-of-the-art generative performance but their quadratic training cost with sequence length makes large-scale pretraining prohibitively expensive. We present Sparse-Dense Residual Fusion for Efficient Diffusion Transformers, a simple method that enables aggressive token dropping (up to 75%) while preserving quality.
arXiv Detail & Related papers (2025-10-24T19:29:55Z) - Orthogonal Finetuning Made Scalable [92.34573849209238]
Orthogonal finetuning (OFT) offers highly parameter-efficient adaptation while preventing catastrophic forgetting, but its high runtime and memory demands limit practical deployment. We identify the core computational bottleneck in OFT as its weight-centric implementation, which relies on costly matrix-matrix multiplications with cubic complexity. We propose OFTv2, an input-centric reformulation that instead uses matrix-vector multiplications (i.e., matrix-free computation), reducing the computational cost to quadratic. These modifications allow OFTv2 to achieve up to 10x faster training and 3x lower GPU memory usage without
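The weight-centric vs input-centric distinction in this summary reduces to matrix-product associativity. The sketch below is illustrative (not the OFTv2 implementation): applying an orthogonal transform to a linear layer either by materializing the rotated weight matrix (cubic cost in the dimension) or by chaining two matrix-vector products (quadratic cost) gives identical outputs.

```python
import numpy as np

# Illustrative sketch of the weight-centric vs input-centric trade-off
# (not the OFTv2 code): R is an orthogonal transform, W a linear layer.
rng = np.random.default_rng(3)
d = 64
R, _ = np.linalg.qr(rng.standard_normal((d, d)))  # orthogonal matrix
W = rng.standard_normal((d, d))
x = rng.standard_normal(d)

y_weight_centric = (R @ W) @ x  # materializes R @ W: O(d^3) matmul
y_input_centric = R @ (W @ x)   # two matrix-vector products: O(d^2)
assert np.allclose(y_weight_centric, y_input_centric)
```

Because the results agree exactly, the reformulation changes only where the cost is paid, not what the layer computes.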
arXiv Detail & Related papers (2025-06-24T17:59:49Z) - Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache [67.47789629197857]
We propose a training-free framework that exploits the heterogeneous roles of transformer head dimensions. By projecting the long-context-insensitive dimensions onto Fourier bases, FourierAttention approximates their temporal evolution with fixed-length spectral coefficients. We show that FourierAttention achieves the best long-context accuracy on LongBench and Needle-In-A-Haystack.
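The core compression idea above can be sketched in a few lines. This is illustrative, not the FourierAttention code: a long per-dimension trajectory whose variation is low-frequency can be stored as a small, fixed number of spectral coefficients and reconstructed from them.

```python
import numpy as np

# Sketch (not the FourierAttention implementation): a length-t trajectory
# dominated by low frequencies is summarized by m spectral coefficients.
t = 128
n = np.arange(t)
v = np.cos(2 * np.pi * 3 * n / t) + 0.5 * np.sin(2 * np.pi * 5 * n / t)

m = 8                                # fixed-length spectral budget
coeffs = np.fft.rfft(v)[:m]          # keep m low-frequency coefficients

padded = np.zeros(t // 2 + 1, dtype=complex)
padded[:m] = coeffs
v_rec = np.fft.irfft(padded, t)      # reconstruct from m coefficients
assert np.allclose(v_rec, v)         # band-limited signal: lossless here
```

For real cache trajectories the reconstruction is lossy, and the premise is that the "long-context-insensitive" dimensions tolerate that loss.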
arXiv Detail & Related papers (2025-06-13T15:35:54Z) - When Foresight Pruning Meets Zeroth-Order Optimization: Efficient Federated Learning for Low-Memory Devices [36.23767349592602]
Federated Learning (FL) enables collaborative learning in Artificial Intelligence of Things (AIoT) design.
FL fails to work on low-memory AIoT devices due to its heavy memory usage.
We propose a federated foresight pruning method based on Neural Tangent Kernel (NTK), which can seamlessly integrate with federated BP-Free training frameworks.
arXiv Detail & Related papers (2024-05-08T02:24:09Z) - Towards A Unified View of Sparse Feed-Forward Network in Pretraining Large Language Model [58.9100867327305]
Large and sparse feed-forward layers (S-FFN) have proven effective in scaling up Transformer model size for pretraining large language models.
We analyzed two major design choices of S-FFN: the memory block (a.k.a. expert) size and the memory block selection method.
We found a simpler selection method, Avg-K, that selects blocks through their mean aggregated hidden states, achieving lower perplexity in language model pretraining.
arXiv Detail & Related papers (2023-05-23T12:28:37Z) - Transform Once: Efficient Operator Learning in Frequency Domain [69.74509540521397]
We study deep neural networks designed to harness the structure in frequency domain for efficient learning of long-range correlations in space or time.
This work introduces a blueprint for frequency domain learning through a single transform: transform once (T1).
arXiv Detail & Related papers (2022-11-26T01:56:05Z) - Fast Partial Fourier Transform [28.36925669222461]
Fast Fourier transform (FFT) is a widely used algorithm that computes the discrete Fourier transform in many machine learning applications.
Despite its pervasive use, no known FFT algorithm offers an option for the user to compute only the coefficients they actually need.
In this paper, we propose a fast Partial Fourier Transform (PFT), a careful modification of the Cooley-Tukey algorithm that enables one to specify an arbitrary consecutive range where the coefficients should be computed.
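The interface such a partial transform offers can be illustrated with a naive direct evaluation. This is not the paper's PFT algorithm (which modifies Cooley-Tukey to stay fast); it only shows what "an arbitrary consecutive range of coefficients" means, verified against a full FFT.

```python
import numpy as np

# Illustration only: computing a consecutive band of DFT coefficients.
# This naive O(n * m) evaluation is NOT the PFT algorithm itself.
def partial_dft(x, start, stop):
    """Return DFT coefficients X[start:stop] of a 1-D signal x."""
    n = len(x)
    k = np.arange(start, stop)
    t = np.arange(n)
    # Direct evaluation of X[k] = sum_t x[t] * exp(-2j*pi*k*t/n)
    return np.exp(-2j * np.pi * np.outer(k, t) / n) @ x

x = np.random.default_rng(1).standard_normal(32)
band = partial_dft(x, 4, 12)
assert np.allclose(band, np.fft.fft(x)[4:12])
```

The direct evaluation is only worthwhile when the band is very narrow; PFT's contribution is achieving this interface at FFT-like cost for ranges of any width.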
arXiv Detail & Related papers (2020-08-28T10:01:49Z) - Acceleration of Convolutional Neural Network Using FFT-Based Split Convolutions [11.031841470875571]
Convolutional neural networks (CNNs) have a large number of parameters and therefore suffer from high implementation complexity.
Recent studies on Fast Fourier Transform (FFT)-based CNNs aim at simplifying the computations required for the FFT.
In this paper, a new method for CNN processing in the FFT domain is proposed, which is based on input splitting.
arXiv Detail & Related papers (2020-03-27T20:16:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.