Block Circulant Adapter for Large Language Models
- URL: http://arxiv.org/abs/2505.00582v1
- Date: Thu, 01 May 2025 15:14:32 GMT
- Title: Block Circulant Adapter for Large Language Models
- Authors: Xinyu Ding, Meiqi Wang, Siyu Liao, Zhongfeng Wang
- Abstract summary: Fine-tuning large language models (LLMs) is difficult due to their huge model size. Recent Fourier domain-based methods show potential for reducing fine-tuning costs. We propose a block circulant matrix-based fine-tuning method with a stable training heuristic to leverage the properties of circulant matrices.
- Score: 10.353352027807272
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuning large language models (LLMs) is difficult due to their huge model size. Recent Fourier domain-based methods show potential for reducing fine-tuning costs. We propose a block circulant matrix-based fine-tuning method with a stable training heuristic, which leverages the properties of circulant matrices and one-dimensional Fourier transforms to reduce storage and computation costs. Experiments show that our method uses $14\times$ fewer parameters than VeRA, $16\times$ fewer than LoRA, and $32\times$ fewer FLOPs than FourierFT, while maintaining close or better task performance. Our approach presents a promising frequency-domain way to fine-tune large models on downstream tasks.
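The abstract does not spell out the computation, so here is a minimal NumPy sketch of the underlying idea (not the authors' implementation; the function names, shapes, and dense-matrix check are illustrative assumptions): each circulant block is stored only as its defining first column, and the block circulant matrix-vector product is carried out with 1D FFTs instead of a dense matrix multiply, which is where the storage and FLOP savings come from.

```python
import numpy as np

def block_circulant_matvec(C, x):
    """Apply a block circulant matrix to a vector using 1D FFTs.

    C: (p, q, b) array holding the first column of each circulant block,
       so only p*q*b values are stored instead of a dense (p*b) x (q*b) matrix.
    x: (q*b,) input vector.
    Returns the (p*b,) output vector.
    """
    p, q, b = C.shape
    X = np.fft.fft(x.reshape(q, b), axis=-1)            # FFT of each input block
    Cf = np.fft.fft(C, axis=-1)                          # FFT of each block's defining vector
    Y = (Cf * X[None, :, :]).sum(axis=1)                 # elementwise products, summed over input blocks
    return np.fft.ifft(Y, axis=-1).real.reshape(p * b)   # back to the original domain

def dense_circulant(c):
    """Dense b x b circulant matrix with first column c (reference only)."""
    b = len(c)
    return np.array([[c[(k - l) % b] for l in range(b)] for k in range(b)])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p, q, b = 2, 3, 8
    C = rng.standard_normal((p, q, b))
    x = rng.standard_normal(q * b)
    dense = np.block([[dense_circulant(C[i, j]) for j in range(q)] for i in range(p)])
    assert np.allclose(block_circulant_matvec(C, x), dense @ x)
```

Under this parameterization, storage drops from $O(pqb^2)$ for a dense update to $O(pqb)$, and the matrix-vector product from $O(pqb^2)$ to $O(pqb\log b)$ FLOPs, which is consistent with the abstract's claim of reduced storage and computation costs.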
Related papers
- Parameter-Efficient Fine-Tuning with Circulant and Diagonal Vectors [8.351342832510262]
We propose to further reduce the complexity by factorizing the weight as a product of interleaved circulant and diagonal matrices. Our method achieves similar or better performance across various tasks with far fewer floating-point operations (FLOPs) and trainable parameters.
arXiv Detail & Related papers (2025-05-01T15:11:46Z) - Sparse Matrix in Large Language Model Fine-tuning [1.9874264019909988]
We introduce a method for selecting sparse sub-matrices that aims to minimize the performance gap between PEFT and full fine-tuning.
In experiments, we demonstrate that our method consistently surpasses other PEFT baselines.
We also examine how the performance of LoRA and DoRA tends to plateau and decline as the number of trainable parameters increases.
arXiv Detail & Related papers (2024-05-24T13:12:14Z) - Parameter-Efficient Fine-Tuning with Discrete Fourier Transform [26.563344030824414]
Low-rank adaptation (LoRA) has recently gained much interest in fine-tuning foundation models.
We introduce FourierFT, which treats $\Delta W$ as a matrix in the spatial domain and learns only a small fraction of its spectral coefficients (a sketch of this parameterization appears after this list).
Our method shows comparable or better performance with fewer parameters than LoRA on various tasks.
arXiv Detail & Related papers (2024-05-05T17:15:24Z) - ReFT: Representation Finetuning for Language Models [74.51093640257892]
We develop a family of Representation Finetuning (ReFT) methods.
ReFTs operate on a frozen base model and learn task-specific interventions on hidden representations.
We showcase LoReFT on eight commonsense reasoning tasks, four arithmetic reasoning tasks, instruction-tuning, and GLUE.
arXiv Detail & Related papers (2024-04-04T17:00:37Z) - HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one-billion-parameter model, HiRE applied to both the softmax and feedforward layers achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.
arXiv Detail & Related papers (2024-02-14T18:04:36Z) - From PEFT to DEFT: Parameter Efficient Finetuning for Reducing Activation Density in Transformers [52.199303258423306]
We propose a novel density loss that encourages higher activation sparsity in pre-trained models.
Our proposed method, DEFT, can consistently reduce activation density by up to 44.94% on RoBERTa$_\mathrm{Large}$ and by 53.19% (encoder density) and 90.60% (decoder density) on Flan-T5$_\mathrm{XXL}$.
arXiv Detail & Related papers (2024-02-02T21:25:46Z) - FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference [57.119047493787185]
This paper shows how to reduce model size by 43.1% and bring a $1.25\sim1.56\times$ wall clock time speedup on different hardware with a negligible accuracy drop.
arXiv Detail & Related papers (2024-01-08T17:29:16Z) - MatFormer: Nested Transformer for Elastic Inference [91.45687988953435]
MatFormer is a novel Transformer architecture designed to provide elastic inference across diverse deployment constraints. MatFormer achieves this by incorporating a nested Feed Forward Network (FFN) block structure within a standard Transformer model. We show that an 850M decoder-only MatFormer language model (MatLM) allows us to extract multiple smaller models spanning from 582M to 850M parameters.
arXiv Detail & Related papers (2023-10-11T17:57:14Z) - Integrated Variational Fourier Features for Fast Spatial Modelling with Gaussian Processes [7.5991638205413325]
For $N$ training points, exact inference has $O(N^3)$ cost; with $M \ll N$ features, state-of-the-art sparse variational methods have $O(NM^2)$ cost.
Recently, methods have been proposed using more sophisticated features; these promise $O(M^3)$ cost, with good performance in low-dimensional tasks such as spatial modelling, but they only work with a very limited class of kernels, excluding some of the most commonly used.
In this work, we propose integrated Fourier features, which extend these performance benefits to a very broad class of stationary covariance functions.
arXiv Detail & Related papers (2023-08-27T15:44:28Z) - Fourier Transformer: Fast Long Range Modeling by Removing Sequence Redundancy with FFT Operator [24.690247474891958]
Fourier Transformer is able to significantly reduce computational costs while retaining the ability to inherit from various large pretrained models.
Our model achieves state-of-the-art performance among all transformer-based models on the long-range modeling benchmark LRA.
For generative seq-to-seq tasks including CNN/DailyMail and ELI5, our model outperforms the standard BART by inheriting the BART weights.
arXiv Detail & Related papers (2023-05-24T12:33:06Z) - Does Continual Learning Equally Forget All Parameters? [55.431048995662714]
Distribution shift (e.g., task or domain shift) in continual learning (CL) usually results in catastrophic forgetting of neural networks.
We study which modules in neural networks are more prone to forgetting by investigating their training dynamics during CL.
We propose a simpler and more efficient method that entirely removes the every-step replay and replaces it with FPF triggered only $k$ times periodically during CL.
arXiv Detail & Related papers (2023-04-09T04:36:24Z) - Learning Decorrelated Representations Efficiently Using Fast Fourier Transform [3.932322649674071]
We propose a relaxed decorrelating regularizer that can be computed in $O(nd \log d)$ time by the Fast Fourier Transform.
The proposed regularizer exhibits accuracy comparable to that of existing regularizers on downstream tasks, while requiring less memory and training faster for large $d$.
arXiv Detail & Related papers (2023-01-04T12:38:08Z)
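For contrast with the block circulant sketch above, the FourierFT entry in this list parameterizes the weight update $\Delta W$ by a small set of learned spectral coefficients. A minimal sketch of that idea follows (the function name, the random choice of frequency locations, and the scaling factor are illustrative assumptions, not the released FourierFT code):

```python
import numpy as np

def spectral_delta_w(coeffs, freq_idx, shape, scale=1.0):
    """Recover a dense weight update from a few learned spectral coefficients.

    coeffs:   (n,) real-valued trainable coefficients.
    freq_idx: (n, 2) fixed frequency-domain locations (chosen once, e.g. at random).
    shape:    (d1, d2) shape of the dense update Delta W.
    """
    spectrum = np.zeros(shape, dtype=np.complex128)
    spectrum[freq_idx[:, 0], freq_idx[:, 1]] = coeffs   # place coefficients in the 2D spectrum
    return scale * np.fft.ifft2(spectrum).real          # inverse 2D FFT gives the dense update

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d1, d2, n = 64, 64, 100                             # n << d1 * d2 trainable parameters
    freq_idx = np.stack([rng.integers(0, d1, n), rng.integers(0, d2, n)], axis=1)
    coeffs = rng.standard_normal(n)
    print(spectral_delta_w(coeffs, freq_idx, (d1, d2)).shape)  # (64, 64)
```

The block circulant adapter above instead works with one-dimensional transforms of short block vectors, which is consistent with its reported FLOP advantage over FourierFT.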