Scaling Federated Learning for Fine-tuning of Large Language Models
- URL: http://arxiv.org/abs/2102.00875v1
- Date: Mon, 1 Feb 2021 14:31:39 GMT
- Title: Scaling Federated Learning for Fine-tuning of Large Language Models
- Authors: Agrin Hilmkil and Sebastian Callh and Matteo Barbieri and Leon René Sütfeld and Edvin Listo Zec and Olof Mogren
- Abstract summary: Federated learning (FL) is a promising approach to distributed compute, as well as distributed data, and provides a level of privacy and compliance to legal frameworks.
In this paper, we explore the fine-tuning of Transformer-based language models in a federated learning setting.
We perform an extensive sweep over the number of clients, ranging up to 32, to evaluate the impact of distributed compute on task performance.
- Score: 0.5405981353784006
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Federated learning (FL) is a promising approach to distributed compute, as
well as distributed data, and provides a level of privacy and compliance to
legal frameworks. This makes FL attractive for both consumer and healthcare
applications. While the area is actively being explored, few studies have
examined FL in the context of larger language models and there is a lack of
comprehensive reviews of robustness across tasks, architectures, numbers of
clients, and other relevant factors. In this paper, we explore the fine-tuning
of Transformer-based language models in a federated learning setting. We
evaluate three popular BERT-variants of different sizes (BERT, ALBERT, and
DistilBERT) on a number of text classification tasks such as sentiment analysis
and author identification. We perform an extensive sweep over the number of
clients, ranging up to 32, to evaluate the impact of distributed compute on
task performance in the federated averaging setting. While our findings suggest
that the large sizes of the evaluated models are not generally prohibitive to
federated training, we found that the different models handle federated
averaging to a varying degree. Most notably, DistilBERT converges significantly
slower with larger numbers of clients, and under some circumstances, even
collapses to chance level performance. Investigating this issue presents an
interesting perspective for future research.
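For readers unfamiliar with the federated averaging setting evaluated in the paper, the following minimal sketch illustrates one FedAvg round for fine-tuning a classification model. It assumes a generic PyTorch model whose data loaders yield (inputs, labels) batches; the optimizer, learning rate, and local-epoch settings are illustrative and not the authors' exact configuration.

```python
import copy
import torch

def fedavg_round(global_model, client_loaders, local_epochs=1, lr=2e-5):
    """One round of federated averaging (FedAvg) for fine-tuning.

    Each client copies the current global model, fine-tunes it on its
    local data, and the server averages the resulting weights, weighted
    by client dataset size as in standard FedAvg.
    """
    client_states, client_sizes = [], []
    for loader in client_loaders:
        local_model = copy.deepcopy(global_model)
        optimizer = torch.optim.AdamW(local_model.parameters(), lr=lr)
        local_model.train()
        for _ in range(local_epochs):
            for inputs, labels in loader:
                optimizer.zero_grad()
                logits = local_model(inputs)
                loss = torch.nn.functional.cross_entropy(logits, labels)
                loss.backward()
                optimizer.step()
        client_states.append(local_model.state_dict())
        client_sizes.append(len(loader.dataset))

    # Server aggregation: size-weighted average of the client weights.
    total = float(sum(client_sizes))
    new_state = {
        key: sum((n / total) * state[key].float()
                 for state, n in zip(client_states, client_sizes))
        for key in client_states[0]
    }
    global_model.load_state_dict(new_state)
    return global_model
```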
Related papers
- Scalable Vertical Federated Learning via Data Augmentation and Amortized Inference [1.912429179274357]
This paper introduces the first comprehensive framework for fitting Bayesian models in the Vertical Federated Learning setting.
We present an innovative model formulation for specific VFL scenarios where the joint likelihood factorizes into a product of client-specific likelihoods.
Our work paves the way for privacy-preserving, decentralized Bayesian inference in vertically partitioned data scenarios.
arXiv Detail & Related papers (2024-05-07T06:29:06Z)
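As an aside on the factorization assumption mentioned in the entry above: when the joint likelihood factorizes into a product of client-specific likelihoods, the global log-likelihood becomes a sum of terms that each client can compute locally. The sketch below only illustrates that identity under the stated assumption; it is not the paper's actual Bayesian model or inference procedure.

```python
def global_log_likelihood(client_log_liks):
    """Under the factorization p(y | x) = prod_k p_k(y | x_k), the global
    log-likelihood is the sum of client-specific log-likelihoods, each of
    which a client can evaluate locally on its own vertical feature
    partition x_k without sharing raw features."""
    return sum(client_log_liks)

# Hypothetical per-client log-likelihoods of the shared labels.
print(global_log_likelihood([-120.4, -98.7, -143.2]))  # approx. -362.3
```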
- Exploiting Label Skews in Federated Learning with Model Concatenation [39.38427550571378]
Federated Learning (FL) has emerged as a promising solution to perform deep learning on different data owners without exchanging raw data.
Among different non-IID types, label skews have been challenging and common in image classification and other tasks.
We propose FedConcat, a simple and effective approach that concatenates these local models as the base of the global model.
arXiv Detail & Related papers (2023-12-11T10:44:52Z)
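To make the concatenation idea in the FedConcat entry above concrete, here is a rough sketch that builds a global model by concatenating the features of locally trained encoders and training a shared classifier on top. It assumes each encoder outputs features of the same dimensionality; FedConcat's client clustering step and training schedule are not reproduced.

```python
import torch
import torch.nn as nn

class ConcatenatedModel(nn.Module):
    """Rough sketch: a global model formed by concatenating locally
    trained feature extractors and learning a shared classifier on top."""

    def __init__(self, local_encoders, feature_dim, num_classes):
        super().__init__()
        self.encoders = nn.ModuleList(local_encoders)
        for p in self.encoders.parameters():
            p.requires_grad = False  # keep the local bases frozen
        self.head = nn.Linear(feature_dim * len(local_encoders), num_classes)

    def forward(self, x):
        feats = [enc(x) for enc in self.encoders]   # one feature block per local model
        return self.head(torch.cat(feats, dim=-1))  # classify on the concatenation
```

Here, local_encoders would be the bodies of the client models with their classification heads removed.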
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- FedSampling: A Better Sampling Strategy for Federated Learning [81.85411484302952]
Federated learning (FL) is an important technique for learning models from decentralized data in a privacy-preserving way.
Existing FL methods usually uniformly sample clients for local model learning in each round.
We propose a novel data uniform sampling strategy for federated learning (FedSampling).
arXiv Detail & Related papers (2023-06-25T13:38:51Z)
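The contrast drawn in the FedSampling entry above, uniform sampling of clients versus uniform sampling of data, can be pictured with a short sketch: each client keeps each local example with a probability derived from a global budget, so every example in the federation is selected with roughly the same probability. The privacy-preserving estimation of the total data size used in the paper is omitted here; est_total_size is simply assumed to be given.

```python
import random

def fedsampling_local_batch(local_examples, round_budget, est_total_size):
    """Sketch of uniform *data* sampling: keep each local example with
    probability round_budget / est_total_size, where est_total_size is an
    estimate of the total number of examples across all clients."""
    keep_prob = min(1.0, round_budget / est_total_size)
    return [ex for ex in local_examples if random.random() < keep_prob]
```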
- Confidence-aware Personalized Federated Learning via Variational Expectation Maximization [34.354154518009956]
Personalized Federated Learning (PFL) is a distributed learning scheme to train a shared model across clients.
We present a novel framework for PFL based on hierarchical modeling and variational inference.
arXiv Detail & Related papers (2023-05-21T20:12:27Z)
- When Do Curricula Work in Federated Learning? [56.88941905240137]
We find that curriculum learning largely alleviates non-IIDness.
The more disparate the data distributions across clients, the more they benefit from curriculum learning.
We propose a novel client selection technique that benefits from the real-world disparity in the clients.
arXiv Detail & Related papers (2022-12-24T11:02:35Z)
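One common way to realize a curriculum like the one studied above is to order each client's local examples from easy to hard using the current global model's per-example loss, as in the sketch below. The scoring and pacing functions in the paper may differ; this is only a generic illustration assuming tensor-valued examples and labels.

```python
import torch

def order_by_difficulty(model, examples, labels):
    """Order a client's local examples from easy to hard, scoring
    difficulty by the current global model's per-example loss."""
    model.eval()
    with torch.no_grad():
        losses = torch.nn.functional.cross_entropy(
            model(examples), labels, reduction="none"
        )
    order = torch.argsort(losses)  # easiest (lowest loss) first
    return examples[order], labels[order]
```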
- No One Left Behind: Inclusive Federated Learning over Heterogeneous Devices [79.16481453598266]
We propose InclusiveFL, a client-inclusive federated learning method to handle device heterogeneity.
The core idea of InclusiveFL is to assign models of different sizes to clients with different computing capabilities.
We also propose an effective method to share the knowledge among multiple local models with different sizes.
arXiv Detail & Related papers (2022-02-16T13:03:27Z)
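Assigning differently sized models to differently capable clients, as in the InclusiveFL entry above, implies that aggregation can no longer average every parameter over every client. A minimal sketch of one plausible aggregation rule is shown below: each parameter is averaged only over the clients whose model contains it. The paper's additional mechanism for sharing knowledge across model sizes is not shown.

```python
def aggregate_heterogeneous(client_states, client_sizes):
    """Aggregate local models of different depths: each parameter is
    averaged only over the clients whose (smaller or larger) model
    actually contains it, weighted by local data size."""
    all_keys = set().union(*(state.keys() for state in client_states))
    new_state = {}
    for key in all_keys:
        holders = [(s[key], n) for s, n in zip(client_states, client_sizes) if key in s]
        total = float(sum(n for _, n in holders))
        new_state[key] = sum((n / total) * tensor.float() for tensor, n in holders)
    return new_state
```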
- Splitfed learning without client-side synchronization: Analyzing client-side split network portion size to overall performance [4.689140226545214]
Federated Learning (FL), Split Learning (SL), and SplitFed Learning (SFL) are three recent developments in distributed machine learning.
This paper studies SFL without client-side model synchronization.
It provides only 1%-2% better accuracy than Multi-head Split Learning on the MNIST test set.
arXiv Detail & Related papers (2021-09-19T22:57:23Z)
- Unifying Distillation with Personalization in Federated Learning [1.8262547855491458]
Federated learning (FL) is a decentralized privacy-preserving learning technique in which clients learn a joint collaborative model through a central aggregator without sharing their data.
In this setting, all clients learn a single common predictor (FedAvg), which does not generalize well on each client's local data due to the statistical data heterogeneity among clients.
In this paper, we address this problem with PersFL, a two-stage personalized learning algorithm.
In the first stage, PersFL finds the optimal teacher model of each client during the FL training phase. In the second stage, PersFL distills the useful knowledge from these optimal teachers into each client's local model.
arXiv Detail & Related papers (2021-05-31T17:54:29Z)
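PersFL's second stage is a knowledge-distillation step, so a standard distillation objective is a reasonable way to picture it: each client's local (student) model is trained on its own labels while matching the softened predictions of the teacher selected for it in stage one. The temperature and mixing weight below are illustrative values, not the paper's.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Standard knowledge-distillation objective: cross-entropy on the
    client's own labels plus a KL term matching the teacher's softened
    predictions."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * hard + (1.0 - alpha) * soft
```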
- WAFFLe: Weight Anonymized Factorization for Federated Learning [88.44939168851721]
In domains where data are sensitive or private, there is great value in methods that can learn in a distributed manner without the data ever leaving the local devices.
We propose Weight Anonymized Factorization for Federated Learning (WAFFLe), an approach that combines the Indian Buffet Process with a shared dictionary of weight factors for neural networks.
arXiv Detail & Related papers (2020-08-13T04:26:31Z)
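WAFFLe's weight factorization can be pictured as assembling each client's layer weights from a dictionary of factors shared across the federation, selected by a per-client binary vector. The sketch below shows only that assembly step with a given selection vector; in the paper the selections follow an Indian Buffet Process prior and are inferred, which is not reproduced here.

```python
import torch

def client_layer_weights(factor_dict, z):
    """Assemble one layer's weight matrix for a client as a combination
    of factors from a shared dictionary, selected by a binary vector z.

    factor_dict: (K, d_out, d_in) shared dictionary of weight factors
    z:           (K,) binary indicators for this client
    """
    return torch.einsum("k,kij->ij", z.float(), factor_dict)

# Hypothetical usage with a dictionary of 8 factors for a 16x32 layer.
factors = torch.randn(8, 16, 32)
z = torch.tensor([1, 0, 1, 1, 0, 0, 1, 0])
W = client_layer_weights(factors, z)  # shape (16, 32)
```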
- Parameter Space Factorization for Zero-Shot Learning across Tasks and Languages [112.65994041398481]
We propose a Bayesian generative model for the space of neural parameters.
We infer the posteriors over such latent variables based on data from seen task-language combinations.
Our model yields comparable or better results than state-of-the-art, zero-shot cross-lingual transfer methods.
arXiv Detail & Related papers (2020-01-30T16:58:56Z)