Towards Federated Foundation Models: Scalable Dataset Pipelines for
Group-Structured Learning
- URL: http://arxiv.org/abs/2307.09619v2
- Date: Fri, 22 Dec 2023 02:14:19 GMT
- Title: Towards Federated Foundation Models: Scalable Dataset Pipelines for
Group-Structured Learning
- Authors: Zachary Charles, Nicole Mitchell, Krishna Pillutla, Michael Reneer,
Zachary Garrett
- Abstract summary: We introduce Dataset Grouper, a library to create large-scale group-structured datasets.
It enables federated learning simulation at the scale of foundation models.
- Score: 11.205441416962284
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Dataset Grouper, a library to create large-scale
group-structured (e.g., federated) datasets, enabling federated learning
simulation at the scale of foundation models. This library facilitates the
creation of group-structured versions of existing datasets based on
user-specified partitions and directly leads to a variety of useful
heterogeneous datasets that can be plugged into existing software frameworks.
Dataset Grouper offers three key advantages. First, it scales to settings where
even a single group's dataset is too large to fit in memory. Second, it
provides flexibility, both in choosing the base (non-partitioned) dataset and
in defining partitions. Finally, it is framework-agnostic. We empirically
demonstrate that Dataset Grouper enables large-scale federated language
modeling simulations on datasets that are orders of magnitude larger than in
previous work, allowing for federated training of language models with hundreds
of millions, and even billions, of parameters. Our experimental results show
that algorithms like FedAvg operate more as meta-learning methods than as
empirical risk minimization methods at this scale, suggesting their utility in
downstream personalization and task-specific adaptation. Dataset Grouper is
available at https://github.com/google-research/dataset_grouper.
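To make the partitioning workflow concrete, the sketch below shows the general idea in plain Python: stream an existing dataset, route each example to a group via a user-specified key function, and append it to that group's shard on disk, so that no single group's data ever has to fit in memory. The helper name `partition_by_group`, its arguments, and the JSON-lines file layout are illustrative assumptions rather than Dataset Grouper's actual API; see the repository linked above for the real interface.

```python
# Minimal sketch of group-structured partitioning (hypothetical helper,
# not the Dataset Grouper API). Examples are streamed and appended to a
# per-group shard, so no group's data has to fit in memory at once.
import json
import os
from typing import Callable, Dict, Iterable, TextIO


def partition_by_group(
    examples: Iterable[dict],
    group_key: Callable[[dict], str],
    output_dir: str,
) -> Dict[str, int]:
    """Writes one JSON-lines shard per group; returns per-group example counts."""
    os.makedirs(output_dir, exist_ok=True)
    shards: Dict[str, TextIO] = {}
    counts: Dict[str, int] = {}
    try:
        for example in examples:
            group = group_key(example)
            if group not in shards:
                shards[group] = open(
                    os.path.join(output_dir, f"{group}.jsonl"), "a", encoding="utf-8"
                )
            shards[group].write(json.dumps(example) + "\n")
            counts[group] = counts.get(group, 0) + 1
    finally:
        for handle in shards.values():
            handle.close()
    return counts


if __name__ == "__main__":
    # Toy example: treat each author as one federated "client" group.
    docs = [
        {"author": "alice", "text": "first document"},
        {"author": "bob", "text": "second document"},
        {"author": "alice", "text": "third document"},
    ]
    print(partition_by_group(docs, lambda d: d["author"], "/tmp/groups"))
```

A production pipeline would additionally bound the number of simultaneously open shard files and serialize into a format expected by the downstream training framework; both are omitted here for brevity.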
Related papers
- Adapt-$\infty$: Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection [89.42023974249122]
Adapt-$\infty$ is a new multi-way and adaptive data selection approach for Lifelong Instruction Tuning.
We construct pseudo-skill clusters by grouping gradient-based sample vectors.
We select the best-performing data selector for each skill cluster from a pool of selector experts.
arXiv Detail & Related papers (2024-10-14T15:48:09Z)
- FedLLM-Bench: Realistic Benchmarks for Federated Learning of Large Language Models [48.484485609995986]
Federated learning has enabled multiple parties to collaboratively train large language models without directly sharing their data (FedLLM).
There are currently no realistic datasets and benchmarks for FedLLM.
We propose FedLLM-Bench, which involves 8 training methods, 4 training datasets, and 6 evaluation metrics.
arXiv Detail & Related papers (2024-06-07T11:19:30Z)
- Better Synthetic Data by Retrieving and Transforming Existing Datasets [63.875064274379824]
We introduce DataTune, a method to make better use of publicly available datasets to improve automatic dataset generation.
On a diverse set of language-based tasks, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49%.
We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks.
arXiv Detail & Related papers (2024-04-22T17:15:32Z)
- Towards More Practical Group Activity Detection: A New Benchmark and Model [61.39427407758131]
Group activity detection (GAD) is the task of simultaneously identifying the members of each group and classifying the group's activity in a video.
We present a new dataset, dubbed Café, which offers more practical scenarios and metrics.
We also propose a new GAD model that deals with an unknown number of groups and latent group members efficiently and effectively.
arXiv Detail & Related papers (2023-12-05T16:48:17Z)
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- Combining datasets to increase the number of samples and improve model fitting [7.4771091238795595]
We propose a novel framework called Combine datasets based on Imputation (ComImp).
In addition, we propose a variant of ComImp that uses Principal Component Analysis (PCA), called PCA-ComImp, to reduce dimensionality before combining datasets.
Our results indicate that the proposed methods are somewhat similar to transfer learning in that the merge can significantly improve the accuracy of a prediction model on smaller datasets.
arXiv Detail & Related papers (2022-10-11T06:06:37Z)
- Parsing with Pretrained Language Models, Multiple Datasets, and Dataset Embeddings [13.097523786733872]
We compare two methods to embed datasets in a transformer-based multilingual dependency parser.
We confirm that performance increases are highest for small datasets and datasets with a low baseline score.
We show that training on the combination of all datasets performs similarly to designing smaller clusters based on language-relatedness.
arXiv Detail & Related papers (2021-12-07T10:47:07Z)
- Single-dataset Experts for Multi-dataset Question Answering [6.092171111087768]
We train a network on multiple datasets to generalize and transfer better to new datasets.
Our approach is to model multi-dataset question answering with a collection of single-dataset experts.
Simple methods based on parameter-averaging lead to better zero-shot generalization and few-shot transfer performance; a minimal sketch of this averaging step is given after this list.
arXiv Detail & Related papers (2021-09-28T17:08:22Z)
- Learning Multi-Attention Context Graph for Group-Based Re-Identification [214.84551361855443]
Learning to re-identify or retrieve a group of people across non-overlapped camera systems has important applications in video surveillance.
In this work, we consider employing context information for identifying groups of people, i.e., group re-id.
We propose a novel unified framework based on graph neural networks to simultaneously address the group-based re-id tasks.
arXiv Detail & Related papers (2021-04-29T09:57:47Z)
- Cross-Dataset Collaborative Learning for Semantic Segmentation [17.55660581677053]
We present a simple, flexible, and general method for semantic segmentation, termed Cross-Dataset Collaborative Learning (CDCL).
Given multiple labeled datasets, we aim to improve the generalization and discrimination of feature representations on each dataset.
We conduct extensive evaluations on four diverse datasets, i.e., Cityscapes, BDD100K, CamVid, and COCO Stuff, with single-dataset and cross-dataset settings.
arXiv Detail & Related papers (2021-03-21T09:59:47Z)
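To make the parameter-averaging idea in the single-dataset experts entry concrete (the same aggregation step sits at the heart of FedAvg, discussed in the abstract above), here is a minimal sketch assuming all experts share an identical parameter layout; the function name and toy values are illustrative and not taken from any of the listed papers.

```python
# Illustrative parameter averaging across "expert" (or client) models that
# share the same parameter names and shapes. Names and toy values are
# assumptions for the example, not drawn from the papers above.
from typing import Dict, List, Optional

import numpy as np


def average_parameters(
    expert_params: List[Dict[str, np.ndarray]],
    weights: Optional[List[float]] = None,
) -> Dict[str, np.ndarray]:
    """Returns a (weighted) average of identically structured parameter dicts."""
    if weights is None:
        weights = [1.0 / len(expert_params)] * len(expert_params)
    return {
        name: sum(w * params[name] for w, params in zip(weights, expert_params))
        for name in expert_params[0]
    }


if __name__ == "__main__":
    # Two toy experts with a single 2x2 weight matrix each.
    expert_a = {"dense/kernel": np.ones((2, 2))}
    expert_b = {"dense/kernel": np.full((2, 2), 3.0)}
    print(average_parameters([expert_a, expert_b])["dense/kernel"])  # all 2.0
```

In a FedAvg round, the same operation would be applied to client model updates before the next round of local training.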
This list is automatically generated from the titles and abstracts of the papers indexed on this site. The site does not guarantee the accuracy of this information and is not responsible for any consequences arising from its use.