Towards Federated Foundation Models: Scalable Dataset Pipelines for
Group-Structured Learning
- URL: http://arxiv.org/abs/2307.09619v2
- Date: Fri, 22 Dec 2023 02:14:19 GMT
- Title: Towards Federated Foundation Models: Scalable Dataset Pipelines for
Group-Structured Learning
- Authors: Zachary Charles, Nicole Mitchell, Krishna Pillutla, Michael Reneer,
Zachary Garrett
- Abstract summary: We introduce Dataset Grouper, a library to create large-scale group-structured datasets.
It enables federated learning simulation at the scale of foundation models.
- Score: 11.205441416962284
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Dataset Grouper, a library to create large-scale
group-structured (e.g., federated) datasets, enabling federated learning
simulation at the scale of foundation models. This library facilitates the
creation of group-structured versions of existing datasets based on
user-specified partitions and directly leads to a variety of useful
heterogeneous datasets that can be plugged into existing software frameworks.
Dataset Grouper offers three key advantages. First, it scales to settings where
even a single group's dataset is too large to fit in memory. Second, it
provides flexibility, both in choosing the base (non-partitioned) dataset and
in defining partitions. Finally, it is framework-agnostic. We empirically
demonstrate that Dataset Grouper enables large-scale federated language
modeling simulations on datasets that are orders of magnitude larger than in
previous work, allowing for federated training of language models with hundreds
of millions, and even billions, of parameters. Our experimental results show
that algorithms like FedAvg operate more as meta-learning methods than as
empirical risk minimization methods at this scale, suggesting their utility in
downstream personalization and task-specific adaptation. Dataset Grouper is
available at https://github.com/google-research/dataset_grouper.
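To make the partitioning workflow concrete, the sketch below shows the general idea in plain Python: stream an existing dataset, route each example to a group via a user-specified key function, and append it to that group's shard on disk, so that no single group's data ever has to fit in memory. The helper name `partition_by_group`, its arguments, and the JSON-lines file layout are illustrative assumptions rather than Dataset Grouper's actual API; see the repository linked above for the real interface.

```python
# Minimal sketch of group-structured partitioning (hypothetical helper,
# not the Dataset Grouper API). Examples are streamed and appended to a
# per-group shard, so no group's data has to fit in memory at once.
import json
import os
from typing import Callable, Dict, Iterable, TextIO


def partition_by_group(
    examples: Iterable[dict],
    group_key: Callable[[dict], str],
    output_dir: str,
) -> Dict[str, int]:
    """Writes one JSON-lines shard per group; returns per-group example counts."""
    os.makedirs(output_dir, exist_ok=True)
    shards: Dict[str, TextIO] = {}
    counts: Dict[str, int] = {}
    try:
        for example in examples:
            group = group_key(example)
            if group not in shards:
                shards[group] = open(
                    os.path.join(output_dir, f"{group}.jsonl"), "a", encoding="utf-8"
                )
            shards[group].write(json.dumps(example) + "\n")
            counts[group] = counts.get(group, 0) + 1
    finally:
        for handle in shards.values():
            handle.close()
    return counts


if __name__ == "__main__":
    # Toy example: treat each author as one federated "client" group.
    docs = [
        {"author": "alice", "text": "first document"},
        {"author": "bob", "text": "second document"},
        {"author": "alice", "text": "third document"},
    ]
    print(partition_by_group(docs, lambda d: d["author"], "/tmp/groups"))
```

A production pipeline would additionally bound the number of simultaneously open shard files and serialize into a format expected by the downstream training framework; both are omitted here for brevity.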
Related papers
- Adapt-$\infty$: Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection [89.42023974249122]
Adapt-$\infty$ is a new multi-way and adaptive data selection approach for Lifelong Instruction Tuning.
We construct pseudo-skill clusters by grouping gradient-based sample vectors.
We select the best-performing data selector for each skill cluster from a pool of selector experts.
arXiv Detail & Related papers (2024-10-14T15:48:09Z)
- FedLLM-Bench: Realistic Benchmarks for Federated Learning of Large Language Models [48.484485609995986]
Federated learning has enabled multiple parties to collaboratively train large language models without directly sharing their data (FedLLM).
There are currently no realistic datasets and benchmarks for FedLLM.
We propose FedLLM-Bench, which involves 8 training methods, 4 training datasets, and 6 evaluation metrics.
arXiv Detail & Related papers (2024-06-07T11:19:30Z)
- Better Synthetic Data by Retrieving and Transforming Existing Datasets [63.875064274379824]
We introduce DataTune, a method to make better use of publicly available datasets to improve automatic dataset generation.
On a diverse set of language-based tasks, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49%.
We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks.
arXiv Detail & Related papers (2024-04-22T17:15:32Z)
- Towards More Practical Group Activity Detection: A New Benchmark and Model [61.39427407758131]
Group activity detection (GAD) is the task of simultaneously identifying the members of each group and classifying the group's activity in a video.
We present a new dataset, dubbed Café, which offers more practical scenarios and metrics.
We also propose a new GAD model that deals with an unknown number of groups and latent group members efficiently and effectively.
arXiv Detail & Related papers (2023-12-05T16:48:17Z)
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- Combining datasets to increase the number of samples and improve model fitting [7.4771091238795595]
We propose a novel framework called Combine datasets based on Imputation (ComImp).
In addition, we propose a variant of ComImp that uses Principal Component Analysis (PCA), called PCA-ComImp, to reduce dimensionality before combining datasets.
Our results indicate that the proposed methods are somewhat similar to transfer learning in that the merge can significantly improve the accuracy of a prediction model on smaller datasets.
arXiv Detail & Related papers (2022-10-11T06:06:37Z)
- Parsing with Pretrained Language Models, Multiple Datasets, and Dataset Embeddings [13.097523786733872]
We compare two methods to embed datasets in a transformer-based multilingual dependency parser.
We confirm that performance increases are highest for small datasets and datasets with a low baseline score.
We show that training on the combination of all datasets performs similarly to designing smaller clusters based on language-relatedness.
arXiv Detail & Related papers (2021-12-07T10:47:07Z)
- Single-dataset Experts for Multi-dataset Question Answering [6.092171111087768]
We train a network on multiple datasets to generalize and transfer better to new datasets.
Our approach is to model multi-dataset question answering with a collection of single-dataset experts.
Simple methods based on parameter-averaging lead to better zero-shot generalization and few-shot transfer performance; a minimal sketch of this averaging step is given after this list.
arXiv Detail & Related papers (2021-09-28T17:08:22Z)
- Learning Multi-Attention Context Graph for Group-Based Re-Identification [214.84551361855443]
Learning to re-identify or retrieve a group of people across non-overlapped camera systems has important applications in video surveillance.
In this work, we consider employing context information for identifying groups of people, i.e., group re-id.
We propose a novel unified framework based on graph neural networks to simultaneously address the group-based re-id tasks.
arXiv Detail & Related papers (2021-04-29T09:57:47Z)
- Cross-Dataset Collaborative Learning for Semantic Segmentation [17.55660581677053]
We present a simple, flexible, and general method for semantic segmentation, termed Cross-Dataset Collaborative Learning (CDCL).
Given multiple labeled datasets, we aim to improve the generalization and discrimination of feature representations on each dataset.
We conduct extensive evaluations on four diverse datasets, i.e., Cityscapes, BDD100K, CamVid, and COCO Stuff, with single-dataset and cross-dataset settings.
arXiv Detail & Related papers (2021-03-21T09:59:47Z)
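To make the parameter-averaging idea in the single-dataset experts entry concrete (the same aggregation step sits at the heart of FedAvg, discussed in the abstract above), here is a minimal sketch assuming all experts share an identical parameter layout; the function name and toy values are illustrative and not taken from any of the listed papers.

```python
# Illustrative parameter averaging across "expert" (or client) models that
# share the same parameter names and shapes. Names and toy values are
# assumptions for the example, not drawn from the papers above.
from typing import Dict, List, Optional

import numpy as np


def average_parameters(
    expert_params: List[Dict[str, np.ndarray]],
    weights: Optional[List[float]] = None,
) -> Dict[str, np.ndarray]:
    """Returns a (weighted) average of identically structured parameter dicts."""
    if weights is None:
        weights = [1.0 / len(expert_params)] * len(expert_params)
    return {
        name: sum(w * params[name] for w, params in zip(weights, expert_params))
        for name in expert_params[0]
    }


if __name__ == "__main__":
    # Two toy experts with a single 2x2 weight matrix each.
    expert_a = {"dense/kernel": np.ones((2, 2))}
    expert_b = {"dense/kernel": np.full((2, 2), 3.0)}
    print(average_parameters([expert_a, expert_b])["dense/kernel"])  # all 2.0
```

In a FedAvg round, the same operation would be applied to client model updates before the next round of local training.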
This list is automatically generated from the titles and abstracts of the papers indexed on this site. The site does not guarantee the accuracy of this information and is not responsible for any consequences arising from its use.