Prompt Public Large Language Models to Synthesize Data for Private On-device Applications
- URL: http://arxiv.org/abs/2404.04360v2
- Date: Wed, 7 Aug 2024 03:36:51 GMT
- Title: Prompt Public Large Language Models to Synthesize Data for Private On-device Applications
- Authors: Shanshan Wu, Zheng Xu, Yanxiang Zhang, Yuanbo Zhang, Daniel Ramage
- Abstract summary: This paper investigates how large language models (LLMs) trained on public data can improve the quality of pre-training data for the on-device language models trained with DP and FL.
The model pre-trained on our synthetic dataset achieves relative improvements of 19.0% and 22.8% in next word prediction accuracy over a baseline pre-trained on a standard public dataset.
Our experiments demonstrate the strengths of LLMs in synthesizing data close to the private distribution even without accessing the private data.
- Score: 5.713077600587505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-training on public data is an effective method to improve the performance of federated learning (FL) with differential privacy (DP). This paper investigates how large language models (LLMs) trained on public data can improve the quality of pre-training data for the on-device language models trained with DP and FL. We carefully design LLM prompts to filter and transform existing public data, and generate new data to resemble the real user data distribution. The model pre-trained on our synthetic dataset achieves relative improvements of 19.0% and 22.8% in next word prediction accuracy compared to the baseline model pre-trained on a standard public dataset, when evaluated over the real user data in Gboard (Google Keyboard, a production mobile keyboard application). Furthermore, our method achieves evaluation accuracy better than or comparable to the baseline during the DP FL fine-tuning over millions of mobile devices, and our final model outperforms the baseline in production A/B testing. Our experiments demonstrate the strengths of LLMs in synthesizing data close to the private distribution even without accessing the private data, and also suggest future research directions to further reduce the distribution gap.
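As a rough illustration of the recipe above, the sketch below prompts a public LLM either to keep public examples that already look like mobile-keyboard text or to rewrite them into that style. The prompts and the `ask_llm` callable are illustrative placeholders, not the prompts or models used in the paper.

```python
# Minimal sketch: use a public LLM to filter public text for keyboard-style
# examples and to rewrite the rest into that style, producing a synthetic
# pre-training corpus. `ask_llm` is any completion/chat API wrapped as a
# function str -> str; the prompts are assumptions for illustration.
from typing import Callable, Iterable, List

FILTER_PROMPT = (
    "Does the following text resemble something a person would type on a "
    "mobile keyboard (short, informal, conversational)? Answer YES or NO.\n\n{text}"
)
REWRITE_PROMPT = (
    "Rewrite the following text as a short, informal message someone might "
    "type on a phone keyboard:\n\n{text}"
)

def synthesize_corpus(public_texts: Iterable[str],
                      ask_llm: Callable[[str], str]) -> List[str]:
    """Filter and transform public text into keyboard-style synthetic data."""
    synthetic = []
    for text in public_texts:
        verdict = ask_llm(FILTER_PROMPT.format(text=text)).strip().upper()
        if verdict.startswith("YES"):
            synthetic.append(text)  # already close to the target style: keep
        else:
            synthetic.append(ask_llm(REWRITE_PROMPT.format(text=text)))  # transform
    return synthetic
```

The key design point is that only public data and a public LLM are involved, so the synthetic corpus can be used for pre-training before any DP FL fine-tuning touches user data.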
Related papers
- Privacy-Preserving Customer Churn Prediction Model in the Context of Telecommunication Industry [1.0428401220897083]
We propose a framework for privacy-preserving customer churn prediction in the cloud environment.
The approach combines Generative Adversarial Networks (GANs) with adaptive Weight-of-Evidence (aWOE).
arXiv Detail & Related papers (2024-11-03T06:08:59Z) - MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models [16.654859430784825]
We introduce model-aware data selection with data influence models (MATES).
We fine-tune a small data influence model to approximate oracle data preference signals collected by locally probing the pretraining model.
Experiments on Pythia and the C4 dataset demonstrate that MATES significantly outperforms random data selection on extensive downstream tasks.
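A minimal sketch of the selection step described in the MATES entry above: a small data influence model scores candidate pre-training examples, and the highest-scoring ones are kept. The feature representation and the regressor are stand-ins, not the paper's design.

```python
# Sketch: rank candidate pre-training examples by the output of a small
# "data influence model" and keep the top k. The influence model here is any
# fitted regressor with a .predict(features) method (an assumption for
# illustration, not MATES's exact architecture).
import numpy as np

def select_top_k(candidate_features: np.ndarray, influence_model, k: int) -> np.ndarray:
    """Return indices of the k candidates with the highest predicted influence."""
    scores = influence_model.predict(candidate_features)  # predicted usefulness
    return np.argsort(scores)[::-1][:k]

# Possible usage with a tiny regressor fit on probed (features, influence) pairs:
#   from sklearn.linear_model import Ridge
#   influence_model = Ridge().fit(probe_features, probe_influence)
#   chosen = select_top_k(all_features, influence_model, k=10_000)
```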
arXiv Detail & Related papers (2024-06-10T06:27:42Z) - Aligning Large Language Models with Self-generated Preference Data [72.99676237703099]
We propose a new framework that boosts the alignment of large language models (LLMs) with human preferences.
Our key idea is to leverage the human prior knowledge within the small (seed) data.
We introduce a noise-aware preference learning algorithm to mitigate the risk of low quality within generated preference data.
arXiv Detail & Related papers (2024-06-06T18:01:02Z) - How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z) - Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN).
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play for reaching human-level performance in LLMs without the need for expert opponents.
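A minimal sketch of the self-play loop described above: in each round the current model generates its own responses to prompts from the human (SFT) data, and the model is then updated to prefer the human response over its own. `generate` and `preference_update` are placeholders for a decoding routine and a DPO-style training step; this is a sketch of the idea, not the paper's exact objective.

```python
# Sketch of self-play fine-tuning: the "opponent" is simply the current model.
def self_play_round(model, sft_pairs, generate, preference_update):
    """Build (prompt, human response, self-generated response) triples and update."""
    triples = []
    for prompt, human_response in sft_pairs:
        self_response = generate(model, prompt)        # play against itself
        triples.append((prompt, human_response, self_response))
    return preference_update(model, triples)           # prefer human over self

def self_play_finetune(model, sft_pairs, generate, preference_update, rounds=3):
    for _ in range(rounds):
        model = self_play_round(model, sft_pairs, generate, preference_update)
    return model
```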
arXiv Detail & Related papers (2024-01-02T18:53:13Z) - Harnessing large-language models to generate private synthetic text [18.863579044812703]
Differentially private training algorithms like DP-SGD protect sensitive training data by ensuring that trained models do not reveal private information.
This paper studies an alternative approach: first generating synthetic data that is differentially private with respect to the original data, and then non-privately training a model on that synthetic data.
We find that generating private synthetic data is much harder than training a private model.
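A minimal sketch of the two-stage recipe in this entry: fine-tune a generator on the private corpus with a DP algorithm such as DP-SGD, sample a synthetic corpus from it, and train the downstream model on that corpus without DP. `dp_finetune`, `sample`, and `train` are placeholders; by the post-processing property of DP, the synthetic corpus, and anything trained on it, inherits the privacy guarantee of the first step.

```python
# Sketch of the "DP generator, then non-private training" pipeline.
# dp_finetune: trains the generator with DP-SGD (or another DP algorithm).
# sample:      draws one synthetic example from the DP-trained generator.
# train:       ordinary, non-private training on the synthetic corpus.
def private_synthetic_pipeline(generator, private_corpus, downstream_model,
                               dp_finetune, sample, train, n_synthetic=100_000):
    dp_generator = dp_finetune(generator, private_corpus)       # DP budget spent here only
    synthetic_corpus = [sample(dp_generator) for _ in range(n_synthetic)]
    return train(downstream_model, synthetic_corpus)             # post-processing: no extra DP cost
```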
arXiv Detail & Related papers (2023-06-02T16:59:36Z) - Can Public Large Language Models Help Private Cross-device Federated Learning? [58.05449579773249]
We study (differentially) private federated learning (FL) of language models.
Public data has been used to improve privacy-utility trade-offs for both large and small language models.
We propose a novel distribution matching algorithm with theoretical grounding to sample public data close to private data distribution.
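The distribution matching algorithm itself is not spelled out in this summary; as a generic illustration only (not the paper's method), the sketch below estimates the mean embedding of the private data under Gaussian noise and ranks public examples by cosine similarity to that noisy mean.

```python
# Generic illustration: pick public examples whose embeddings are closest to a
# differentially private estimate of the private data's mean embedding.
# The clipping norm and noise scale are arbitrary illustrative values.
import numpy as np

def dp_mean_embedding(private_emb: np.ndarray, clip: float, noise_std: float) -> np.ndarray:
    rng = np.random.default_rng()
    norms = np.linalg.norm(private_emb, axis=1, keepdims=True)
    clipped = private_emb * np.minimum(1.0, clip / np.maximum(norms, 1e-12))  # bound sensitivity
    return clipped.mean(axis=0) + rng.normal(0.0, noise_std, size=private_emb.shape[1])

def select_public_close_to_private(public_emb: np.ndarray, private_emb: np.ndarray,
                                   k: int, clip: float = 1.0, noise_std: float = 0.1) -> np.ndarray:
    target = dp_mean_embedding(private_emb, clip, noise_std)
    sims = public_emb @ target / (
        np.linalg.norm(public_emb, axis=1) * np.linalg.norm(target) + 1e-12)
    return np.argsort(sims)[::-1][:k]                  # indices of closest public examples
```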
arXiv Detail & Related papers (2023-05-20T07:55:58Z) - FedPDC: Federated Learning for Public Dataset Correction [1.5533842336139065]
Federated learning has lower classification accuracy than traditional machine learning in Non-IID scenarios.
A new algorithm, FedPDC, is proposed to optimize how local models are aggregated and the loss function used in local training.
In many benchmark experiments, FedPDC can effectively improve the accuracy of the global model in the case of extremely unbalanced data distribution.
arXiv Detail & Related papers (2023-02-24T08:09:23Z) - Self-Supervised Pre-Training for Transformer-Based Person Re-Identification [54.55281692768765]
Transformer-based supervised pre-training achieves great performance in person re-identification (ReID).
Due to the domain gap between ImageNet and ReID datasets, it usually requires a larger pre-training dataset to boost performance.
This work aims to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure.
arXiv Detail & Related papers (2021-11-23T18:59:08Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To reduce the computational cost of training on the enlarged dataset, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.