Leveraging Unstructured Text Data for Federated Instruction Tuning of Large Language Models
- URL: http://arxiv.org/abs/2409.07136v1
- Date: Wed, 11 Sep 2024 09:31:44 GMT
- Title: Leveraging Unstructured Text Data for Federated Instruction Tuning of Large Language Models
- Authors: Rui Ye, Rui Ge, Yuchi Fengting, Jingyi Chai, Yanfeng Wang, Siheng Chen
- Abstract summary: Federated instruction tuning enables multiple clients to collaboratively fine-tune a shared large language model (LLM).
Existing literature impractically requires that all clients readily hold instruction-tuning data.
We propose a novel framework, FedIT-U2S, which automatically transforms unstructured corpora into structured data for federated instruction tuning.
- Score: 45.139087558425395
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Federated instruction tuning enables multiple clients to collaboratively fine-tune a shared large language model (LLM) that can follow humans' instructions without directly sharing raw data. However, existing literature impractically requires that all the clients readily hold instruction-tuning data (i.e., structured instruction-response pairs), which necessitates massive human annotation since clients' data is usually unstructured text instead. Addressing this, we propose a novel and flexible framework, FedIT-U2S, which can automatically transform an unstructured corpus into structured data for federated instruction tuning. FedIT-U2S consists of two key steps: (1) few-shot instruction-tuning data generation, where each unstructured data piece is combined with several examples to prompt an LLM to generate an instruction-response pair. To further enhance flexibility, a retrieval-based example selection technique is proposed, in which examples are automatically selected based on the relatedness between the client's data piece and the example pool, bypassing the need to determine examples in advance. (2) A typical federated instruction tuning process based on the generated data. Overall, FedIT-U2S can be applied to diverse scenarios as long as the client holds a valuable text corpus, broadening the application scope of federated instruction tuning. We conduct a series of experiments on three domains (medicine, knowledge, and math), showing that our proposed FedIT-U2S consistently and significantly brings improvements over the base LLM.
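To make step (1) concrete, below is a minimal sketch of the generation pipeline described in the abstract: a retrieval encoder scores the example pool against the client's unstructured text piece, the top-k related examples are packed into a few-shot prompt, and an LLM completes the instruction-response pair. The encoder name, prompt template, and parsing logic are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of FedIT-U2S step (1): retrieval-based example selection
# followed by few-shot prompting. Model name, prompt template, and output
# parsing are assumptions for illustration, not the paper's exact recipe.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed retrieval encoder

def select_examples(text_piece, example_pool, k=3):
    """Retrieval-based example selection: pick the k pool examples most
    related to the client's unstructured text piece."""
    query = encoder.encode(text_piece, convert_to_tensor=True)
    docs = encoder.encode([ex["document"] for ex in example_pool],
                          convert_to_tensor=True)
    top_idx = util.cos_sim(query, docs)[0].topk(k).indices.tolist()
    return [example_pool[i] for i in top_idx]

def build_prompt(text_piece, examples):
    """Few-shot prompt: each example maps a document to an instruction-response pair."""
    parts = ["Write an instruction and a response grounded in the document.\n"]
    for ex in examples:
        parts.append(f"Document: {ex['document']}\n"
                     f"Instruction: {ex['instruction']}\n"
                     f"Response: {ex['response']}\n")
    parts.append(f"Document: {text_piece}\nInstruction:")
    return "\n".join(parts)

def generate_pair(text_piece, example_pool, llm_generate):
    """llm_generate: any callable that completes a text prompt (local model or API)."""
    prompt = build_prompt(text_piece, select_examples(text_piece, example_pool))
    completion = llm_generate(prompt)
    instruction, _, response = completion.partition("Response:")
    return {"instruction": instruction.strip(), "response": response.strip()}
```

Step (2) is then standard federated instruction tuning on the generated pairs, e.g., clients fine-tune the shared LLM (typically via parameter-efficient adapters) and a server aggregates the updates in FedAvg style; that part follows common practice rather than anything specific to this sketch.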
Related papers
- Federated Data-Efficient Instruction Tuning for Large Language Models [34.35613476734293]
FedHDS, a federated data-efficient instruction tuning approach for large language models, is presented.
It reduces the redundancy of data samples at both intra-client and inter-client levels.
Experiments show that FedHDS significantly reduces the amount of data required for fine-tuning while improving the responsiveness of the instruction-tuned LLMs to unseen tasks.
arXiv Detail & Related papers (2024-10-14T15:05:51Z)
- SRFUND: A Multi-Granularity Hierarchical Structure Reconstruction Benchmark in Form Understanding [55.48936731641802]
We present the SRFUND, a hierarchically structured multi-task form understanding benchmark.
SRFUND provides refined annotations on top of the original FUNSD and XFUND datasets.
The dataset covers eight languages: English, Chinese, Japanese, German, French, Spanish, Italian, and Portuguese.
arXiv Detail & Related papers (2024-06-13T02:35:55Z)
- Exploring Format Consistency for Instruction Tuning [79.0698403613366]
In this work, we propose a framework named Unified Instruction Tuning (UIT).
UIT calls OpenAI APIs for automatic format transfer among different instruction tuning datasets such as PromptSource, FLAN and CrossFit.
With the framework, we (1) demonstrate the necessity of maintaining format consistency in instruction tuning; (2) improve the generalization performance on unseen instructions for T5-LM-xl; and (3) provide a novel perplexity-based denoising method to reduce the noise introduced by automatic format transfer (a hedged sketch of such perplexity filtering follows this entry).
arXiv Detail & Related papers (2023-07-28T12:00:13Z)
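The denoising idea above amounts to scoring each automatically reformatted instance with a reference language model and discarding the implausible ones. The sketch below assumes a small Hugging Face causal LM and a fixed perplexity threshold; both choices are illustrative, not UIT's exact setup.

```python
# Hedged sketch of perplexity-based denoising: instances whose text a reference
# LM finds surprising (high perplexity) are treated as noisy format transfers
# and dropped. Model and threshold are assumptions for illustration.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    loss = lm(input_ids=ids, labels=ids).loss  # mean token cross-entropy
    return math.exp(loss.item())

def denoise(texts, max_ppl=200.0):
    """Keep only instances that read as fluent text to the reference LM."""
    return [t for t in texts if perplexity(t) <= max_ppl]
```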
- Faithful Low-Resource Data-to-Text Generation through Cycle Training [14.375070014155817]
Methods to generate text from structured data have advanced significantly in recent years.
Cycle training uses two models that are inverses of each other: one generates text from structured data, and the other generates structured data from text.
We show that cycle training achieves nearly the same performance as fully supervised approaches.
arXiv Detail & Related papers (2023-05-24T06:44:42Z)
- StructGPT: A General Framework for Large Language Model to Reason over Structured Data [117.13986738340027]
We develop an Iterative Reading-then-Reasoning (IRR) approach for solving question answering tasks based on structured data.
Our approach can significantly boost the performance of ChatGPT and achieve comparable performance against the full-data supervised-tuning baselines.
arXiv Detail & Related papers (2023-05-16T17:45:23Z)
- Unified Text Structuralization with Instruction-tuned Language Models [28.869098023025753]
We propose a simple and efficient approach that instructs a large language model (LLM) to extract a variety of structures from texts.
Experiments show that this approach enables language models to perform comparably to other state-of-the-art methods on datasets spanning a variety of languages and knowledge domains.
arXiv Detail & Related papers (2023-03-27T07:39:05Z)
- Personalized Federated Learning With Structure [24.566947384179837]
We propose a novel structured federated learning (SFL) framework to simultaneously learn the global model and personalized models.
Rather than relying on a pre-defined structure, the framework can be further enhanced with a structure-learning component that learns the structure automatically.
arXiv Detail & Related papers (2022-03-02T02:43:51Z)
- Neural Data-to-Text Generation with LM-based Text Augmentation [27.822282190362856]
We show that a weakly supervised training paradigm is able to outperform fully supervised seq2seq models with less than 10% annotations.
By utilizing all annotated data, our model can boost the performance of a standard seq2seq model by over 5 BLEU points.
arXiv Detail & Related papers (2021-02-06T10:21:48Z)
- Substructure Substitution: Structured Data Augmentation for NLP [55.69800855705232]
SUB2 generates new examples by substituting substructures with others that carry the same label.
For more general tasks, we present variations of SUB2 based on constituency parse trees.
In most cases, training with the SUB2-augmented dataset achieves better performance than training with the original training set (a hedged sketch of the substitution idea follows this entry).
arXiv Detail & Related papers (2021-01-02T09:54:24Z)
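As a concrete illustration of substructure substitution for a span-labeling task, the sketch below swaps a labeled span in one example with a same-label span drawn from another example and shifts the remaining span offsets accordingly. The data format and uniform sampling are assumptions for illustration; SUB2 also has constituency-tree variants that this sketch does not cover.

```python
# Minimal sketch of substructure substitution for span-labeled data.
# Each example is assumed to look like:
#   {"tokens": [...], "spans": [(start, end, label), ...]}  (end is exclusive)
import random

def sub2_augment(dataset, n_new=100, seed=0):
    rng = random.Random(seed)
    # Index every labeled span's tokens by label, so substitutes share the label.
    by_label = {}
    for ex in dataset:
        for start, end, label in ex["spans"]:
            by_label.setdefault(label, []).append(ex["tokens"][start:end])
    augmented = []
    for _ in range(n_new):
        ex = rng.choice(dataset)
        if not ex["spans"]:
            continue
        start, end, label = rng.choice(ex["spans"])
        replacement = rng.choice(by_label[label])
        tokens = ex["tokens"][:start] + replacement + ex["tokens"][end:]
        delta = len(replacement) - (end - start)
        # Shift spans located after the replaced one; keep earlier spans as-is.
        spans = [(s + delta if s >= end else s, e + delta if e > start else e, l)
                 for s, e, l in ex["spans"] if (s, e, l) != (start, end, label)]
        spans.append((start, start + len(replacement), label))
        augmented.append({"tokens": tokens, "spans": sorted(spans)})
    return augmented
```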
- SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.