A New Pipeline For Generating Instruction Dataset via RAG and Self Fine-Tuning
- URL: http://arxiv.org/abs/2408.05911v1
- Date: Mon, 12 Aug 2024 03:52:11 GMT
- Title: A New Pipeline For Generating Instruction Dataset via RAG and Self Fine-Tuning
- Authors: Chih-Wei Song, Yu-Kai Lee, Yin-Te Tsai
- Abstract summary: This research proposes a pipeline to construct high-quality instruction datasets for fine-tuning on specific domains.
By ingesting domain-specific documents, the pipeline generates relevant and contextually appropriate instructions.
As a case study, we apply this approach to the domain of psychiatry, a field requiring specialized knowledge and sensitive handling of patient information.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid development of large language models in recent years, there has been an increasing demand for domain-specific Agents that can cater to the unique needs of enterprises and organizations. Unlike general models, which strive for broad coverage, these specialized Agents rely on focused datasets tailored to their intended applications. This research proposes a pipeline that leverages the power of LLMs and a Retrieval-Augmented Generation (RAG) framework to construct high-quality instruction datasets for fine-tuning on specific domains using custom document collections. By ingesting domain-specific documents, the pipeline generates relevant and contextually appropriate instructions, thus effectively creating a comprehensive dataset for fine-tuning LLMs on the target domain. This approach overcomes the limitations of traditional dataset creation methods, which often rely on manual curation or web-scraping techniques that may introduce noise and irrelevant data. Notably, our pipeline offers a dynamic solution that can quickly adapt to updates or modifications in the domain-specific document collection, eliminating the need for complete retraining. Additionally, it addresses the challenge of data scarcity by enabling the generation of instruction datasets from a limited set of initial documents, rendering it suitable for unpopular or specialized domains where comprehensive datasets are scarce. As a case study, we apply this approach to the domain of psychiatry, a field requiring specialized knowledge and sensitive handling of patient information. The resulting fine-tuned LLM demonstrates the viability of the proposed approach and underscores its potential for widespread adoption across various industries and domains where tailored, accurate, and contextually relevant language models are indispensable.
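The abstract describes the pipeline only at a high level. Below is a minimal sketch of one plausible shape for it: chunk each domain document, prompt an LLM to write instruction-response pairs grounded in each chunk, and collect the pairs into a fine-tuning file. The helper names (call_llm, chunk, generate_pairs), the prompt wording, and the JSONL output format are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a RAG-style instruction-dataset pipeline: ingest
# domain documents, generate grounded instruction-response pairs, emit JSONL.
import json
from pathlib import Path

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to whichever LLM is used."""
    raise NotImplementedError("wire up a model endpoint here")

def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split a document into overlapping passages to keep prompts grounded."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def generate_pairs(passage: str, n: int = 3) -> list[dict]:
    """Ask the LLM for instruction-response pairs grounded in one passage."""
    prompt = (
        f"Using only the passage below, write {n} instruction-response pairs "
        'as a JSON list of objects with keys "instruction" and "response".\n\n'
        f"Passage:\n{passage}"
    )
    return json.loads(call_llm(prompt))

def build_dataset(doc_dir: str, out_path: str) -> None:
    """Ingest a domain document collection and emit a JSONL training file."""
    with open(out_path, "w", encoding="utf-8") as out:
        for doc in sorted(Path(doc_dir).glob("*.txt")):
            for passage in chunk(doc.read_text(encoding="utf-8")):
                for pair in generate_pairs(passage):
                    out.write(json.dumps(pair, ensure_ascii=False) + "\n")
```

Because the dataset is regenerated directly from the document collection, rerunning build_dataset after the corpus changes refreshes the instructions, which matches the paper's claim of adapting to document updates without complete retraining.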
Related papers
- Exploring User Retrieval Integration towards Large Language Models for Cross-Domain Sequential Recommendation [66.72195610471624]
Cross-Domain Sequential Recommendation aims to mine and transfer users' sequential preferences across different domains.
We propose a novel framework named URLLM, which aims to improve CDSR performance by exploring the User Retrieval approach.
arXiv Detail & Related papers (2024-06-05T09:19:54Z)
- ProgGen: Generating Named Entity Recognition Datasets Step-by-step with Self-Reflexive Large Language Models [25.68491572293656]
Large Language Models fall short in structured knowledge extraction tasks such as named entity recognition.
This paper explores an innovative, cost-efficient strategy to harness LLMs with modest NER capabilities for producing superior NER datasets.
arXiv Detail & Related papers (2024-03-17T06:12:43Z)
- Dial-insight: Fine-tuning Large Language Models with High-Quality Domain-Specific Data Preventing Capability Collapse [4.98050508891467]
We propose a two-stage approach for the construction of production prompts designed to yield high-quality data.
This method involves the generation of a diverse array of prompts that encompass a broad spectrum of tasks and exhibit a rich variety of expressions.
We introduce a cost-effective, multi-dimensional quality assessment framework to ensure the integrity of the generated labeling data.
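The two stages are only summarized above; here is a hypothetical sketch of the generate-then-assess pattern. The seed tasks, style list, quality dimensions, scoring stub, and threshold are all assumptions for illustration, not details from the paper.

```python
# Hypothetical two-stage flow: (1) cross task seeds with expression styles
# to get a diverse prompt pool, (2) keep only prompts whose averaged
# multi-dimensional quality score clears a bar. A real scorer would be a
# learned model or an LLM judge rather than this constant stub.
import itertools

TASK_SEEDS = ["summarize a client call", "draft a follow-up email"]
STYLES = ["formal", "concise", "step-by-step"]

def stage1_generate() -> list[str]:
    """Stage 1: expand seeds into varied expressions of each task."""
    return [f"In a {style} style, {task}."
            for style, task in itertools.product(STYLES, TASK_SEEDS)]

def score(prompt: str) -> dict[str, float]:
    """Stub for per-dimension quality scores (assumed dimensions)."""
    return {"clarity": 1.0, "task_coverage": 1.0, "expression_variety": 1.0}

def stage2_filter(prompts: list[str], threshold: float = 0.7) -> list[str]:
    """Stage 2: average the dimensions and drop prompts below the bar."""
    kept = []
    for prompt in prompts:
        scores = score(prompt)
        if sum(scores.values()) / len(scores) >= threshold:
            kept.append(prompt)
    return kept

print(stage2_filter(stage1_generate()))
```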
arXiv Detail & Related papers (2024-03-14T08:27:32Z)
- LLM Based Multi-Agent Generation of Semi-structured Documents from Semantic Templates in the Public Administration Domain [2.3999111269325266]
Large Language Models (LLMs) have enabled the creation of customized text output satisfying user requests.
We propose a novel approach that combines the LLMs with prompt engineering and multi-agent systems for generating new documents compliant with a desired structure.
arXiv Detail & Related papers (2024-02-21T13:54:53Z)
- Explore-Instruct: Enhancing Domain-Specific Instruction Coverage through Active Exploration [64.58185031596169]
Explore-Instruct is a novel approach to enhancing data coverage in domain-specific instruction-tuning.
Our data-centric analysis validates the effectiveness of this proposed approach in improving domain-specific instruction coverage.
Our findings offer a promising opportunity to improve instruction coverage, especially in domain-specific contexts.
arXiv Detail & Related papers (2023-10-13T15:03:15Z)
- Fine-tuning Large Enterprise Language Models via Ontological Reasoning [5.12835891233968]
Large Language Models (LLMs) adapt to diverse goals through fine-tuning on task-specific training data.
We propose a novel neurosymbolic architecture that leverages the power of ontological reasoning to build task- and domain-specific corpora for LLM fine-tuning.
arXiv Detail & Related papers (2023-06-19T06:48:45Z)
- An Iterative Optimizing Framework for Radiology Report Summarization with ChatGPT [80.33783969507458]
The 'Impression' section of a radiology report is a critical basis for communication between radiologists and other physicians.
Recent studies have achieved promising results in automatic impression generation using large-scale medical text data.
However, these models often require substantial amounts of medical text data and generalize poorly.
arXiv Detail & Related papers (2023-04-17T17:13:42Z)
- Domain-incremental Cardiac Image Segmentation with Style-oriented Replay and Domain-sensitive Feature Whitening [67.6394526631557]
In real-world deployment, a model should incrementally learn from each incoming dataset and progressively update itself with improved functionality over time.
In medical scenarios, this is particularly challenging as accessing or storing past data is commonly not allowed due to data privacy.
We propose a novel domain-incremental learning framework to recover past domain inputs first and then regularly replay them during model optimization.
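The style-oriented recovery of past inputs is the paper's own contribution, but the replay half of the idea is generic. A minimal sketch follows, assuming a fixed-capacity buffer with random eviction and a fixed mix ratio (all assumptions, not the paper's settings).

```python
# Generic replay sketch: store recovered past-domain samples and interleave
# them with the current domain's batch during optimization. How the samples
# are recovered (the style-oriented part) is abstracted away entirely.
import random

class ReplayBuffer:
    def __init__(self, capacity: int = 512):
        self.capacity = capacity
        self.samples: list = []

    def add(self, sample) -> None:
        """Store a recovered past-domain sample, evicting at random when full."""
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            self.samples[random.randrange(self.capacity)] = sample

    def mix(self, current_batch: list, ratio: float = 0.25) -> list:
        """Replace a fraction of the current batch with replayed samples."""
        k = min(int(len(current_batch) * ratio), len(self.samples))
        if k == 0:
            return current_batch
        return current_batch[:-k] + random.sample(self.samples, k)
```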
arXiv Detail & Related papers (2022-11-09T13:07:36Z)
- A Multi-Format Transfer Learning Model for Event Argument Extraction via Variational Information Bottleneck [68.61583160269664]
Event argument extraction (EAE) aims to extract arguments with given roles from texts.
We propose a multi-format transfer learning model with variational information bottleneck.
We conduct extensive experiments on three benchmark datasets, and obtain new state-of-the-art performance on EAE.
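The multi-format transfer model itself is not described here, but the variational information bottleneck it builds on has a standard form: encode features into a Gaussian latent, penalize the latent's KL divergence from a unit prior so that input-specific noise is compressed away, and train the task head on sampled latents. A generic PyTorch sketch, with beta and the layer sizes as assumed hyperparameters rather than the paper's:

```python
# Standard variational information bottleneck head (generic sketch, not the
# paper's architecture): loss = task cross-entropy + beta * KL(q(z|x) || N(0,I)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBHead(nn.Module):
    def __init__(self, in_dim: int, z_dim: int, n_labels: int):
        super().__init__()
        self.to_stats = nn.Linear(in_dim, 2 * z_dim)  # predicts mu and log-variance
        self.classify = nn.Linear(z_dim, n_labels)

    def forward(self, h: torch.Tensor, y: torch.Tensor, beta: float = 1e-3) -> torch.Tensor:
        mu, logvar = self.to_stats(h).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        # KL to the unit Gaussian prior compresses input-specific detail out of z
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1).mean()
        return F.cross_entropy(self.classify(z), y) + beta * kl
```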
arXiv Detail & Related papers (2022-08-27T13:52:01Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical than the fully-aligned setting since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models and verifies the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
- Data Techniques For Online End-to-end Speech Recognition [17.621967685914587]
Practitioners often need to build ASR systems for new use cases in a short amount of time, given limited in-domain data.
While recently developed end-to-end methods largely simplify the modeling pipelines, they still suffer from the data sparsity issue.
We explore a few simple-to-implement techniques for building online ASR systems in an end-to-end fashion, with a small amount of transcribed data in the target domain.
arXiv Detail & Related papers (2020-01-24T22:59:46Z)