Task Oriented In-Domain Data Augmentation
- URL: http://arxiv.org/abs/2406.16694v1
- Date: Mon, 24 Jun 2024 14:58:11 GMT
- Title: Task Oriented In-Domain Data Augmentation
- Authors: Xiao Liang, Xinyu Hu, Simiao Zuo, Yeyun Gong, Qiang Lou, Yi Liu, Shao-Lun Huang, Jian Jiao,
- Abstract summary: Large Language Models (LLMs) have shown superior performance in various applications and fields.
To achieve better performance on specialized domains such as law and advertisement, LLMs are often continue pre-trained on in-domain data.
We propose TRAIT, a task-oriented in-domain data augmentation framework.
- Score: 38.525017729123114
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have shown superior performance in various applications and fields. To achieve better performance on specialized domains such as law and advertisement, LLMs are often continue pre-trained on in-domain data. However, existing approaches suffer from two major issues. First, in-domain data are scarce compared with general domain-agnostic data. Second, data used for continual pre-training are not task-aware, such that they may not be helpful to downstream applications. We propose TRAIT, a task-oriented in-domain data augmentation framework. Our framework is divided into two parts: in-domain data selection and task-oriented synthetic passage generation. The data selection strategy identifies and selects a large amount of in-domain data from general corpora, and thus significantly enriches domain knowledge in the continual pre-training data. The synthetic passages contain guidance on how to use domain knowledge to answer questions about downstream tasks. By training on such passages, the model aligns with the need of downstream applications. We adapt LLMs to two domains: advertisement and math. On average, TRAIT improves LLM performance by 8% in the advertisement domain and 7.5% in the math domain.
Related papers
- DoPAMine: Domain-specific Pre-training Adaptation from seed-guided data Mining [2.1534028009401713]
Large Language Models (LLMs) have shown ability to generalize effectively across numerous industry domains.
LLMs exhibit limitations when tasked with performing in specialized or low-resource industry domains.
In this work, we propose an automated and scalable framework - DoPAMine:Domain-specific Pre-training Adaptation from seed-guided data Mining.
arXiv Detail & Related papers (2024-09-30T22:15:58Z) - How to Encode Domain Information in Relation Classification [28.006694890849374]
Current language models require a lot of training data to obtain high performance.
For Relation Classification (RC), many datasets are domain-specific.
We explore a multi-domain training setup for RC, and attempt to improve performance by encoding domain information.
arXiv Detail & Related papers (2024-04-21T20:16:35Z) - M2D2: A Massively Multi-domain Language Modeling Dataset [76.13062203588089]
We present M2D2, a fine-grained, massively multi-domain corpus for studying domain adaptation (LMs)
Using categories derived from Wikipedia and ArXiv, we organize the domains in each data source into 22 groups.
We show the benefits of adapting the LM along a domain hierarchy; adapting to smaller amounts of fine-grained domain-specific data can lead to larger in-domain performance gains.
arXiv Detail & Related papers (2022-10-13T21:34:52Z) - Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised
Pre-Training [67.71228426496013]
We show that using target domain data during pre-training leads to large performance improvements across a variety of setups.
We find that pre-training on multiple domains improves performance generalization on domains not seen during training.
arXiv Detail & Related papers (2021-04-02T12:53:15Z) - Batch Normalization Embeddings for Deep Domain Generalization [50.51405390150066]
Domain generalization aims at training machine learning models to perform robustly across different and unseen domains.
We show a significant increase in classification accuracy over current state-of-the-art techniques on popular domain generalization benchmarks.
arXiv Detail & Related papers (2020-11-25T12:02:57Z) - Multi-Domain Spoken Language Understanding Using Domain- and Task-Aware
Parameterization [78.93669377251396]
Spoken language understanding has been addressed as a supervised learning problem, where a set of training data is available for each domain.
One existing approach solves the problem by conducting multi-domain learning, using shared parameters for joint training across domains.
We propose to improve the parameterization of this method by using domain-specific and task-specific model parameters.
arXiv Detail & Related papers (2020-04-30T15:15:40Z) - Mind the Gap: Enlarging the Domain Gap in Open Set Domain Adaptation [65.38975706997088]
Open set domain adaptation (OSDA) assumes the presence of unknown classes in the target domain.
We show that existing state-of-the-art methods suffer a considerable performance drop in the presence of larger domain gaps.
We propose a novel framework to specifically address the larger domain gaps.
arXiv Detail & Related papers (2020-03-08T14:20:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.