A Data-centric Framework for Improving Domain-specific Machine Reading
Comprehension Datasets
- URL: http://arxiv.org/abs/2304.00483v2
- Date: Fri, 26 May 2023 05:43:19 GMT
- Title: A Data-centric Framework for Improving Domain-specific Machine Reading
Comprehension Datasets
- Authors: Iva Bojic, Josef Halim, Verena Suharman, Sreeja Tar, Qi Chwen Ong, Duy
Phung, Mathieu Ravaut, Shafiq Joty, Josip Car
- Abstract summary: Low-quality data can cause downstream problems in high-stakes applications.
The data-centric approach emphasizes improving dataset quality to enhance model performance.
- Score: 5.673449249014538
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Low-quality data can cause downstream problems in high-stakes applications.
The data-centric approach emphasizes improving dataset quality to enhance model
performance. High-quality datasets are needed both for training general-purpose
Large Language Models (LLMs) and for domain-specific models, which are usually
small because engaging many domain experts to create them is costly. It is
therefore vital to ensure high-quality domain-specific training data. In this
paper, we propose a framework for enhancing the data quality of original
datasets. We applied the proposed framework to four biomedical datasets and
showed relative improvements of up to 33%/40% when fine-tuning retrieval/reader
models on the BioASQ dataset, using back translation to enhance the original
dataset quality.
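The abstract names back translation as the augmentation step: translating text into a pivot language and back to obtain paraphrases that enlarge the training set. Below is a minimal sketch of that round-trip, assuming the Helsinki-NLP MarianMT checkpoints from Hugging Face transformers; the pivot language, checkpoints, and decoding settings are illustrative assumptions, not the authors' reported configuration.

```python
from transformers import MarianMTModel, MarianTokenizer

# Round-trip (back) translation through a pivot language. Checkpoints and
# pivot choice are illustrative assumptions, not the authors' setup.
EN_DE = "Helsinki-NLP/opus-mt-en-de"   # English -> German (pivot)
DE_EN = "Helsinki-NLP/opus-mt-de-en"   # German  -> English

tok_fwd = MarianTokenizer.from_pretrained(EN_DE)
mod_fwd = MarianMTModel.from_pretrained(EN_DE)
tok_bwd = MarianTokenizer.from_pretrained(DE_EN)
mod_bwd = MarianMTModel.from_pretrained(DE_EN)

def translate(texts, tokenizer, model):
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch, max_new_tokens=256)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

def back_translate(questions):
    """Paraphrase each question by translating it to the pivot and back."""
    return translate(translate(questions, tok_fwd, mod_fwd), tok_bwd, mod_bwd)

original = ["What is the function of hemoglobin in the blood?"]
print(back_translate(original))  # a paraphrase reusable with the original answer
```

A paraphrased question generated this way is paired with the original answer, growing the QA dataset without additional expert annotation; in practice one would also filter round trips whose meaning drifts too far from the source.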
Related papers
- Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
- Training Data for Large Language Model [2.1178416840822027]
ChatGPT surpassed previous models in terms of parameters and the scale of its pretraining corpus.
ChatGPT achieved revolutionary performance improvements through fine-tuning on a vast amount of high-quality, human-annotated data.
This paper summarizes the current state of pretraining and fine-tuning data for training large-scale language models.
arXiv Detail & Related papers (2024-11-12T11:09:58Z)
- Little Giants: Synthesizing High-Quality Embedding Data at Scale [71.352883755806]
We introduce SPEED, a framework that aligns open-source small models to efficiently generate large-scale embedding data.
SPEED uses less than 1/10 of the GPT API calls yet outperforms the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data.
arXiv Detail & Related papers (2024-10-24T10:47:30Z)
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection (a hedged sketch of the similarity-scoring idea follows this list).
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Leveraging Web-Crawled Data for High-Quality Fine-Tuning [24.19939701706869]
We argue that web-crawled data can still serve as a valuable source for high-quality supervised fine-tuning without relying on advanced models like GPT-4.
We create a paired training dataset automatically by aligning web-crawled data with a smaller set of high-quality data.
Our experiments show that training with the model-transformed data yields better results, surpassing training with only high-quality data by an average score of 9.4% in Chinese math problems.
arXiv Detail & Related papers (2024-08-15T08:12:52Z)
- Dataset Regeneration for Sequential Recommendation [69.93516846106701]
We propose a data-centric paradigm for developing an ideal training dataset using a model-agnostic dataset regeneration framework called DR4SR.
To demonstrate the effectiveness of the data-centric paradigm, we integrate our framework with various model-centric methods and observe significant performance improvements across four widely adopted datasets.
arXiv Detail & Related papers (2024-05-28T03:45:34Z)
- Enhancing Data Quality in Federated Fine-Tuning of Foundation Models [54.757324343062734]
We propose a data quality control pipeline for federated fine-tuning of foundation models.
This pipeline computes scores reflecting the quality of training data and determines a global threshold for a unified standard.
Our experiments show that the proposed quality control pipeline improves the effectiveness and reliability of model training, leading to better performance (a minimal sketch of the score-then-threshold pattern follows this list).
arXiv Detail & Related papers (2024-03-07T14:28:04Z)
- An Integrated Data Processing Framework for Pretraining Foundation Models [57.47845148721817]
Researchers and practitioners often have to manually curate datasets from different sources.
We propose a data processing framework that integrates a Processing Module and an Analyzing Module.
The proposed framework is easy to use and highly flexible.
arXiv Detail & Related papers (2024-02-26T07:22:51Z)
- On the Impact of Cross-Domain Data on German Language Models [20.758967185444416]
We present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data.
Through training a series of models ranging between 122M and 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks.
Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, with improvements of up to 4.45% over the previous state-of-the-art.
arXiv Detail & Related papers (2023-10-11T09:09:55Z)
- Assessing Dataset Quality Through Decision Tree Characteristics in Autoencoder-Processed Spaces [0.30458514384586394]
We show the profound impact of dataset quality on model training and performance.
Our findings underscore the importance of appropriate feature selection, adequate data volume, and data quality.
This research offers valuable insights into data assessment practices, contributing to the development of more accurate and robust machine learning models.
arXiv Detail & Related papers (2023-06-27T11:33:31Z)
- A Proposal to Study "Is High Quality Data All We Need?" [8.122270502556374]
We propose an empirical study that examines how to select a subset of existing data and/or create high-quality benchmark data.
We seek to answer if big datasets are truly needed to learn a task, and whether a smaller subset of high quality data can replace big datasets.
arXiv Detail & Related papers (2022-03-12T10:50:13Z)
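As an illustration of the CLIP-powered data selection entry above, here is a hedged sketch that scores image-text pairs by CLIP embedding cosine similarity and keeps the best-aligned fraction. The checkpoint name and the single-cutoff rule are assumptions for demonstration; the paper's framework uses multimodal information in a more elaborate way than one similarity threshold.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Hedged sketch: rank image-text pairs by CLIP embedding similarity and
# keep the top fraction. Checkpoint and cutoff rule are illustrative only.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_by_clip(images, texts, keep_ratio=0.5):
    """Return indices of the keep_ratio best-aligned (image, text) pairs."""
    inputs = processor(text=texts, images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize the projected embeddings and take pairwise cosine similarity.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    scores = (img * txt).sum(dim=-1)
    k = max(1, int(keep_ratio * len(texts)))
    return scores.topk(k).indices.tolist()

# Usage: pass PIL images with their captions; train only on the returned subset.
```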
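For the federated fine-tuning quality-control entry above, a minimal sketch of the score-then-threshold pattern it describes. The scorer `score_fn` is a hypothetical stand-in (e.g. a reference-model perplexity), and the quantile cutoff is an illustrative proxy for the pipeline's global threshold; neither is taken from the paper.

```python
from typing import Callable, List

def filter_by_quality(samples: List[str],
                      score_fn: Callable[[str], float],
                      quantile: float = 0.7) -> List[str]:
    """Keep samples whose quality score clears a global quantile threshold."""
    scores = [score_fn(s) for s in samples]
    cutoff = sorted(scores)[int(quantile * (len(scores) - 1))]
    return [s for s, sc in zip(samples, scores) if sc >= cutoff]

# Toy scorer (illustrative only): rewards longer, less repetitive text.
def toy_score(text: str) -> float:
    words = text.split()
    return len(set(words)) / max(1, len(words)) * len(words) ** 0.5

kept = filter_by_quality(
    ["detailed answer covering dosage, interactions and safety",
     "spam spam spam spam",
     "ok"],
    toy_score,
)
print(kept)  # only samples above the global cutoff survive
```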