A Data-centric Framework for Improving Domain-specific Machine Reading
Comprehension Datasets
- URL: http://arxiv.org/abs/2304.00483v2
- Date: Fri, 26 May 2023 05:43:19 GMT
- Title: A Data-centric Framework for Improving Domain-specific Machine Reading
Comprehension Datasets
- Authors: Iva Bojic, Josef Halim, Verena Suharman, Sreeja Tar, Qi Chwen Ong, Duy
Phung, Mathieu Ravaut, Shafiq Joty, Josip Car
- Abstract summary: Low-quality data can cause downstream problems in high-stakes applications.
A data-centric approach emphasizes improving dataset quality to enhance model performance.
- Score: 5.673449249014538
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Low-quality data can cause downstream problems in high-stakes applications.
A data-centric approach emphasizes improving dataset quality to enhance model
performance. High-quality datasets are needed for training general-purpose Large
Language Models (LLMs) as well as domain-specific models; domain-specific datasets
are usually small because it is costly to engage a large number of domain
experts to create them. Thus, it is vital to ensure high-quality
domain-specific training data. In this paper, we propose a framework for
enhancing the data quality of original datasets. We applied the proposed
framework to four biomedical datasets and showed relative improvements of up to
33% and 40% when fine-tuning retrieval and reader models, respectively, on the
BioASQ dataset, using back translation to enhance the original dataset quality.
Related papers
- Little Giants: Synthesizing High-Quality Embedding Data at Scale [71.352883755806]
We introduce SPEED, a framework that aligns open-source small models to efficiently generate large-scale embedding data.
SPEED uses less than 1/10 of the GPT API calls, outperforming the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data.
arXiv Detail & Related papers (2024-10-24T10:47:30Z)
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Leveraging Web-Crawled Data for High-Quality Fine-Tuning [24.19939701706869]
We argue that web-crawled data can still serve as a valuable source for high-quality supervised fine-tuning without relying on advanced models like GPT-4.
We create a paired training dataset automatically by aligning web-crawled data with a smaller set of high-quality data.
Our experiments show that training with the model-transformed data yields better results, surpassing training with only high-quality data by an average score of 9.4% in Chinese math problems.
arXiv Detail & Related papers (2024-08-15T08:12:52Z)
- Enhancing Data Quality in Federated Fine-Tuning of Foundation Models [54.757324343062734]
We propose a data quality control pipeline for federated fine-tuning of foundation models.
This pipeline computes scores reflecting the quality of training data and determines a global threshold for a unified standard.
Our experiments show that the proposed quality control pipeline improves the effectiveness and reliability of model training, leading to better performance.
arXiv Detail & Related papers (2024-03-07T14:28:04Z)
- An Integrated Data Processing Framework for Pretraining Foundation Models [57.47845148721817]
Researchers and practitioners often have to manually curate datasets from different sources.
We propose a data processing framework that integrates a Processing Module and an Analyzing Module.
The proposed framework is easy to use and highly flexible.
arXiv Detail & Related papers (2024-02-26T07:22:51Z)
- On the Impact of Cross-Domain Data on German Language Models [20.758967185444416]
We present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data.
Through training a series of models ranging between 122M and 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks.
Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements of up to 4.45% over the previous state-of-the-art.
arXiv Detail & Related papers (2023-10-11T09:09:55Z)
- Assessing Dataset Quality Through Decision Tree Characteristics in
Autoencoder-Processed Spaces [0.30458514384586394]
We show the profound impact of dataset quality on model training and performance.
Our findings underscore the importance of appropriate feature selection, adequate data volume, and data quality.
This research offers valuable insights into data assessment practices, contributing to the development of more accurate and robust machine learning models.
arXiv Detail & Related papers (2023-06-27T11:33:31Z)
- Expanding Small-Scale Datasets with Guided Imagination [92.5276783917845]
Dataset expansion is a new task aimed at expanding a ready-to-use small dataset by automatically creating new labeled samples.
GIF conducts data imagination by optimizing the latent features of the seed data in the semantically meaningful space of the prior model.
GIF-SD obtains 13.5% higher model accuracy on natural image datasets than unguided expansion with SD.
arXiv Detail & Related papers (2022-11-25T09:38:22Z)
- A Proposal to Study "Is High Quality Data All We Need?" [8.122270502556374]
We propose an empirical study that examines how to select a subset of and/or create high quality benchmark data.
We seek to answer if big datasets are truly needed to learn a task, and whether a smaller subset of high quality data can replace big datasets.
arXiv Detail & Related papers (2022-03-12T10:50:13Z)
- A Data-Centric Approach for Training Deep Neural Networks with Less Data [1.9014535120129343]
This paper summarizes our winning submission to the "Data-Centric AI" competition.
We discuss some of the challenges that arise while training with a small dataset.
We propose a GAN-based solution for synthesizing new data points.
arXiv Detail & Related papers (2021-10-07T16:41:52Z)
- Adaptive Weighting Scheme for Automatic Time-Series Data Augmentation [79.47771259100674]
We present two sample-adaptive automatic weighting schemes for data augmentation.
We validate our proposed methods on a large, noisy financial dataset and on time-series datasets from the UCR archive.
On the financial dataset, we show that the methods in combination with a trading strategy lead to improvements in annualized returns of over 50%, and on the time-series data we outperform state-of-the-art models on over half of the datasets and achieve similar accuracy on the others.
arXiv Detail & Related papers (2021-02-16T17:50:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.