A New Tool for Efficiently Generating Quality Estimation Datasets
- URL: http://arxiv.org/abs/2111.00767v1
- Date: Mon, 1 Nov 2021 08:37:30 GMT
- Title: A New Tool for Efficiently Generating Quality Estimation Datasets
- Authors: Sugyeong Eo, Chanjun Park, Jaehyung Seo, Hyeonseok Moon, Heuiseok Lim
- Abstract summary: Building data for quality estimation (QE) training is expensive and requires significant human labor.
We propose a fully automatic pseudo-QE dataset generation tool that produces QE datasets from only a monolingual or parallel corpus as input.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Building data for quality estimation (QE) training is expensive and
requires significant human labor. In this study, we focus on a data-centric
approach to QE and propose a fully automatic pseudo-QE dataset generation tool
that produces QE datasets from only a monolingual or parallel corpus as input.
Consequently, QE performance is enhanced either by data augmentation or by
extending QE to multiple language pairs. Further, we intend to publicly release
this user-friendly QE dataset generation tool, as we believe it offers the
community a new, inexpensive method for developing QE datasets.
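The abstract describes turning an ordinary parallel corpus into pseudo-QE training data without human annotation. A minimal sketch of that idea, assuming a hypothetical `mt_system` translation function and using token-overlap F1 as a stand-in pseudo-label (the paper's actual tool and scoring method may differ; all names here are illustrative):

```python
from collections import Counter

def token_f1(hyp: str, ref: str) -> float:
    """Token-overlap F1 between an MT hypothesis and the reference --
    a simple stand-in for the HTER/BLEU-style pseudo-labels a real
    QE data generator would compute."""
    hyp_toks, ref_toks = hyp.split(), ref.split()
    if not hyp_toks or not ref_toks:
        return 0.0
    overlap = sum((Counter(hyp_toks) & Counter(ref_toks)).values())
    if overlap == 0:
        return 0.0
    p = overlap / len(hyp_toks)
    r = overlap / len(ref_toks)
    return 2 * p * r / (p + r)

def make_pseudo_qe(parallel_corpus, mt_system):
    """Turn (source, reference) pairs into (source, MT output,
    pseudo-score) triples, i.e. a pseudo-QE dataset."""
    dataset = []
    for src, ref in parallel_corpus:
        hyp = mt_system(src)  # machine-translate the source sentence
        dataset.append((src, hyp, token_f1(hyp, ref)))
    return dataset

# Toy stand-in for an MT system; a real pipeline would call an NMT model.
def toy_mt(src: str) -> str:
    return src.replace("Hallo", "Hello")

corpus = [("Hallo Welt", "Hello world")]
print(make_pseudo_qe(corpus, toy_mt))
```

Given only a monolingual corpus, the same recipe applies after first producing references via round-trip or forward translation; the key point is that the quality labels come from automatic comparison rather than human post-editing.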
Related papers
- Automatic Question-Answer Generation for Long-Tail Knowledge [65.11554185687258]
We propose an automatic approach to generate specialized QA datasets for tail entities.
We conduct extensive experiments by employing pretrained LLMs on our newly generated long-tail QA datasets.
arXiv Detail & Related papers (2024-03-03T03:06:31Z)
- QASnowball: An Iterative Bootstrapping Framework for High-Quality Question-Answering Data Generation [67.27999343730224]
We introduce QASnowball, an iterative bootstrapping framework for QA data augmentation.
QASnowball can iteratively generate large-scale high-quality QA data based on a seed set of supervised examples.
We conduct experiments in the high-resource English scenario and the medium-resource Chinese scenario, and the experimental results show that the data generated by QASnowball can facilitate QA models.
arXiv Detail & Related papers (2023-09-19T05:20:36Z)
- PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z)
- QUAK: A Synthetic Quality Estimation Dataset for Korean-English Neural Machine Translation [5.381552585149967]
Quality estimation (QE) aims to automatically predict the quality of machine translation (MT) output without reference sentences.
Despite its high utility in the real world, there remain several limitations concerning manual QE data creation.
We present QUAK, a Korean-English synthetic QE dataset generated in a fully automatic manner.
arXiv Detail & Related papers (2022-09-30T07:47:44Z)
- Image Quality Assessment: Integrating Model-Centric and Data-Centric Approaches [20.931709027443706]
Learning-based image quality assessment (IQA) has made remarkable progress in the past decade.
Nearly all existing methods consider the two key components, model and data, in isolation.
arXiv Detail & Related papers (2022-07-29T16:23:57Z)
- QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization [116.56171113972944]
We show that carefully choosing the components of a QA-based metric is critical to performance.
Our solution improves upon the best-performing entailment-based metric and achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-12-16T00:38:35Z)
- MDQE: A More Accurate Direct Pretraining for Machine Translation Quality Estimation [4.416484585765028]
We argue that there are still gaps between the predictor and the estimator in both data quality and training objectives.
We propose a novel framework that provides a more accurate direct pretraining for QE tasks.
arXiv Detail & Related papers (2021-07-24T09:48:37Z)
- DirectQE: Direct Pretraining for Machine Translation Quality Estimation [41.187833219223336]
We argue that there are gaps between the predictor and the estimator in both data quality and training objectives.
We propose a novel framework called DirectQE that provides a direct pretraining for QE tasks.
arXiv Detail & Related papers (2021-05-15T06:18:49Z)
- Generating Diverse and Consistent QA pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose a hierarchical conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts.
Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z)
- Template-Based Question Generation from Retrieved Sentences for Improved Unsupervised Question Answering [98.48363619128108]
We propose an unsupervised approach to training QA models with generated pseudo-training data.
We show that generating questions for QA training by applying a simple template on a related, retrieved sentence rather than the original context sentence improves downstream QA performance.
arXiv Detail & Related papers (2020-04-24T17:57:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the listed information and is not responsible for any consequences of its use.