Tailoring Domain Adaptation for Machine Translation Quality Estimation
- URL: http://arxiv.org/abs/2304.08891v2
- Date: Tue, 9 May 2023 08:34:19 GMT
- Title: Tailoring Domain Adaptation for Machine Translation Quality Estimation
- Authors: Javad Pourmostafa Roshan Sharami, Dimitar Shterionov, Frédéric
Blain, Eva Vanmassenhove, Mirella De Sisto, Chris Emmery, Pieter Spronck
- Abstract summary: This paper combines domain adaptation and data augmentation within a robust QE system.
We show a significant improvement for all the language pairs investigated, better cross-lingual inference, and a superior performance in zero-shot learning scenarios.
- Score: 1.8780017602640042
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While quality estimation (QE) can play an important role in the translation
process, its effectiveness relies on the availability and quality of training
data. For QE in particular, high-quality labeled data is often lacking due to
the high cost and effort associated with labeling such data. Aside from the
data scarcity challenge, QE models should also be generalizable, i.e., they
should be able to handle data from different domains, both generic and
specific. To alleviate these two main issues -- data scarcity and domain
mismatch -- this paper combines domain adaptation and data augmentation within
a robust QE system. Our method first trains a generic QE model and then
fine-tunes it on a specific domain while retaining generic knowledge. Our
results show a significant improvement for all the language pairs investigated,
better cross-lingual inference, and a superior performance in zero-shot
learning scenarios as compared to state-of-the-art baselines.
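The two-stage recipe described above (train a generic QE model, then fine-tune on in-domain data while retaining generic knowledge) can be sketched with a toy linear quality scorer. This is a minimal, dependency-free illustration, not the paper's actual neural architecture: the synthetic features, labels, and the L2 pull toward the generic weights are all illustrative stand-ins for whatever retention mechanism the authors actually use.

```python
# Sketch of two-stage QE training: (1) fit a generic model, (2) fine-tune
# on a small in-domain set while anchoring the weights to the generic
# solution so that generic knowledge is not overwritten.

def train(data, w, lr=0.1, epochs=200, anchor=None, lam=0.0):
    """SGD on squared error; optionally pull w toward `anchor` (L2 penalty)."""
    for _ in range(epochs):
        for x, y in data:
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - y
            for i in range(len(w)):
                grad = err * x[i]
                if anchor is not None:
                    grad += lam * (w[i] - anchor[i])  # retain generic knowledge
                w[i] -= lr * grad
    return w

# Toy "generic" QE data: feature vector -> quality score in [0, 1]
generic = [([1.0, 0.0], 0.9), ([0.0, 1.0], 0.2), ([1.0, 1.0], 0.6)]
# Small in-domain set with a slightly shifted quality distribution
domain = [([1.0, 0.0], 0.8), ([0.0, 1.0], 0.4)]

w_generic = train(generic, [0.0, 0.0])
# Fine-tune a copy of the generic weights, anchored to the generic solution
w_domain = train(domain, list(w_generic), anchor=w_generic, lam=0.5)
```

After fine-tuning, `w_domain` sits between the generic solution and the one the domain data alone would give, which is the intuition behind retaining generic knowledge while adapting.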
Related papers
- Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction [54.23208041792073]
Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review.
A key challenge in the ASQP task is the scarcity of labeled data, which limits the performance of existing methods.
We propose a self-training framework with a pseudo-label scorer, wherein a scorer assesses the match between reviews and their pseudo-labels.
arXiv Detail & Related papers (2024-06-26T05:30:21Z)
- AI-Driven Frameworks for Enhancing Data Quality in Big Data Ecosystems: Error Detection, Correction, and Metadata Integration [0.0]
This thesis proposes a novel set of interconnected frameworks aimed at enhancing big data quality comprehensively.
Firstly, we introduce new quality metrics and a weighted scoring system for precise data quality assessment.
Thirdly, we present a generic framework for detecting various quality anomalies using AI models.
arXiv Detail & Related papers (2024-05-06T21:36:45Z)
- Dial-insight: Fine-tuning Large Language Models with High-Quality Domain-Specific Data Preventing Capability Collapse [4.98050508891467]
We propose a two-stage approach for the construction of production prompts designed to yield high-quality data.
This method involves the generation of a diverse array of prompts that encompass a broad spectrum of tasks and exhibit a rich variety of expressions.
We introduce a cost-effective, multi-dimensional quality assessment framework to ensure the integrity of the generated labeling data.
arXiv Detail & Related papers (2024-03-14T08:27:32Z)
- Language Modelling Approaches to Adaptive Machine Translation [0.0]
Consistency is a key requirement of high-quality translation.
In-domain data scarcity is common in translation settings.
Can we employ language models to improve the quality of adaptive MT at inference time?
arXiv Detail & Related papers (2024-01-25T23:02:54Z)
- PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z)
- A Data-centric Framework for Improving Domain-specific Machine Reading Comprehension Datasets [5.673449249014538]
Low-quality data can cause downstream problems in high-stakes applications.
A data-centric approach emphasizes improving dataset quality to enhance model performance.
arXiv Detail & Related papers (2023-04-02T08:26:38Z)
- Original or Translated? On the Use of Parallel Data for Translation Quality Estimation [81.27850245734015]
We demonstrate a significant gap between parallel data and real QE data.
Parallel data is collected indiscriminately, so translationese may occur on either the source or the target side.
We find that using the source-original part of parallel corpus consistently outperforms its target-original counterpart.
arXiv Detail & Related papers (2022-12-20T14:06:45Z)
- Learning to Perturb Word Embeddings for Out-of-distribution QA [55.103586220757464]
We propose a simple yet effective DA method based on a noise generator, which learns to perturb the word embedding of the input questions and context without changing their semantics.
We validate the performance of QA models trained with our word embedding perturbations on a single source dataset, evaluating on five different target domains.
Notably, the model trained with ours outperforms the model trained with more than 240K artificially generated QA pairs.
arXiv Detail & Related papers (2021-05-06T14:12:26Z)
- Generating Diverse and Consistent QA pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose a hierarchical conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts.
Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z)
- Logic-Guided Data Augmentation and Regularization for Consistent Question Answering [55.05667583529711]
This paper addresses the problem of improving the accuracy and consistency of responses to comparison questions.
Our method leverages logical and linguistic knowledge to augment labeled training data and then uses a consistency-based regularizer to train the model.
arXiv Detail & Related papers (2020-04-21T17:03:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.