Translation Error Detection as Rationale Extraction
- URL: http://arxiv.org/abs/2108.12197v1
- Date: Fri, 27 Aug 2021 09:35:14 GMT
- Title: Translation Error Detection as Rationale Extraction
- Authors: Marina Fomicheva, Lucia Specia, Nikolaos Aletras
- Abstract summary: We study the behaviour of state-of-the-art sentence-level QE models and show that explanations can indeed be used to detect translation errors.
We (i) introduce a novel semi-supervised method for word-level QE and (ii) propose to use the QE task as a new benchmark for evaluating the plausibility of feature attribution.
- Score: 36.616561917049076
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent Quality Estimation (QE) models based on multilingual pre-trained
representations have achieved very competitive results when predicting the
overall quality of translated sentences. Predicting translation errors, i.e.
detecting specifically which words are incorrect, is a more challenging task,
especially with limited amounts of training data. We hypothesize that, not
unlike humans, successful QE models rely on translation errors to predict
overall sentence quality. By exploring a set of feature attribution methods
that assign relevance scores to the inputs to explain model predictions, we
study the behaviour of state-of-the-art sentence-level QE models and show that
explanations (i.e. rationales) extracted from these models can indeed be used
to detect translation errors. We therefore (i) introduce a novel
semi-supervised method for word-level QE and (ii) propose to use the QE task as
a new benchmark for evaluating the plausibility of feature attribution, i.e.
how interpretable model explanations are to humans.
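The core mechanism described above, attributing a sentence-level quality prediction back to individual input tokens, can be illustrated with a short sketch. This is a minimal example under stated assumptions, not the authors' implementation: the backbone checkpoint, the gradient-x-input attribution choice, and the mean-plus-std threshold are all illustrative, and in practice the encoder would first be fine-tuned on sentence-level QE data.

```python
# Minimal sketch (assumptions: xlm-roberta-base backbone, gradient-x-input
# attribution, mean+std flagging rule). Not the paper's exact setup.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "xlm-roberta-base"  # a QE-fine-tuned checkpoint would be used in practice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
model.eval()

def token_relevance(source: str, translation: str):
    """Attribute the predicted sentence-level quality score to input tokens."""
    enc = tokenizer(source, translation, return_tensors="pt")
    embeds = model.get_input_embeddings()(enc["input_ids"])
    embeds.retain_grad()  # keep gradients for this non-leaf tensor
    score = model(inputs_embeds=embeds,
                  attention_mask=enc["attention_mask"]).logits.squeeze()
    score.backward()  # d(quality score) / d(token embeddings)
    relevance = (embeds.grad * embeds).sum(dim=-1)[0].abs()  # gradient x input
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return tokens, relevance

tokens, rel = token_relevance("Das ist ein Haus.", "This is a horse.")
# Tokens with unusually high relevance are candidate translation errors.
flagged = [t for t, r in zip(tokens, rel) if r > rel.mean() + rel.std()]
```

Plausibility of such rationales could then be quantified by comparing the relevance scores against gold word-level error annotations, e.g. with ranking metrics such as AUC, which is the kind of evaluation the proposed benchmark enables.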
Related papers
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Improving Machine Translation with Human Feedback: An Exploration of Quality Estimation as a Reward Model [75.66013048128302]
In this work, we investigate the potential of employing the QE model as the reward model to predict human preferences for feedback training.
We first identify the overoptimization problem during QE-based feedback training, manifested as an increase in reward while translation quality declines.
To address the problem, we adopt a simple yet effective method that uses rules to detect incorrect translations and assigns a penalty term to their reward scores (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2024-01-23T16:07:43Z)
- Towards Fine-Grained Information: Identifying the Type and Location of Translation Errors [80.22825549235556]
Existing approaches cannot jointly consider error position and type.
We build an FG-TED model to predict both addition and omission errors.
Experiments show that our model can identify error type and position concurrently, achieving state-of-the-art results.
arXiv Detail & Related papers (2023-02-17T16:20:33Z)
- Pathologies of Pre-trained Language Models in Few-shot Fine-tuning [50.3686606679048]
We show that pre-trained language models fine-tuned with few examples exhibit strong prediction bias across labels.
Although few-shot fine-tuning can mitigate this prediction bias, our analysis shows that the models gain their performance improvement by capturing non-task-related features.
These observations warn that pursuing model performance with fewer examples may incur pathological prediction behaviour.
arXiv Detail & Related papers (2022-04-17T15:55:18Z)
- Classification-based Quality Estimation: Small and Efficient Models for Real-world Applications [29.380675447523817]
Sentence-level Quality Estimation (QE) of machine translation is traditionally formulated as a regression task.
Recent QE models have achieved previously-unseen levels of correlation with human judgments.
We evaluate several model compression techniques for QE and find that, despite their popularity in other NLP tasks, they lead to poor performance in this regression setting.
arXiv Detail & Related papers (2021-09-17T16:14:52Z)
- NoiER: An Approach for Training more Reliable Fine-Tuned Downstream Task Models [54.184609286094044]
We propose noise entropy regularisation (NoiER) as an efficient learning paradigm that solves the problem without auxiliary models and additional data.
The proposed approach improved traditional OOD detection evaluation metrics by 55% on average compared to the original fine-tuned models.
arXiv Detail & Related papers (2021-08-29T06:58:28Z)
- MDQE: A More Accurate Direct Pretraining for Machine Translation Quality Estimation [4.416484585765028]
We argue that there are still gaps between the predictor and the estimator in both data quality and training objectives.
We propose a novel framework that provides a more accurate direct pretraining for QE tasks.
arXiv Detail & Related papers (2021-07-24T09:48:37Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
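To make the rule-plus-penalty idea referenced above concrete, here is a minimal, hedged sketch of using a QE score as a reward with a rule-based penalty for detected incorrect translations. The specific rules (empty output, source copying, runaway length) and the penalty value are illustrative assumptions, not the exact method described in that paper.

```python
# Hedged sketch: QE score as an RL reward, reduced when simple heuristic rules
# flag the hypothesis as an incorrect translation. Rules and penalty are assumptions.
def penalized_reward(qe_score: float, source: str, hypothesis: str,
                     penalty: float = 1.0) -> float:
    """Return the QE-based reward, minus a penalty when a rule fires."""
    src_len = max(len(source.split()), 1)
    hyp_len = len(hypothesis.split())
    rule_fired = (
        hyp_len == 0                               # empty translation
        or hypothesis.strip() == source.strip()    # source copied verbatim
        or hyp_len > 3 * src_len                   # implausibly long output
    )
    return qe_score - penalty if rule_fired else qe_score
```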
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.