Classification-based Quality Estimation: Small and Efficient Models for
Real-world Applications
- URL: http://arxiv.org/abs/2109.08627v1
- Date: Fri, 17 Sep 2021 16:14:52 GMT
- Title: Classification-based Quality Estimation: Small and Efficient Models for
Real-world Applications
- Authors: Shuo Sun, Ahmed El-Kishky, Vishrav Chaudhary, James Cross, Francisco Guzmán, Lucia Specia
- Abstract summary: Sentence-level quality estimation (QE) of machine translation is traditionally formulated as a regression task.
Recent QE models have achieved previously-unseen levels of correlation with human judgments.
We evaluate several model compression techniques for QE and find that, despite their popularity in other NLP tasks, they lead to poor performance in this regression setting.
- Score: 29.380675447523817
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sentence-level quality estimation (QE) of machine translation is
traditionally formulated as a regression task, and the performance of QE models
is typically measured by Pearson correlation with human labels. Recent QE
models have achieved previously-unseen levels of correlation with human
judgments, but they rely on large multilingual contextualized language models
that are computationally expensive, making them infeasible for real-world
applications. In this work, we evaluate several model compression techniques
for QE and find that, despite their popularity in other NLP tasks, they lead to
poor performance in this regression setting. We observe that a full model
parameterization is required to achieve SoTA results in a regression task.
However, we argue that the level of expressiveness of a model in a continuous
range is unnecessary given the downstream applications of QE, and show that
reframing QE as a classification problem and evaluating QE models using
classification metrics would better reflect their actual performance in
real-world applications.
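As a rough illustration of the reframing the abstract argues for, the minimal sketch below bins continuous quality scores into discrete classes and evaluates a model with a classification metric instead of Pearson correlation. The bin edges and toy data are illustrative assumptions, not values from the paper.

```python
# Sketch: evaluate a QE model as a classifier over quality bands
# rather than as a regressor scored by Pearson correlation.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import f1_score

def to_classes(scores, edges=(0.3, 0.7)):
    """Map continuous quality scores in [0, 1] to {bad: 0, ok: 1, good: 2}."""
    return np.digitize(scores, edges)

human = np.array([0.10, 0.40, 0.80, 0.95, 0.55])  # human quality labels (toy)
model = np.array([0.20, 0.25, 0.70, 0.90, 0.60])  # model predictions (toy)

# Traditional regression view: Pearson correlation with human labels.
r, _ = pearsonr(model, human)

# Proposed classification view: agreement on discrete quality bands.
f1 = f1_score(to_classes(human), to_classes(model), average="macro")

print(f"Pearson r = {r:.3f}, macro-F1 = {f1:.3f}")
```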
Related papers
- Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that human annotators prefer SQC-Score over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Reassessing Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization [27.437077941786768]
Vision-and-language (V&L) models pretrained on large-scale multimodal data have demonstrated strong performance on various tasks.
We evaluate two pretrained V&L models under different settings by conducting cross-dataset evaluations.
We find that these models tend to learn to solve the benchmark, rather than learning the high-level skills required by the VQA task.
arXiv Detail & Related papers (2022-05-24T16:44:45Z)
- Translation Error Detection as Rationale Extraction [36.616561917049076]
We study the behaviour of state-of-the-art sentence-level QE models and show that explanations can indeed be used to detect translation errors.
We (i) introduce a novel semi-supervised method for word-level QE and (ii) propose to use the QE task as a new benchmark for evaluating the plausibility of feature attribution.
arXiv Detail & Related papers (2021-08-27T09:35:14Z)
- Knowledge Distillation for Quality Estimation [79.51452598302934]
Quality Estimation (QE) is the task of automatically predicting Machine Translation quality in the absence of reference translations.
Recent success in QE stems from the use of multilingual pre-trained representations, where very large models lead to impressive results.
We show that this approach, in combination with data augmentation, leads to lightweight QE models with 8x fewer parameters that perform competitively with distilled pre-trained representations (a minimal distillation sketch follows this list).
arXiv Detail & Related papers (2021-07-01T12:36:21Z)
- Learning to Perturb Word Embeddings for Out-of-distribution QA [55.103586220757464]
We propose a simple yet effective DA method based on a noise generator, which learns to perturb the word embedding of the input questions and context without changing their semantics.
We validate the performance of QA models trained with our word-embedding perturbation on a single source dataset, across five different target domains (a minimal sketch of the perturbation idea also follows this list).
Notably, the model trained with ours outperforms the model trained with more than 240K artificially generated QA pairs.
arXiv Detail & Related papers (2021-05-06T14:12:26Z)
- Study on the Assessment of the Quality of Experience of Streaming Video [117.44028458220427]
This paper studies the influence of various objective factors on the subjective estimation of the QoE of streaming video.
It presents standard and handcrafted features and reports their correlations with subjective scores together with significance p-values (a small correlation sketch follows this list).
The study uses the SQoE-III database, so far the largest and most realistic of its kind.
arXiv Detail & Related papers (2020-12-08T18:46:09Z)
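The knowledge-distillation entry above names a concrete training recipe. Below is a minimal PyTorch sketch of that general idea, assuming a small student regressor trained to match a large teacher's sentence-level scores; the architecture, dimensions, and data are placeholders, not the paper's actual setup.

```python
# Sketch: distill a large QE regressor into a small student that
# mimics the teacher's continuous quality scores.
import torch
import torch.nn as nn

class SmallQEHead(nn.Module):
    """Tiny illustrative student: sentence embedding -> quality score."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

student = SmallQEHead()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
mse = nn.MSELoss()

# Stand-ins for real data: sentence-pair embeddings and teacher scores.
embeddings = torch.randn(32, 128)        # hypothetical features
with torch.no_grad():
    teacher_scores = torch.rand(32)      # would come from the large QE model

optimizer.zero_grad()
pred = student(embeddings)
loss = mse(pred, teacher_scores)         # student regresses the teacher's scores
loss.backward()
optimizer.step()
```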
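For the word-embedding perturbation entry, a hedged sketch of the stated idea: a noise generator learns an input-dependent noise scale and perturbs token embeddings as data augmentation. The Gaussian form and tensor shapes are assumptions for illustration, not the paper's exact model.

```python
# Sketch: learned, input-dependent Gaussian perturbation of word embeddings.
import torch
import torch.nn as nn

class NoiseGenerator(nn.Module):
    """Predicts a per-token noise scale, then adds scaled Gaussian noise."""
    def __init__(self, dim=128):
        super().__init__()
        self.scale = nn.Linear(dim, dim)

    def forward(self, emb):
        sigma = torch.sigmoid(self.scale(emb))       # learned noise scale
        return emb + sigma * torch.randn_like(emb)   # perturbed embeddings

gen = NoiseGenerator()
tokens = torch.randn(4, 16, 128)   # (batch, seq_len, emb_dim), hypothetical
augmented = gen(tokens)            # fed to the QA model alongside the originals
```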
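For the streaming-QoE study, the kind of analysis it reports (feature correlations with subjective scores plus significance p-values) can be outlined as below; the feature names and values are made up for illustration, not taken from SQoE-III.

```python
# Sketch: Pearson correlation and p-values between objective features
# and subjective quality-of-experience scores.
import numpy as np
from scipy.stats import pearsonr

mos = np.array([4.1, 3.2, 2.5, 4.6, 3.8, 2.9])           # subjective scores (toy)
features = {
    "bitrate_mbps":  np.array([5.0, 2.0, 1.0, 6.0, 4.0, 1.5]),
    "stall_seconds": np.array([0.0, 2.0, 5.0, 0.0, 1.0, 4.0]),
}

for name, values in features.items():
    r, p = pearsonr(values, mos)
    print(f"{name:14s} r = {r:+.3f}  p = {p:.4f}")
```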