Evaluating MT Systems: A Theoretical Framework
- URL: http://arxiv.org/abs/2202.05806v1
- Date: Fri, 11 Feb 2022 18:05:17 GMT
- Title: Evaluating MT Systems: A Theoretical Framework
- Authors: Rajeev Sangal
- Abstract summary: This paper outlines a theoretical framework with which different automatic metrics can be designed for the evaluation of Machine Translation systems.
It introduces the concept of cognitive ease, which depends on adequacy and lack of fluency.
It can also be used to evaluate newer types of MT systems, such as speech-to-speech translation and discourse translation.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper outlines a theoretical framework using which different automatic
metrics can be designed for evaluation of Machine Translation systems. It
introduces the concept of cognitive ease which depends on adequacy
and lack of fluency. Thus, cognitive ease becomes the main parameter to
be measured rather than comprehensibility. The framework allows the components
of cognitive ease to be broken up and computed based on different linguistic
levels etc. Independence of dimensions and linearly combining them provides for
a highly modular approach.
The paper places the existing automatic methods in an overall framework, to
understand them better and to improve upon them in future. It can also be used
to evaluate the newer types of MT systems, such as speech to speech translation
and discourse translation.
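As a quick illustration of the modularity described in the abstract, here is a minimal sketch (Python, not taken from the paper) of how independent cognitive-ease components, each computed at some linguistic level, could be combined linearly. The component names, levels, and weights are illustrative assumptions, not the paper's definitions.

```python
# Minimal sketch (assumption, not the paper's implementation): cognitive ease
# modelled as a weighted linear combination of independent component scores,
# each produced by its own module at some linguistic level.
from dataclasses import dataclass

@dataclass
class ComponentScore:
    name: str      # e.g. "adequacy@lexical" or "fluency@syntactic" (hypothetical labels)
    score: float   # component score normalized to [0, 1]
    weight: float  # relative importance of this independent dimension

def cognitive_ease(components: list[ComponentScore]) -> float:
    """Weighted linear combination of independent component scores."""
    total_weight = sum(c.weight for c in components)
    if total_weight == 0:
        raise ValueError("at least one component needs a non-zero weight")
    return sum(c.score * c.weight for c in components) / total_weight

# Example with made-up numbers:
scores = [
    ComponentScore("adequacy@lexical", 0.82, 0.4),
    ComponentScore("adequacy@semantic", 0.74, 0.3),
    ComponentScore("fluency@syntactic", 0.91, 0.3),
]
print(f"cognitive ease = {cognitive_ease(scores):.3f}")  # prints 0.823
```

Because each component is computed independently, a new linguistic level or sub-metric can be added or swapped by registering one more component, which is the kind of modularity the framework argues for.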
Related papers
- IMTLab: An Open-Source Platform for Building, Evaluating, and Diagnosing Interactive Machine Translation Systems [94.39110258587887]
We present IMTLab, an open-source end-to-end interactive machine translation (IMT) system platform.
IMTLab treats the whole interactive translation process as a task-oriented dialogue with a human-in-the-loop setting.
arXiv Detail & Related papers (2023-10-17T11:29:04Z)
- Towards Best Practices of Activation Patching in Language Models: Metrics and Methods [9.121998462494533]
We examine the impact of methodological details in activation patching, including evaluation metrics and corruption methods.
Backed by empirical observations, we give conceptual arguments for why certain metrics or methods may be preferred.
arXiv Detail & Related papers (2023-09-27T21:53:56Z)
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- Towards Explainable Evaluation Metrics for Machine Translation [32.69015745456696]
We identify key properties as well as key goals of explainable machine translation metrics.
We discuss the latest state-of-the-art approaches to explainable metrics based on generative models such as ChatGPT and GPT-4.
arXiv Detail & Related papers (2023-06-22T17:07:57Z)
- A meta-probabilistic-programming language for bisimulation of probabilistic and non-well-founded type systems [0.0]
We introduce a formal meta-language for probabilistic programming, capable of expressing both programs and the type systems in which they are embedded.
We draw on the frameworks of cubical type theory and dependent typed metagraphs to formalize our approach.
arXiv Detail & Related papers (2022-03-30T01:07:37Z)
- When Does Translation Require Context? A Data-driven, Multilingual Exploration [71.43817945875433]
Proper handling of discourse significantly contributes to the quality of machine translation (MT).
Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation.
We develop the Multilingual Discourse-Aware benchmark, a series of taggers that identify and evaluate model performance on discourse phenomena.
arXiv Detail & Related papers (2021-09-15T17:29:30Z)
- Evaluating the Morphosyntactic Well-formedness of Generated Texts [88.20502652494521]
We propose L'AMBRE, a metric to evaluate the morphosyntactic well-formedness of text.
We show the effectiveness of our metric on the task of machine translation through a diachronic study of systems translating into morphologically rich languages.
arXiv Detail & Related papers (2021-03-30T18:02:58Z)
- On Learning Text Style Transfer with Direct Rewards [101.97136885111037]
Lack of parallel corpora makes it impossible to directly train supervised models for the text style transfer task.
We leverage semantic similarity metrics originally used for fine-tuning neural machine translation models.
Our model provides significant gains in both automatic and human evaluation over strong baselines.
arXiv Detail & Related papers (2020-10-24T04:30:02Z)
- Aspects of Terminological and Named Entity Knowledge within Rule-Based Machine Translation Models for Under-Resourced Neural Machine Translation Scenarios [3.413805964168321]
Rule-based machine translation is a machine translation paradigm where linguistic knowledge is encoded by an expert.
We describe different approaches to leverage the information contained in rule-based machine translation systems to improve a neural machine translation model.
Our results suggest that the proposed models have limited ability to learn from external information.
arXiv Detail & Related papers (2020-09-28T15:19:23Z)
- On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity.
This paper concerns itself with reference-free machine translation (MT) evaluation, where source texts are compared directly to (sometimes low-quality) system translations; a minimal sketch of this setup appears after this list.
We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER.
We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
arXiv Detail & Related papers (2020-05-03T22:10:23Z)
- Can Your Context-Aware MT System Pass the DiP Benchmark Tests?: Evaluation Benchmarks for Discourse Phenomena in Machine Translation [7.993547048820065]
We introduce the first-of-their-kind MT benchmark datasets that aim to track and hail improvements across four main discourse phenomena.
Surprisingly, we find that existing context-aware models do not improve discourse-related translations consistently across languages and phenomena.
arXiv Detail & Related papers (2020-04-30T07:15:36Z)
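As referenced in the entry on reference-free evaluation above, the general idea is to embed the source sentence and the system translation with a multilingual sentence encoder and score their similarity directly, with no reference translation. The sketch below is illustrative only: that paper studied M-BERT and LASER, whereas LaBSE is used here merely as a readily available stand-in encoder, and plain cosine similarity serves as the score.

```python
# Minimal sketch of reference-free MT evaluation (illustrative, not the paper's
# exact setup): compare source and hypothesis in a shared multilingual
# embedding space. LaBSE stands in for the M-BERT/LASER encoders studied there.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

def reference_free_score(source: str, hypothesis: str) -> float:
    """Cosine similarity between the source text and the MT output."""
    src, hyp = model.encode([source, hypothesis], convert_to_numpy=True)
    return float(np.dot(src, hyp) / (np.linalg.norm(src) * np.linalg.norm(hyp)))

print(reference_free_score("Das Wetter ist heute schön.", "The weather is nice today."))
```

Note that the finding summarized above is that off-the-shelf cross-lingual encoders perform poorly in this role, so a score like this is a baseline to improve on rather than a usable metric.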
This list is automatically generated from the titles and abstracts of the papers on this site.