PheMT: A Phenomenon-wise Dataset for Machine Translation Robustness on
User-Generated Contents
- URL: http://arxiv.org/abs/2011.02121v1
- Date: Wed, 4 Nov 2020 04:44:47 GMT
- Title: PheMT: A Phenomenon-wise Dataset for Machine Translation Robustness on
User-Generated Contents
- Authors: Ryo Fujii, Masato Mita, Kaori Abe, Kazuaki Hanawa, Makoto Morishita,
Jun Suzuki and Kentaro Inui
- Abstract summary: We present a new dataset, PheMT, for evaluating the robustness of MT systems against specific linguistic phenomena in Japanese-English translation.
Our experiments with the created dataset revealed that not only our in-house models but even widely used off-the-shelf systems are greatly disturbed by the presence of certain phenomena.
- Score: 40.25277134147149
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural Machine Translation (NMT) has shown drastic improvement in its quality
when translating clean input, such as text from the news domain. However,
existing studies suggest that NMT still struggles with certain kinds of input
with considerable noise, such as User-Generated Contents (UGC) on the Internet.
To make better use of NMT for cross-cultural communication, one of the most
promising directions is to develop a model that correctly handles these
expressions. Though its importance has been recognized, it is still not clear
as to what creates the great gap in performance between the translation of
clean input and that of UGC. To answer the question, we present a new dataset,
PheMT, for evaluating the robustness of MT systems against specific linguistic
phenomena in Japanese-English translation. Our experiments with the created
dataset revealed that not only our in-house models but even widely used
off-the-shelf systems are greatly disturbed by the presence of certain
phenomena.
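To make the evaluation idea concrete, here is a minimal sketch of a phenomenon-wise robustness check: translate each source with and without the target phenomenon and compare corpus BLEU. The `translate` stub, the toy English examples (PheMT itself targets Japanese-English), and the phenomenon labels are illustrative assumptions, not the released benchmark.

```python
import sacrebleu

def translate(sentences):
    # Placeholder MT system: substitute any real model or API here.
    return list(sentences)

# Toy phenomenon-wise data: (clean sources, noisy UGC variants, references).
dataset = {
    "emoji": (
        ["I loved the movie."],
        ["I loved the movie \U0001F62D"],
        ["I loved the movie."],
    ),
    "abbreviation": (
        ["Thanks for your message."],
        ["Thx for ur msg."],
        ["Thanks for your message."],
    ),
}

for phenomenon, (clean, noisy, refs) in dataset.items():
    bleu_clean = sacrebleu.corpus_bleu(translate(clean), [refs]).score
    bleu_noisy = sacrebleu.corpus_bleu(translate(noisy), [refs]).score
    print(f"{phenomenon:14s} clean={bleu_clean:5.1f} "
          f"noisy={bleu_noisy:5.1f} drop={bleu_clean - bleu_noisy:5.1f}")
```

The per-phenomenon BLEU drop is the kind of quantity that lets one attribute degradation to a specific source of noise rather than to "UGC" as a whole.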
Related papers
- Code-Switching with Word Senses for Pretraining in Neural Machine
Translation [107.23743153715799]
We introduce Word Sense Pretraining for Neural Machine Translation (WSP-NMT), an end-to-end approach for pretraining multilingual NMT models that leverages word sense-specific information from Knowledge Bases.
Our experiments show significant improvements in overall translation quality.
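The summary leaves the mechanics implicit; as a rough, hypothetical sketch of sense-driven code-switching, one can replace disambiguated source words with translations drawn from a sense-annotated bilingual lexicon (here invented for illustration; the paper's actual Knowledge Base pipeline may differ).

```python
import random

# Hypothetical sense-tagged lexicon: (word, sense id) -> translation.
# The entries and sense ids below are toy assumptions.
SENSE_LEXICON = {
    ("bank", "sense:finance"): "banque",
    ("bank", "sense:river"): "rive",
}

def code_switch(tokens, senses, p=0.3):
    """Randomly replace sense-disambiguated source words with their
    translations, yielding code-switched pretraining input."""
    out = []
    for tok, sense in zip(tokens, senses):
        tgt = SENSE_LEXICON.get((tok, sense))
        out.append(tgt if tgt is not None and random.random() < p else tok)
    return out

print(code_switch(["the", "bank", "was", "closed"],
                  [None, "sense:finance", None, None], p=1.0))
# -> ['the', 'banque', 'was', 'closed']
```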
arXiv Detail & Related papers (2023-10-21T16:13:01Z) - Towards Effective Disambiguation for Machine Translation with Large
Language Models [65.80775710657672]
We study the capabilities of large language models to translate "ambiguous sentences".
Experiments show that our methods can match or outperform state-of-the-art systems such as DeepL and NLLB in four out of five language directions.
arXiv Detail & Related papers (2023-09-20T22:22:52Z) - Towards Reliable Neural Machine Translation with Consistency-Aware
Meta-Learning [24.64700139151659]
Current neural machine translation (NMT) systems suffer from a lack of reliability.
We present a consistency-aware meta-learning (CAML) framework, derived from the model-agnostic meta-learning (MAML) algorithm, to address this problem.
We conduct experiments on the NIST Chinese to English task, three WMT translation tasks, and the TED M2O task.
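Since CAML is derived from MAML, a minimal first-order MAML loop may help fix ideas. This is the generic inner/outer-loop skeleton on a toy regression task, not the paper's consistency-aware objective; model, tasks, and hyperparameters are assumptions.

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
meta_opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()
inner_lr = 1e-2

def sample_task():
    # Toy regression task: a random linear mapping.
    w = torch.randn(4, 1)
    x_s, x_q = torch.randn(8, 4), torch.randn(8, 4)
    return (x_s, x_s @ w), (x_q, x_q @ w)

for step in range(100):
    (x_s, y_s), (x_q, y_q) = sample_task()
    learner = copy.deepcopy(model)  # task-specific copy
    # Inner loop: adapt the copy on the task's support set.
    inner_opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
    inner_opt.zero_grad()
    loss_fn(learner(x_s), y_s).backward()
    inner_opt.step()
    # Outer loop (first-order): evaluate on the query set and copy the
    # adapted model's gradients back onto the meta-parameters.
    learner.zero_grad()
    loss_fn(learner(x_q), y_q).backward()
    meta_opt.zero_grad()
    for p, g in zip(model.parameters(), learner.parameters()):
        p.grad = g.grad.clone()
    meta_opt.step()
```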
arXiv Detail & Related papers (2023-03-20T09:41:28Z) - When Does Translation Require Context? A Data-driven, Multilingual
Exploration [71.43817945875433]
Proper handling of discourse significantly contributes to the quality of machine translation (MT).
Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation.
We develop the Multilingual Discourse-Aware benchmark, a series of taggers that identify and evaluate model performance on discourse phenomena.
arXiv Detail & Related papers (2021-09-15T17:29:30Z) - Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
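The contrast between the two objective families can be sketched in a few lines; both noising functions below are toy illustrations, not the paper's exact corruption operators.

```python
import random

def mask_noise(tokens, p=0.35, mask="<mask>"):
    # MLM-style corruption: hide tokens and let the decoder reconstruct them.
    return [mask if random.random() < p else t for t in tokens]

def shuffle_noise(tokens, window=3):
    # Local reordering: permute tokens within a small window, keeping the
    # input a "real-looking" (unmasked) sentence the decoder must restore.
    out = list(tokens)
    for i in range(0, len(out), window):
        chunk = out[i:i + window]
        random.shuffle(chunk)
        out[i:i + window] = chunk
    return out

sent = "the quick brown fox jumps over the lazy dog".split()
print(mask_noise(sent))     # input contains artificial <mask> symbols
print(shuffle_noise(sent))  # input resembles a real (full) sentence
```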
arXiv Detail & Related papers (2021-06-10T10:18:23Z) - Better Neural Machine Translation by Extracting Linguistic Information
from BERT [4.353029347463806]
Adding linguistic information to neural machine translation (NMT) has mostly focused on using point estimates from pre-trained models.
We augment NMT by extracting dense fine-tuned vector-based linguistic information from BERT instead of using point estimates.
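A rough sketch of the dense-feature side of this idea, using the Hugging Face transformers API: extract per-token hidden states from BERT rather than discrete tag predictions. How the vectors are fused into the NMT encoder is paper-specific and omitted here.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
bert = AutoModel.from_pretrained("bert-base-cased")

sentence = "The old man the boats."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state  # (1, seq_len, 768)
# `hidden` holds one dense contextual vector per subword; these, rather
# than argmax labels from a tagger (point estimates), would be passed on
# to the translation model.
print(hidden.shape)
```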
arXiv Detail & Related papers (2021-04-07T00:03:51Z) - Sentence Boundary Augmentation For Neural Machine Translation Robustness [11.290581889247983]
We show that sentence boundary segmentation has the largest impact on quality, and we develop a simple data augmentation strategy to improve segmentation robustness.
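A toy rendering of boundary-perturbing augmentation: simulate segmentation errors by merging or splitting training sentences, so the model sees imperfect boundaries at training time. The paper's exact strategy may differ.

```python
import random

def augment_boundaries(sentences, p_merge=0.3, p_split=0.3):
    out, i = [], 0
    while i < len(sentences):
        s = sentences[i]
        if i + 1 < len(sentences) and random.random() < p_merge:
            out.append(s + " " + sentences[i + 1])  # missed boundary
            i += 2
            continue
        words = s.split()
        if len(words) > 3 and random.random() < p_split:
            cut = random.randrange(1, len(words))   # spurious boundary
            out.extend([" ".join(words[:cut]), " ".join(words[cut:])])
        else:
            out.append(s)
        i += 1
    return out

corpus = ["i saw her yesterday", "she was fine", "we talked for hours"]
print(augment_boundaries(corpus))
```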
arXiv Detail & Related papers (2020-10-21T16:44:48Z) - Assessing the Bilingual Knowledge Learned by Neural Machine Translation
Models [72.56058378313963]
We bridge the gap by assessing the bilingual knowledge learned by NMT models with phrase tables.
We find that NMT models learn patterns from simple to complex and distill essential bilingual knowledge from the training examples.
arXiv Detail & Related papers (2020-04-28T03:44:34Z)