Analyzing the Use of Character-Level Translation with Sparse and Noisy
Datasets
- URL: http://arxiv.org/abs/2109.13723v1
- Date: Mon, 27 Sep 2021 07:35:47 GMT
- Title: Analyzing the Use of Character-Level Translation with Sparse and Noisy
Datasets
- Authors: Jörg Tiedemann, Preslav Nakov
- Abstract summary: We find that character-level models cut the number of untranslated words by over 40% when applied to sparse and noisy datasets.
We explore the impact of character alignment, phrase table filtering, bitext size and the choice of pivot language on translation quality.
Neither word- nor character-BLEU correlates perfectly with human judgments, due to BLEU's sensitivity to length.
- Score: 20.50917929755389
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper provides an analysis of character-level machine translation models
used in pivot-based translation when applied to sparse and noisy datasets, such
as crowdsourced movie subtitles. In our experiments, we find that such
character-level models cut the number of untranslated words by over 40% and are
especially competitive (improvements of 2-3 BLEU points) in the case of limited
training data. We explore the impact of character alignment, phrase table
filtering, bitext size and the choice of pivot language on translation quality.
We further compare cascaded translation models to the use of synthetic training
data via multiple pivots, and we find that the latter works significantly
better. Finally, we demonstrate that neither word- nor character-BLEU correlates
perfectly with human judgments, due to BLEU's sensitivity to length.
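The length sensitivity mentioned above comes from BLEU's brevity penalty, which scales the n-gram precision score down whenever the hypothesis is shorter than the reference; moving from word tokens to character tokens changes both the n-gram statistics and the length ratio, so the two scores can rank outputs differently. The following minimal sketch only illustrates that mechanism (the sentence pair and the add-one smoothing are assumptions, not the paper's evaluation setup):

```python
# Minimal, self-contained BLEU sketch with an explicit brevity penalty,
# scored over word tokens and over character tokens for the same sentence pair.
# Illustration only; the hypothesis/reference strings are made up.

import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(hyp_tokens, ref_tokens, max_n=4):
    # Modified n-gram precisions with add-one smoothing so this toy example
    # does not collapse to zero when a higher-order n-gram is missing.
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp_tokens, n))
        ref_counts = Counter(ngrams(ref_tokens, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        precisions.append((overlap + 1) / (total + 1))
    # Brevity penalty: hypotheses shorter than the reference are penalised,
    # which is the length sensitivity discussed in the abstract.
    bp = 1.0 if len(hyp_tokens) >= len(ref_tokens) else math.exp(1 - len(ref_tokens) / len(hyp_tokens))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

hyp = "the movie was translated"
ref = "the movie subtitles were translated well"

word_bleu = bleu(hyp.split(), ref.split())
char_bleu = bleu(list(hyp), list(ref))  # character-level: every character (incl. spaces) is a token
print(f"word-BLEU ~ {word_bleu:.3f}, char-BLEU ~ {char_bleu:.3f}")
```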
Related papers
- Improving Language Models Trained on Translated Data with Continual Pre-Training and Dictionary Learning Analysis [3.16714407449467]
We investigate the role of translation and synthetic data in training language models.
We translate TinyStories, a dataset of 2.2M short stories for 3-4 year old children, from English to Arabic using the open NLLB-3B MT model.
To rectify issues found in the translated data, we pre-train the models with a small dataset of synthesized high-quality Arabic stories.
arXiv Detail & Related papers (2024-05-23T07:53:04Z)
- An Empirical Study on the Robustness of Massively Multilingual Neural Machine Translation [40.08063412966712]
Massively multilingual neural machine translation (MMNMT) has been proven to enhance the translation quality of low-resource languages.
We create a robustness evaluation benchmark dataset for Indonesian-Chinese translation.
This dataset is automatically translated into Chinese using four NLLB-200 models of different sizes.
arXiv Detail & Related papers (2024-05-13T12:01:54Z)
- Advancing Translation Preference Modeling with RLHF: A Step Towards Cost-Effective Solution [57.42593422091653]
We explore leveraging reinforcement learning with human feedback to improve translation quality.
A reward model with strong language capabilities can more sensitively learn the subtle differences in translation quality.
arXiv Detail & Related papers (2024-02-18T09:51:49Z)
- Character-level NMT and language similarity [1.90365714903665]
We explore the effectiveness of character-level neural machine translation for various levels of language similarity and size of the training dataset on translation between Czech and Croatian, German, Hungarian, Slovak, and Spanish.
We evaluate the models using automatic MT metrics and show that translation between similar languages benefits from character-level input segmentation.
We confirm previous findings that the gap can be closed by fine-tuning already trained subword-level models to the character level; a minimal character-segmentation sketch follows the entry link below.
arXiv Detail & Related papers (2023-08-08T17:01:42Z)
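As a concrete illustration of the character-level input segmentation referenced in the entry above, the sketch below splits a sentence into space-separated characters and marks word boundaries with a visible symbol so they can be restored after translation. The marker symbol, helper functions, and example sentence are illustrative assumptions, not the exact preprocessing used in the paper:

```python
# Illustrative character-level segmentation for NMT input (not the paper's
# exact preprocessing). Spaces are replaced by a visible marker so that word
# boundaries survive tokenisation and can be restored afterwards.

SPACE_MARK = "▁"  # assumed marker; any reserved symbol works

def to_char_level(sentence: str) -> str:
    """Split a sentence into space-separated characters, marking word boundaries."""
    return " ".join(SPACE_MARK if ch == " " else ch for ch in sentence)

def from_char_level(segmented: str) -> str:
    """Invert the segmentation: join characters and restore spaces."""
    return "".join(" " if tok == SPACE_MARK else tok for tok in segmented.split())

example = "prekrasan dan"             # hypothetical Croatian input
char_input = to_char_level(example)   # "p r e k r a s a n ▁ d a n"
assert from_char_level(char_input) == example
print(char_input)
```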
- HanoiT: Enhancing Context-aware Translation via Selective Context [95.93730812799798]
Context-aware neural machine translation aims to use the document-level context to improve translation quality.
Irrelevant or trivial words may introduce noise and distract the model from learning the relationship between the current sentence and the auxiliary context.
We propose a novel end-to-end encoder-decoder model with a layer-wise selection mechanism to sift and refine the long document context.
arXiv Detail & Related papers (2023-01-17T12:07:13Z)
- Separating Grains from the Chaff: Using Data Filtering to Improve Multilingual Translation for Low-Resourced African Languages [0.6947064688250465]
This work describes our approach, which is based on filtering the given noisy data using a sentence-pair classifier.
We empirically validate our approach by evaluating on two common datasets, showing that data filtering generally improves overall translation quality; a generic filtering sketch follows the entry link below.
arXiv Detail & Related papers (2022-10-19T16:12:27Z)
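The sketch below shows the general shape of classifier-based bitext filtering described in the entry above: score every sentence pair, keep only those above a threshold. The scoring function, threshold, and toy data are stand-ins, not the classifier trained in the paper:

```python
# Generic sketch of classifier-based bitext filtering (the scoring function,
# threshold, and toy pairs are assumptions, not the paper's setup).
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)

def filter_bitext(pairs: Iterable[Pair],
                  score_pair: Callable[[str, str], float],
                  threshold: float = 0.5) -> List[Pair]:
    """Keep only sentence pairs whose classifier score clears the threshold."""
    return [(src, tgt) for src, tgt in pairs if score_pair(src, tgt) >= threshold]

def toy_score(src: str, tgt: str) -> float:
    # Stand-in for a trained sentence-pair classifier: a crude length-ratio
    # heuristic so the example runs without a model.
    return min(len(src), len(tgt)) / max(len(src), len(tgt), 1)

noisy = [("Hello world.", "Bonjour le monde."),
         ("Hello world.", "404 page not found 404 page not found 404")]
print(filter_bitext(noisy, toy_score, threshold=0.4))  # keeps only the first pair
```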
- How Robust is Neural Machine Translation to Language Imbalance in Multilingual Tokenizer Training? [86.48323488619629]
We analyze how translation performance changes as the data ratios among languages vary in the tokenizer training corpus.
We find that while relatively better performance is often observed when languages are more equally sampled, downstream performance is more robust to language imbalance than usually expected; a sketch of temperature-based sampling ratios follows the entry link below.
arXiv Detail & Related papers (2022-04-29T17:50:36Z)
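One common way to control the data ratios among languages in tokenizer training is temperature-based sampling, sketched below. This is a generic illustration of how such ratios are set, not necessarily the exact scheme analysed in the entry above, and the corpus sizes are made up:

```python
# Temperature-based language sampling: a common way to set per-language data
# ratios for a multilingual tokenizer training corpus. Corpus sizes and the
# temperature values are hypothetical.

def sampling_ratios(corpus_sizes: dict, temperature: float = 0.3) -> dict:
    """Exponentiate raw data proportions by `temperature` and renormalise.

    temperature = 1.0 keeps the natural (imbalanced) ratios;
    temperature -> 0 pushes the distribution towards uniform sampling.
    """
    total = sum(corpus_sizes.values())
    weighted = {lang: (size / total) ** temperature for lang, size in corpus_sizes.items()}
    norm = sum(weighted.values())
    return {lang: w / norm for lang, w in weighted.items()}

sizes = {"en": 1_000_000, "id": 50_000, "zh": 200_000}  # hypothetical line counts
print(sampling_ratios(sizes, temperature=1.0))  # natural ratios, heavily skewed to en
print(sampling_ratios(sizes, temperature=0.3))  # flatter, closer to equal sampling
```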
- ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and the endangered language Cherokee.
It supports both statistical and neural translation models and provides quality estimation to inform users of translation reliability.
arXiv Detail & Related papers (2021-07-30T17:58:54Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- How to Probe Sentence Embeddings in Low-Resource Languages: On Structural Design Choices for Probing Task Evaluation [82.96358326053115]
We investigate sensitivity of probing task results to structural design choices.
We probe embeddings in a multilingual setup with design choices that lie in a 'stable region', as identified for English.
We find that results on English do not transfer to other languages.
arXiv Detail & Related papers (2020-06-16T12:37:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.