On the Difficulty of Translating Free-Order Case-Marking Languages
- URL: http://arxiv.org/abs/2107.06055v1
- Date: Tue, 13 Jul 2021 13:09:58 GMT
- Title: On the Difficulty of Translating Free-Order Case-Marking Languages
- Authors: Arianna Bisazza, Ahmet Üstün, Stephan Sportel
- Abstract summary: We investigate whether free-order case-marking languages are more difficult for state-of-the-art Neural Machine Translation (NMT) models to translate.
We find that word order flexibility in the source language only leads to a very small loss of NMT quality.
In medium- and low-resource settings, the overall NMT quality of fixed-order languages remains unmatched.
- Score: 2.9434930072968584
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Identifying factors that make certain languages harder to model than others
is essential to reach language equality in future Natural Language Processing
technologies. Free-order case-marking languages, such as Russian, Latin or
Tamil, have proved more challenging than fixed-order languages for the tasks of
syntactic parsing and subject-verb agreement prediction. In this work, we
investigate whether this class of languages is also more difficult to translate
by state-of-the-art Neural Machine Translation (NMT) models. Using a variety of
synthetic languages and a newly introduced translation challenge set, we find
that word order flexibility in the source language only leads to a very small
loss of NMT quality, even though the core verb arguments become impossible to
disambiguate in sentences without semantic cues. The latter issue is indeed
solved by the addition of case marking. However, in medium- and low-resource
settings, the overall NMT quality of fixed-order languages remains unmatched.
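The experimental design lends itself to a toy illustration. The sketch below is not the authors' grammar generator; the vocabulary and case suffixes are invented. It shows why free order without case marking makes the core verb arguments ambiguous, and why adding case markers restores the distinction:

```python
import random

# Toy sketch of the synthetic-language idea (not the paper's actual
# grammar generator): render an (agent, verb, patient) proposition in
# fixed SVO order or in a random order, with optional case suffixes.
NOM, ACC = "-ka", "-ta"  # invented nominative/accusative suffixes

def render(agent, verb, patient, free_order=False, case_marking=False):
    subject = agent + (NOM if case_marking else "")
    obj = patient + (ACC if case_marking else "")
    words = [subject, verb, obj]
    if free_order:
        random.shuffle(words)  # any permutation counts as grammatical
    return " ".join(words)

random.seed(0)
print(render("dog", "bites", "man"))                   # fixed order: SVO
print(render("dog", "bites", "man", free_order=True))  # who bites whom?
print(render("dog", "bites", "man", free_order=True, case_marking=True))
# with case marking, 'dog-ka ... man-ta' stays unambiguous in any order
```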
Related papers
- Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts.
We find that Llama Instruct and Mistral models exhibit high degrees of language confusion.
We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z)
- Extending Multilingual Machine Translation through Imitation Learning [60.15671816513614]
Imit-MNMT treats the task as an imitation learning process that mimics the behavior of an expert.
We show that our approach significantly improves the translation performance between the new and the original languages.
We also demonstrate that our approach is capable of solving copy and off-target problems.
arXiv Detail & Related papers (2023-11-14T21:04:03Z)
- Improving Cross-Lingual Transfer through Subtree-Aware Word Reordering [17.166996956587155]
One obstacle for effective cross-lingual transfer is variability in word-order patterns.
We present a new powerful reordering method, defined in terms of Universal Dependencies.
We show that our method consistently outperforms strong baselines over different language pairs and model architectures.
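The summary does not spell out the algorithm, but the general idea of subtree-aware reordering can be sketched: dependents move around their head as whole subtrees, following a target-language order. Everything below (the tree, the relation set, the ordering table) is an invented illustration, not the paper's rule-extraction procedure:

```python
from dataclasses import dataclass, field

# Minimal sketch of subtree-level reordering over a dependency tree.
# The cited method derives its rules from Universal Dependencies
# treebanks; this toy example hand-codes one ordering table.
@dataclass
class Node:
    word: str
    deprel: str                       # UD relation to the head
    children: list = field(default_factory=list)

# Hypothetical target order: these dependents precede their head
# (roughly, reordering SVO-like input toward an SOV-like language).
BEFORE_HEAD = {"nsubj", "obj", "obl", "det"}

def linearize(node):
    """Flatten the tree, moving whole dependent subtrees (never
    splitting them) before or after their head per BEFORE_HEAD."""
    words = []
    for child in node.children:
        if child.deprel in BEFORE_HEAD:
            words += linearize(child)
    words.append(node.word)
    for child in node.children:
        if child.deprel not in BEFORE_HEAD:
            words += linearize(child)
    return words

tree = Node("wrote", "root", [
    Node("Maria", "nsubj"),
    Node("letter", "obj", [Node("a", "det")]),
    Node("yesterday", "obl"),
])
print(" ".join(linearize(tree)))  # Maria a letter yesterday wrote
```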
arXiv Detail & Related papers (2023-10-20T15:25:53Z)
- CODET: A Benchmark for Contrastive Dialectal Evaluation of Machine Translation [31.18983138590214]
Neural machine translation (NMT) systems exhibit limited robustness in handling source-side linguistic variations.
CODET is a contrastive dialectal benchmark encompassing 891 different variations from twelve different languages.
We quantitatively demonstrate the challenges large MT models face in effectively translating dialectal variants.
arXiv Detail & Related papers (2023-05-26T21:24:00Z)
- On the Copying Problem of Unsupervised NMT: A Training Schedule with a Language Discriminator Loss [120.19360680963152]
Unsupervised neural machine translation (UNMT) has achieved success in many language pairs.
The copying problem, i.e., directly copying some parts of the input sentence as the translation, is common among distant language pairs.
We propose a simple but effective training schedule that incorporates a language discriminator loss.
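The summary names the key ingredient, a language discriminator loss, without the full schedule. A generic sketch of combining such a term with the translation loss follows; the dimensions, pooling, and weighting are assumptions, not the paper's recipe:

```python
import torch
import torch.nn as nn

# Generic sketch of adding a language-discriminator term to an NMT loss.
# The cited paper's actual training schedule (when and how this loss is
# applied during UNMT training) is not reproduced here.
HIDDEN, N_LANGS, BATCH, SEQ = 256, 2, 8, 12

discriminator = nn.Sequential(   # predicts the language of pooled states
    nn.Linear(HIDDEN, 64), nn.ReLU(), nn.Linear(64, N_LANGS))

def language_loss(states, lang_id):
    """Cross-entropy of the discriminator on mean-pooled hidden states."""
    pooled = states.mean(dim=1)                      # (batch, hidden)
    target = torch.full((states.size(0),), lang_id, dtype=torch.long)
    return nn.functional.cross_entropy(discriminator(pooled), target)

# Stand-ins for a real model's encoder output and translation loss:
encoder_states = torch.randn(BATCH, SEQ, HIDDEN)
translation_loss = torch.tensor(2.3)

LAM = 0.1  # hypothetical weight of the discriminator term
total = translation_loss + LAM * language_loss(encoder_states, lang_id=1)
print(total.item())
```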
arXiv Detail & Related papers (2023-05-26T18:14:23Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
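UROMAN is a standalone transliteration tool, so the snippet below uses the `unidecode` package purely as a stand-in to show what mapping text into Latin script looks like; it is not the paper's pipeline:

```python
# Rough illustration of romanization as a preprocessing step. The idea:
# map non-Latin scripts onto a shared Latin-script representation so one
# subword vocabulary can cover many languages. UROMAN itself is not
# called here; `unidecode` (pip install unidecode) stands in.
from unidecode import unidecode

for text in ["Москва", "Ελλάδα"]:
    print(text, "->", unidecode(text))  # e.g., 'Москва' -> 'Moskva'
```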
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- CLSE: Corpus of Linguistically Significant Entities [58.29901964387952]
We release a Corpus of Linguistically Significant Entities (CLSE) annotated by experts.
CLSE covers 74 different semantic types to support various applications from airline ticketing to video games.
We create a linguistically representative NLG evaluation benchmark in three languages: French, Marathi, and Russian.
arXiv Detail & Related papers (2022-11-04T12:56:12Z)
- How do lexical semantics affect translation? An empirical study [1.0152838128195467]
A distinguishing feature of natural language is that words are typically ordered according to the grammatical rules of a given language.
We investigate how word ordering and lexical similarity between the source and target languages affect translation performance.
arXiv Detail & Related papers (2021-12-31T23:28:28Z)
- Inducing Language-Agnostic Multilingual Representations [61.97381112847459]
Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world.
We examine three approaches for this: (i) re-aligning the vector spaces of target languages to a pivot source language; (ii) removing language-specific means and variances, which yields better discriminativeness of embeddings as a by-product; and (iii) increasing input similarity across languages by removing morphological contractions and sentence reordering.
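Of the three approaches, (ii) is the most directly expressible in code. A minimal NumPy sketch, using random stand-in embeddings rather than real model outputs:

```python
import numpy as np

# Sketch of approach (ii): remove each language's mean and variance from
# its embeddings so the remaining variation is (ideally) language-
# neutral. The embeddings below are random and purely illustrative.
rng = np.random.default_rng(0)
embeddings = {  # language -> (n_sentences, dim) matrix
    "en": rng.normal(loc=0.5, scale=2.0, size=(100, 8)),
    "ru": rng.normal(loc=-1.0, scale=0.5, size=(100, 8)),
}

def standardize(x, eps=1e-8):
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

neutral = {lang: standardize(x) for lang, x in embeddings.items()}
for lang, x in neutral.items():
    print(lang, round(x.mean(), 3), round(x.std(), 3))  # ~0.0, ~1.0
```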
arXiv Detail & Related papers (2020-08-20T17:58:56Z)
- Morphological Word Segmentation on Agglutinative Languages for Neural Machine Translation [8.87546236839959]
We propose a morphological word segmentation method on the source side for neural machine translation (NMT).
It incorporates morphology knowledge to preserve the linguistic and semantic information in the word structure while reducing the vocabulary size at training time.
It can be utilized as a preprocessing tool to segment the words in agglutinative languages for other natural language processing (NLP) tasks.
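As a concrete (and deliberately simplistic) picture of what segmenting agglutinative words buys, the sketch below greedily splits a Turkish-like word into stem and suffixes; the paper's actual segmenter is morphology-aware, not a hand-picked suffix list:

```python
# Toy sketch of source-side morphological segmentation. The suffix
# inventory is a tiny made-up example loosely based on Turkish plural
# and locative markers; real systems use morphological analysis.
SUFFIXES = ["lar", "ler", "da", "de"]

def segment(word):
    """Greedily strip known suffixes from the right, keeping a stem of
    at least two characters; return the stem followed by the suffixes."""
    morphs = []
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 2:
                morphs.insert(0, suffix)
                word = word[: -len(suffix)]
                stripped = True
                break
    return [word] + morphs

print(segment("evlerde"))  # ['ev', 'ler', 'de']: house-PL-LOC
```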
arXiv Detail & Related papers (2020-01-02T10:05:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.