Machine Translation by Projecting Text into the Same
Phonetic-Orthographic Space Using a Common Encoding
- URL: http://arxiv.org/abs/2305.12371v1
- Date: Sun, 21 May 2023 06:46:33 GMT
- Title: Machine Translation by Projecting Text into the Same
Phonetic-Orthographic Space Using a Common Encoding
- Authors: Amit Kumar, Shantipriya Parida, Ajay Pratap and Anil Kumar Singh
- Abstract summary: We propose an approach based on a common multilingual Latin-based encoding (WX notation) that takes advantage of language similarity.
We verify the proposed approach by demonstrating experiments on similar language pairs.
We also get up to 1 BLEU point improvement on distant and zero-shot language pairs.
- Score: 3.0422770070015295
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The use of subword embeddings has proved to be a major innovation
in Neural Machine Translation (NMT). It helps NMT learn better context vectors
for Low Resource Languages (LRLs), so that target words are predicted by
better modelling the morphologies of the two languages and the morphosyntax
transfer. Even so, their performance in the Indian-language-to-Indian-language
translation scenario is still not as good as that for resource-rich languages.
One reason for
this is the relative morphological richness of Indian languages, while another
is that most of them fall into the extremely low resource or zero-shot
categories. Since most major Indian languages use Indic or Brahmi origin
scripts, the text written in them is highly phonetic in nature and phonetically
similar in terms of abstract letters and their arrangements. We use these
characteristics of Indian languages and their scripts to propose an approach
based on a common multilingual Latin-based encoding (WX notation) that takes
advantage of language similarity while addressing the morphological complexity
issue in NMT. This Latin-based encoding, together with Byte Pair Encoding
(BPE), allows us to better exploit phonetic and orthographic as well as
lexical similarities to improve the translation quality
by projecting different but similar languages on the same orthographic-phonetic
character space. We verify the proposed approach by demonstrating experiments
on similar language pairs (Gujarati-Hindi, Marathi-Hindi, Nepali-Hindi,
Maithili-Hindi, Punjabi-Hindi, and Urdu-Hindi) under low resource conditions.
The proposed approach shows an improvement in a majority of cases, in one case
as much as ~10 BLEU points compared to baseline techniques for similar language
pairs. We also get up to ~1 BLEU point improvement on distant and zero-shot
language pairs.
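To make the projection idea concrete, here is a minimal, illustrative sketch; it is not the authors' implementation, and the table below is a small, simplified subset of WX notation that ignores inherent vowels, conjuncts, and nasalization.

```python
# Illustrative sketch of projecting Devanagari text into a common
# Latin-based (WX-style) character space before subword segmentation.
WX_MAP = {
    # consonants (subset)
    "क": "k", "ख": "K", "ग": "g", "न": "n", "प": "p",
    "म": "m", "र": "r", "ल": "l", "स": "s", "ह": "h",
    # independent vowels and vowel signs (subset)
    "अ": "a", "आ": "A", "इ": "i", "ई": "I", "उ": "u",
    "ा": "A", "ि": "i", "ी": "I", "ु": "u", "े": "e",
}

def to_wx(text: str) -> str:
    """Map each character through the table; pass unknown characters through."""
    return "".join(WX_MAP.get(ch, ch) for ch in text)

# Words from different Brahmi-origin scripts land in the same Latin
# character space, so a shared BPE vocabulary can learn subwords
# common to related languages.
print(to_wx("मीरा"))  # -> mIrA
```

A real pipeline would transliterate both source and target sides with a full WX converter and only then learn BPE merges over the shared Latin alphabet.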
Related papers
- Script-Agnostic Language Identification [21.19710835737713]
Many modern languages, such as Konkani, Kashmiri, Punjabi etc., are synchronically written in several scripts.
We propose learning script-agnostic representations using several different experimental strategies.
We find that word-level script randomization and exposure to a language written in multiple scripts is extremely valuable for downstream script-agnostic language identification.
arXiv Detail & Related papers (2024-06-25T19:23:42Z) - MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling [70.34758460372629]
We introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages.
MYTE produces shorter encodings for all 99 analyzed languages.
This, in turn, improves multilingual LM performance and diminishes the perplexity gap throughout diverse languages.
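As a rough illustration of the disparity MYTE targets (this is not MYTE's own algorithm), plain UTF-8 already spends three bytes per Devanagari character versus one per Latin letter, so byte-level models see much longer sequences for some languages:

```python
# UTF-8 byte cost per script: one byte per ASCII Latin letter, three
# bytes per Devanagari character. Consistent-size encodings aim to
# shrink this gap across languages.
samples = {"English": "water", "Hindi": "पानी"}
for lang, word in samples.items():
    print(f"{lang}: {len(word)} chars -> {len(word.encode('utf-8'))} UTF-8 bytes")
# English: 5 chars -> 5 UTF-8 bytes
# Hindi: 4 chars -> 12 UTF-8 bytes
```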
arXiv Detail & Related papers (2024-03-15T21:21:11Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Investigating Lexical Sharing in Multilingual Machine Translation for
Indian Languages [8.858671209228536]
We investigate lexical sharing in multilingual machine translation from Hindi, Gujarati, Nepali into English.
We find that transliteration does not give pronounced improvements.
Our analysis suggests that our multilingual MT models trained on original scripts seem to already be robust to cross-script differences.
arXiv Detail & Related papers (2023-05-04T23:35:15Z) - Romanization-based Large-scale Adaptation of Multilingual Language
Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
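A minimal sketch of one ingredient of romanization, diacritic stripping, can be written with the standard library; this is far simpler than uroman, which also transliterates non-Latin scripts:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Decompose to NFKD and drop combining marks (a naive romanization step)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("Việt Nam"))  # -> Viet Nam
```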
arXiv Detail & Related papers (2023-04-18T09:58:34Z) - Linking Emergent and Natural Languages via Corpus Transfer [98.98724497178247]
We propose a novel way to establish a link by corpus transfer between emergent languages and natural languages.
Our approach showcases non-trivial transfer benefits for two different tasks -- language modeling and image captioning.
We also introduce a novel metric to predict the transferability of an emergent language by translating emergent messages to natural language captions grounded on the same images.
arXiv Detail & Related papers (2022-03-24T21:24:54Z) - How do lexical semantics affect translation? An empirical study [1.0152838128195467]
A distinguishing factor of natural language is that words are typically ordered according to the rules of the grammar of a given language.
We investigate how the word ordering of and lexical similarity between the source and target language affect translation performance.
arXiv Detail & Related papers (2021-12-31T23:28:28Z) - Harnessing Cross-lingual Features to Improve Cognate Detection for
Low-resource Languages [50.82410844837726]
We demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages.
We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages.
We observe an improvement of up to 18% points, in terms of F-score, for cognate detection.
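A toy sketch (not the paper's method) of how cross-lingual word embeddings can flag cognate candidates: score word pairs by cosine similarity and apply a threshold. The vectors below are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical aligned cross-lingual embeddings for the Hindi and
# Marathi words for 'water', which are cognates.
hi_vec = [0.80, 0.10, 0.20]
mr_vec = [0.79, 0.12, 0.18]
is_cognate_candidate = cosine(hi_vec, mr_vec) > 0.9
print(is_cognate_candidate)  # -> True
```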
arXiv Detail & Related papers (2021-12-16T11:17:58Z) - Role of Language Relatedness in Multilingual Fine-tuning of Language
Models: A Case Study in Indo-Aryan Languages [34.79533646549939]
We explore the impact of leveraging the relatedness of languages that belong to the same family in NLP models using multilingual fine-tuning.
Low resource languages, such as Oriya and Punjabi, are found to be the largest beneficiaries of multilingual fine-tuning.
arXiv Detail & Related papers (2021-09-22T06:37:39Z) - Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z) - Neural Machine Translation System of Indic Languages -- An Attention
based Approach [0.5139874302398955]
In India, almost all the languages originated from their ancestral language, Sanskrit.
In this paper, we have presented the neural machine translation system (NMT) that can efficiently translate Indic languages like Hindi and Gujarati.
arXiv Detail & Related papers (2020-02-02T07:15:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.