Building the Language Resource for a Cebuano-Filipino Neural Machine
Translation System
- URL: http://arxiv.org/abs/2110.15716v1
- Date: Tue, 5 Oct 2021 23:03:09 GMT
- Authors: Kristine Mae Adlaon and Nelson Marcos
- Abstract summary: We present the efforts made to build a parallel corpus for Cebuano and Filipino from two different domains: biblical texts and the web.
For the biblical resource, subword unit translation for verbs and a copy-able approach for nouns were applied to correct inconsistencies in the translation.
For Wikipedia, commonly occurring topic segments were extracted from both the source and the target languages.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A parallel corpus is a critical resource in machine-learning-based
translation. The task of collecting, extracting, and aligning texts in order to
build an acceptable corpus for translation is very tedious, especially for
low-resource languages. In this paper, we present the efforts made to build a
parallel corpus for Cebuano and Filipino from two different domains: biblical
texts and the web. For the biblical resource, subword unit translation for
verbs and a copy-able approach for nouns were applied to correct
inconsistencies in the translation. This correction mechanism was applied as a
preprocessing technique. For Wikipedia, the main web resource, commonly
occurring topic segments were extracted from both the source and the target
languages. The observed topic segments fall into 4 distinct categories, and
their identification may be used for the automatic extraction of sentences. A
Recurrent Neural Network was used to implement the translation with the
OpenNMT sequence modeling tool in TensorFlow. The two corpora were then
evaluated by using each as a separate input to the neural network. Results show
a difference in BLEU scores between the two corpora.
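The BLEU comparison at the end of the abstract can be illustrated with a minimal sketch of clipped n-gram precision, the core statistic behind BLEU (omitting the brevity penalty and the geometric mean over n-gram orders). The Filipino reference sentence and both hypothetical model outputs below are invented for illustration and do not come from the paper's corpora.

```python
from collections import Counter

def ngram_precision(hyp, ref, n):
    """Clipped n-gram precision of a tokenized hypothesis against one reference."""
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_ngrams[g]) for g, count in hyp_ngrams.items())
    total = sum(hyp_ngrams.values())
    return overlap / total if total else 0.0

# Hypothetical outputs from models trained on the biblical vs. web corpus.
ref = "ang bata ay kumakain ng kanin".split()
hyp_biblical = "ang bata ay kumakain ng tinapay".split()
hyp_web = "bata kumakain kanin".split()

for name, hyp in [("biblical", hyp_biblical), ("web", hyp_web)]:
    p1 = ngram_precision(hyp, ref, 1)
    p2 = ngram_precision(hyp, ref, 2)
    print(f"{name}: 1-gram precision {p1:.2f}, 2-gram precision {p2:.2f}")
```

A full evaluation would use a standard implementation (e.g. sacreBLEU) over the held-out test set of each corpus; the point here is only that the same reference can score two systems very differently at different n-gram orders.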
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts [50.00305136008848]
We propose a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera.
For both English--Japanese and English--Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets.
This study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits.
arXiv Detail & Related papers (2023-11-07T03:50:25Z)
- Sinhala-English Parallel Word Dictionary Dataset [0.554780083433538]
We introduce three parallel English-Sinhala word dictionaries (En-Si-dict-large, En-Si-dict-filtered, En-Si-dict-FastText) which help in multilingual Natural Language Processing (NLP) tasks related to English and Sinhala languages.
arXiv Detail & Related papers (2023-08-04T10:21:35Z)
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
- Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation [49.916963624249355]
A UNMT model is trained on pseudo-parallel data with a translated source, yet must translate natural source sentences at inference time.
This source discrepancy between training and inference hinders the translation performance of UNMT models.
We propose an online self-training approach, which simultaneously uses the pseudo-parallel data (natural source, translated target) to mimic the inference scenario.
arXiv Detail & Related papers (2022-03-16T04:50:27Z)
- Sentence Alignment with Parallel Documents Helps Biomedical Machine Translation [0.5430741734728369]
This work presents a new unsupervised sentence alignment method and explores features in training biomedical neural machine translation (NMT) systems.
We use a simple but effective way to build bilingual word embeddings to evaluate bilingual word similarity.
The proposed method achieved high accuracy in both 1-to-1 and many-to-many cases.
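The bilingual word-similarity step described above can be sketched as a cosine comparison of word vectors in a shared embedding space. The three-dimensional toy vectors below are invented for illustration; they are not the embeddings built in that paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length real vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy bilingual embeddings in a shared space (values invented for illustration).
emb = {
    ("en", "gene"):    [0.90, 0.10, 0.20],
    ("zh", "基因"):    [0.88, 0.12, 0.18],
    ("en", "protein"): [0.10, 0.90, 0.30],
}

print(cosine(emb[("en", "gene")], emb[("zh", "基因")]))     # near-synonyms: high
print(cosine(emb[("en", "gene")], emb[("en", "protein")]))  # unrelated words: lower
```

In a real alignment pipeline, sentence pairs would be scored by aggregating such word-level similarities across candidate sentence pairs.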
arXiv Detail & Related papers (2021-04-17T16:09:30Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Neural Simultaneous Speech Translation Using Alignment-Based Chunking [4.224809458327515]
In simultaneous machine translation, the objective is to determine when to produce a partial translation given a continuous stream of source words.
We propose a neural machine translation (NMT) model that makes dynamic decisions on whether to continue consuming input words or to generate output words.
Our results on the IWSLT 2020 English-to-German task outperform a wait-k baseline by 2.6 to 3.7% BLEU absolute.
arXiv Detail & Related papers (2020-05-29T10:20:48Z)
- Investigating Language Impact in Bilingual Approaches for Computational Language Documentation [28.838960956506018]
This paper investigates how the choice of translation language affects the posterior documentation work.
We create 56 bilingual pairs that we apply to the task of low-resource unsupervised word segmentation and alignment.
Our results suggest that incorporating clues into the neural models' input representation increases their translation and alignment quality.
arXiv Detail & Related papers (2020-03-30T10:30:34Z)
- Learning Contextualized Sentence Representations for Document-Level Neural Machine Translation [59.191079800436114]
Document-level machine translation incorporates inter-sentential dependencies into the translation of a source sentence.
We propose a new framework to model cross-sentence dependencies by training neural machine translation (NMT) to predict both the target translation and surrounding sentences of a source sentence.
arXiv Detail & Related papers (2020-03-30T03:38:01Z)
- Urdu-English Machine Transliteration using Neural Networks [0.0]
We present a transliteration technique based on Expectation Maximization (EM) that is unsupervised and language independent.
The system learns patterns and out-of-vocabulary words from a parallel corpus, so there is no need to train it explicitly on a transliteration corpus.
arXiv Detail & Related papers (2020-01-12T17:30:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.