Towards Computational Linguistics in Minangkabau Language: Studies on
Sentiment Analysis and Machine Translation
- URL: http://arxiv.org/abs/2009.09309v1
- Date: Sat, 19 Sep 2020 22:13:27 GMT
- Authors: Fajri Koto, Ikhwan Koto
- Abstract summary: We release two Minangkabau corpora, for sentiment analysis
and machine translation, harvested and constructed from Twitter and Wikipedia.
We conduct the first computational linguistic study of the Minangkabau
language, employing classic machine learning and sequence-to-sequence models
such as LSTM and Transformer.
- Score: 5.381004207943597
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although some linguists (Rusmali et al., 1985; Crouch, 2009) have
attempted to define the morphology and syntax of Minangkabau, information
processing in this language is still absent due to the scarcity of annotated
resources. In this work, we release two Minangkabau corpora, for sentiment
analysis and machine translation, harvested and constructed from Twitter and
Wikipedia. We conduct the first computational linguistic study of the
Minangkabau language, employing classic machine learning and
sequence-to-sequence models such as LSTM and Transformer. Our first
experiments show that classification performance over Minangkabau text drops
significantly when tested with a model trained on Indonesian. In the machine
translation experiment, by contrast, a simple word-to-word translation using a
bilingual dictionary outperforms the LSTM and Transformer models in terms of
BLEU score.
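The word-to-word baseline described above can be sketched in a few lines: each source token is looked up in a bilingual dictionary, with unknown tokens copied through unchanged. The dictionary entries and the scoring function below are illustrative, not taken from the released corpus; the score shown is clipped unigram precision, the 1-gram component of BLEU, rather than full BLEU.

```python
# Hypothetical sketch of a word-to-word MT baseline with a bilingual
# dictionary, plus a simplified unigram-precision score.
from collections import Counter

def word_to_word_translate(sentence, bilingual_dict):
    """Translate token by token; out-of-dictionary tokens are copied as-is."""
    return [bilingual_dict.get(tok, tok) for tok in sentence.split()]

def unigram_precision(hypothesis, reference):
    """Clipped unigram precision (the 1-gram component of BLEU)."""
    hyp, ref = Counter(hypothesis), Counter(reference)
    overlap = sum(min(count, ref[word]) for word, count in hyp.items())
    return overlap / max(1, sum(hyp.values()))

# Toy Minangkabau-to-Indonesian entries, for illustration only.
min2id = {"ambo": "saya", "indak": "tidak", "tau": "tahu"}
hyp = word_to_word_translate("ambo indak tau", min2id)
ref = "saya tidak tahu".split()
print(hyp, unigram_precision(hyp, ref))
```

Such a baseline has no reordering or fluency model, which makes its BLEU advantage over LSTM and Transformer models on this dataset a strong signal of data scarcity for the neural systems.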
Related papers
- LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation [67.24113079928668]
We present LexMatcher, a method for data curation driven by the coverage of senses found in bilingual dictionaries.
Our approach outperforms the established baselines on the WMT2022 test sets.
arXiv Detail & Related papers (2024-06-03T15:30:36Z)
- Character-level NMT and language similarity [1.90365714903665]
We explore the effectiveness of character-level neural machine translation for various levels of language similarity and size of the training dataset on translation between Czech and Croatian, German, Hungarian, Slovak, and Spanish.
We evaluate the models using automatic MT metrics and show that translation between similar languages benefits from character-level input segmentation.
We confirm previous findings that it is possible to close the gap by finetuning the already trained subword-level models to character-level.
arXiv Detail & Related papers (2023-08-08T17:01:42Z)
- Decomposed Prompting for Machine Translation Between Related Languages using Large Language Models [55.35106713257871]
We introduce DecoMT, a novel approach of few-shot prompting that decomposes the translation process into a sequence of word chunk translations.
We show that DecoMT outperforms the strong few-shot prompting BLOOM model with an average improvement of 8 chrF++ scores across the examined languages.
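The decomposition idea can be illustrated with a toy sketch: split the source sentence into fixed-size word chunks, translate each chunk independently, and join the results. Here a dictionary stub stands in for the few-shot LLM prompt used in the actual paper; the chunk size, the stub, and all entries are hypothetical.

```python
# Toy sketch of decomposed translation: chunk the source, translate each
# chunk independently, then concatenate. A dict stub replaces the LLM.
def chunk(words, size):
    """Split a word list into consecutive chunks of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]

def translate_chunk(chunk_words, stub_model):
    # In DecoMT this step would be a few-shot prompt to an LLM.
    return [stub_model.get(w, w) for w in chunk_words]

def decomposed_translate(sentence, stub_model, size=3):
    out = []
    for c in chunk(sentence.split(), size):
        out.extend(translate_chunk(c, stub_model))
    return " ".join(out)

stub = {"good": "bueno", "morning": "día", "friend": "amigo"}
print(decomposed_translate("good morning my friend", stub, size=2))
```

The real method additionally fuses overlapping chunk translations for coherence; this sketch only shows the chunk-then-translate structure.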
arXiv Detail & Related papers (2023-05-22T14:52:47Z)
- Domain-Specific Text Generation for Machine Translation [7.803471587734353]
We propose a novel approach to domain adaptation leveraging state-of-the-art pretrained language models (LMs) for domain-specific data augmentation.
We employ mixed fine-tuning to train models that significantly improve translation of in-domain texts.
arXiv Detail & Related papers (2022-08-11T16:22:16Z)
- Modeling Target-Side Morphology in Neural Machine Translation: A Comparison of Strategies [72.56158036639707]
Morphologically rich languages pose difficulties to machine translation.
A large number of differently inflected word surface forms entails a larger vocabulary.
Some inflected forms of infrequent terms typically do not appear in the training corpus.
Linguistic agreement requires the system to correctly match the grammatical categories between inflected word forms in the output sentence.
arXiv Detail & Related papers (2022-03-25T10:13:20Z)
- ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and Cherokee, an endangered language.
It supports both statistical and neural translation models and provides quality estimation to inform users of reliability.
arXiv Detail & Related papers (2021-07-30T17:58:54Z)
- Domain Adaptation of NMT models for English-Hindi Machine Translation Task at AdapMT ICON 2020 [2.572404739180802]
This paper describes the neural machine translation systems for the English-Hindi language presented in AdapMT Shared Task ICON 2020.
Our team was ranked first in the chemistry and general domain En-Hi translation task and second in the AI domain En-Hi translation task.
arXiv Detail & Related papers (2020-12-22T15:46:40Z)
- Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions in a diverse setting, including low-, medium-, and rich-resource languages, as well as transfer to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
- HausaMT v1.0: Towards English-Hausa Neural Machine Translation [0.012691047660244334]
We build a baseline model for English-Hausa machine translation.
The Hausa language is the second largest Afro-Asiatic language in the world after Arabic.
arXiv Detail & Related papers (2020-06-09T02:08:03Z)
- It's Easier to Translate out of English than into it: Measuring Neural Translation Difficulty by Cross-Mutual Information [90.35685796083563]
Cross-mutual information (XMI) is an asymmetric information-theoretic metric of machine translation difficulty.
XMI exploits the probabilistic nature of most neural machine translation models.
We present the first systematic and controlled study of cross-lingual translation difficulties using modern neural translation systems.
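XMI can be sketched as the difference between two per-token cross-entropies: that of an unconditional target-language model and that of the translation model conditioned on the source. The log-probability values below are invented for illustration; only the structure of the computation follows the metric's definition.

```python
# Hedged sketch of cross-mutual information (XMI) from token log-probs.
import math

def cross_entropy(logprobs):
    """Average negative log-probability in nats per token."""
    return -sum(logprobs) / len(logprobs)

def xmi(lm_token_logprobs, mt_token_logprobs):
    """XMI(X -> Y) ~= H_qLM(Y) - H_qMT(Y | X): how much easier the
    target side becomes once the source sentence is known."""
    return cross_entropy(lm_token_logprobs) - cross_entropy(mt_token_logprobs)

# Hypothetical per-token log-probs: the MT model assigns higher
# probability to the reference tokens than the unconditional LM does.
lm_lp = [math.log(0.05), math.log(0.10), math.log(0.08)]
mt_lp = [math.log(0.40), math.log(0.55), math.log(0.30)]
print(round(xmi(lm_lp, mt_lp), 3))  # positive: the source is informative
```

The asymmetry of the metric follows directly: swapping translation direction changes both the conditioning and the target-side language model, so XMI(X→Y) need not equal XMI(Y→X).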
arXiv Detail & Related papers (2020-05-05T17:38:48Z)
- Urdu-English Machine Transliteration using Neural Networks [0.0]
We present a transliteration technique based on Expectation Maximization (EM) that is unsupervised and language-independent.
The system learns patterns and out-of-vocabulary words from a parallel corpus, with no need for explicit training on a transliteration corpus.
arXiv Detail & Related papers (2020-01-12T17:30:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.