Leveraging Closed-Access Multilingual Embedding for Automatic Sentence
Alignment in Low Resource Languages
- URL: http://arxiv.org/abs/2311.12179v1
- Date: Mon, 20 Nov 2023 20:48:25 GMT
- Title: Leveraging Closed-Access Multilingual Embedding for Automatic Sentence
Alignment in Low Resource Languages
- Authors: Idris Abdulmumin and Auwal Abubakar Khalid and Shamsuddeen Hassan
Muhammad and Ibrahim Said Ahmad and Lukman Jibril Aliyu and Babangida Sani
and Bala Mairiga Abduljalil and Sani Ahmad Hassan
- Abstract summary: We present a simple but effective parallel sentence aligner that leverages the closed-access Cohere multilingual embeddings.
The proposed approach achieved F1 scores of $94.96$ and $54.83$ on FLORES and MAFAND-MT, compared to LASER's $3.64$ and $0.64$, respectively.
Our method also achieved an improvement of more than 5 BLEU points over LASER when the resulting datasets were combined with MAFAND-MT to train translation models.
- Score: 2.4023321876826462
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The importance of high-quality parallel data in machine translation has long
been established, but such data has remained very difficult to obtain in
sufficient quantity for the majority of the world's languages, mainly because
of the associated cost and the limited accessibility of these languages.
Although parallel datasets can be obtained from online articles using
automatic approaches, forensic investigations have found many quality-related
issues in them, such as misalignment and wrong language codes. In this work,
we present a simple but effective parallel sentence aligner that leverages the
closed-access Cohere multilingual embeddings, a solution that ranked second in
the recently concluded #CoHereAIHack 2023 Challenge (see
https://ai6lagos.devpost.com). The proposed approach achieved F1 scores of
$94.96$ and $54.83$ on FLORES and MAFAND-MT, compared to LASER's $3.64$ and
$0.64$, respectively. Our method also achieved an improvement of more than 5
BLEU points over LASER when the resulting datasets were combined with the
MAFAND-MT dataset to train translation models. Our code and data are available
for research purposes at https://github.com/abumafrim/Cohere-Align.
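As described, the core of the approach is to embed source- and target-language sentences with Cohere's multilingual model and pair each source sentence with its most similar target. The sketch below illustrates that idea only; the model name, the similarity threshold, and the greedy nearest-neighbour matching are illustrative assumptions, not the authors' exact pipeline (their actual implementation is in the linked repository).

```python
# Minimal sketch of embedding-based sentence alignment. Assumes the Cohere
# Python SDK and the embed-multilingual-v2.0 model; the threshold and greedy
# nearest-neighbour matching are hypothetical choices for illustration.
import cohere
import numpy as np

co = cohere.Client("YOUR_API_KEY")  # closed-access: a Cohere API key is required

def embed(sentences):
    """Embed a batch of sentences with Cohere's multilingual model."""
    response = co.embed(texts=sentences, model="embed-multilingual-v2.0")
    return np.asarray(response.embeddings)

def align(src_sents, tgt_sents, threshold=0.7):
    """Greedily pair each source sentence with its most similar target,
    keeping only pairs above a cosine-similarity threshold."""
    src, tgt = embed(src_sents), embed(tgt_sents)
    # Normalise rows so the dot product equals cosine similarity.
    src /= np.linalg.norm(src, axis=1, keepdims=True)
    tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
    sim = src @ tgt.T  # (len(src), len(tgt)) similarity matrix
    pairs = []
    for i, row in enumerate(sim):
        j = int(row.argmax())
        if row[j] >= threshold:
            pairs.append((src_sents[i], tgt_sents[j], float(row[j])))
    return pairs

# Example: align English sentences against Hausa candidates.
print(align(["Good morning."], ["Barka da safiya.", "Ina kwana?"]))
```

A common refinement over a raw cosine threshold is margin-based scoring, which compares each candidate's similarity against the average similarity of its nearest neighbours, as popularised by LASER-based bitext mining.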
Related papers
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, i.e., be effectively crosslingual?
This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages [44.85501254683431]
Question Answering datasets are scarce for languages other than English due to the cost and difficulties of collection and manual annotation.
We propose SynDARin, a method for generating and validating QA datasets for low-resource languages.
arXiv Detail & Related papers (2024-06-20T15:49:28Z)
- How do Large Language Models Handle Multilingualism? [81.15060972112563]
This study explores how large language models (LLMs) handle multilingualism.
LLMs initially understand the query, converting multilingual inputs into English for task-solving.
In the intermediate layers, they employ English for thinking and incorporate multilingual knowledge with self-attention and feed-forward structures.
arXiv Detail & Related papers (2024-02-29T02:55:26Z)
- Question Translation Training for Better Multilingual Reasoning [108.10066378240879]
Large language models show compelling performance on reasoning tasks, but they tend to perform much worse in languages other than English.
A typical solution is to translate instruction data into all languages of interest, and then train on the resulting multilingual data, which is called translate-training.
In this paper we explore the benefits of question alignment, where we train the model to translate reasoning questions into English by finetuning on X-English parallel question data.
arXiv Detail & Related papers (2024-01-15T16:39:10Z)
- LAMPAT: Low-Rank Adaption for Multilingual Paraphrasing Using Adversarial Training [19.173992333194683]
Paraphrases are texts that convey the same meaning while using different words or sentence structures.
Previous studies have leveraged the knowledge from the machine translation field, forming a paraphrase through zero-shot machine translation in the same language.
We propose LAMPAT, the first unsupervised multilingual paraphrasing model, for which a monolingual dataset is sufficient to generate human-like and diverse sentences.
arXiv Detail & Related papers (2024-01-09T04:19:16Z)
- GlotLID: Language Identification for Low-Resource Languages [51.38634652914054]
GlotLID-M is an LID model that satisfies the desiderata of wide coverage, reliability and efficiency.
It identifies 1665 languages, a large increase in coverage compared to prior work.
arXiv Detail & Related papers (2023-10-24T23:45:57Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Sinhala-English Parallel Word Dictionary Dataset [0.554780083433538]
We introduce three parallel English-Sinhala word dictionaries (En-Si-dict-large, En-Si-dict-filtered, En-Si-dict-FastText) which help in multilingual Natural Language Processing (NLP) tasks related to English and Sinhala languages.
arXiv Detail & Related papers (2023-08-04T10:21:35Z)
- Tencent's Multilingual Machine Translation System for WMT22 Large-Scale African Languages [47.06332023467713]
This paper describes Tencent's multilingual machine translation systems for the WMT22 shared task on Large-Scale Machine Translation Evaluation for African Languages.
We adopt data augmentation, distributionally robust optimization, and language family grouping to develop our multilingual neural machine translation (MNMT) models.
arXiv Detail & Related papers (2022-10-18T07:22:29Z)
- Majority Voting with Bidirectional Pre-translation For Bitext Retrieval [2.580271290008534]
A popular approach has been to mine so-called "pseudo-parallel" sentences from paired documents in two languages.
In this paper, we outline some problems with current methods, propose computationally economical solutions to those problems, and demonstrate success with novel methods.
We make the code and data used for our experiments publicly available.
arXiv Detail & Related papers (2021-03-10T22:24:01Z)