KazParC: Kazakh Parallel Corpus for Machine Translation
- URL: http://arxiv.org/abs/2403.19399v3
- Date: Tue, 9 Apr 2024 20:58:41 GMT
- Title: KazParC: Kazakh Parallel Corpus for Machine Translation
- Authors: Rustem Yeshpanov, Alina Polonskaya, Huseyin Atakan Varol
- Abstract summary: We introduce KazParC, a parallel corpus designed for machine translation across Kazakh, English, Russian, and Turkish.
Our research efforts also extend to the development of a neural machine translation model nicknamed Tilmash.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce KazParC, a parallel corpus designed for machine translation across Kazakh, English, Russian, and Turkish. The first and largest publicly available corpus of its kind, KazParC contains a collection of 371,902 parallel sentences covering different domains and developed with the assistance of human translators. Our research efforts also extend to the development of a neural machine translation model nicknamed Tilmash. Remarkably, the performance of Tilmash is on par with, and in certain instances, surpasses that of industry giants, such as Google Translate and Yandex Translate, as measured by standard evaluation metrics, such as BLEU and chrF. Both KazParC and Tilmash are openly available for download under the Creative Commons Attribution 4.0 International License (CC BY 4.0) through our GitHub repository.
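The abstract reports that Tilmash is compared against Google Translate and Yandex Translate using BLEU and chrF. As a rough illustration of what chrF measures, here is a minimal, hypothetical character n-gram F-score in pure Python (single reference, uniform n-gram weights); it is a sketch of the idea only, and the official sacrebleu implementation should be used for real evaluation:

```python
# Simplified chrF-style character n-gram F-score (hypothetical sketch,
# not the official chrF tool): average F-beta over n-gram orders 1..max_n.
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces as chrF commonly does."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Average character n-gram F-beta score over orders 1..max_n, scaled to 0-100."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # strings too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        prec = overlap / sum(hyp.values())
        rec = overlap / sum(ref.values())
        if prec + rec == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(scores) / len(scores) if scores else 0.0

print(round(chrf("the cat sat", "the cat sat"), 1))  # identical strings score 100.0
```

Recall is weighted twice as heavily as precision (beta = 2), matching the standard chrF setting.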
Related papers
- DIETA: A Decoder-only transformer-based model for Italian-English machine TrAnslation [74.85762984118024]
DIETA is a small, decoder-only Transformer model with 0.5 billion parameters.
We collect and curate a large parallel corpus consisting of approximately 207 million Italian-English sentence pairs.
We release a new small-scale evaluation set, consisting of 450 sentences, based on 2025 WikiNews articles.
arXiv Detail & Related papers (2026-01-25T13:08:43Z)
- Low-resource Machine Translation for Code-switched Kazakh-Russian Language Pair [4.445432761373431]
We propose a method to build a machine translation model for code-switched Kazakh-Russian language pair with no labeled data.
We present the first code-switching Kazakh-Russian parallel corpus and the evaluation results.
arXiv Detail & Related papers (2025-03-25T18:46:30Z)
- Creating Domain-Specific Translation Memories for Machine Translation Fine-tuning: The TRENCARD Bilingual Cardiology Corpus [0.0]
The article introduces a semi-automatic TM preparation methodology leveraging primarily translation tools used by translators.
The resulting corpus called TRENCARD Corpus has approximately 800,000 source words and 50,000 sentences.
arXiv Detail & Related papers (2024-09-04T12:48:30Z)
- Decoupled Vocabulary Learning Enables Zero-Shot Translation from Unseen Languages [55.157295899188476]
Neural machine translation systems learn to map sentences of different languages into a common representation space.
In this work, we test this hypothesis by zero-shot translating from unseen languages.
We demonstrate that this setup enables zero-shot translation from entirely unseen languages.
arXiv Detail & Related papers (2024-08-05T07:58:58Z)
- A Japanese-Chinese Parallel Corpus Using Crowdsourcing for Web Mining [20.18032411452028]
We created a Japanese-Chinese parallel corpus of 4.6M sentence pairs from bilingual websites.
We used a Japanese-Chinese bilingual dictionary of 160K word pairs for document and sentence alignment.
We compared the translation accuracy of the model trained on these 4.6M sentence pairs with that of the model trained on Japanese-Chinese sentence pairs from CCMatrix (12.4M), a parallel corpus from global web mining.
arXiv Detail & Related papers (2024-05-15T00:54:40Z)
- Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts [50.00305136008848]
We propose a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera.
For both English--Japanese and English--Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets.
This study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits.
arXiv Detail & Related papers (2023-11-07T03:50:25Z)
- Decomposed Prompting for Machine Translation Between Related Languages using Large Language Models [55.35106713257871]
We introduce DecoMT, a novel approach of few-shot prompting that decomposes the translation process into a sequence of word chunk translations.
We show that DecoMT outperforms the strong few-shot prompting BLOOM model with an average improvement of 8 chrF++ scores across the examined languages.
arXiv Detail & Related papers (2023-05-22T14:52:47Z)
- A Bilingual Parallel Corpus with Discourse Annotations [82.07304301996562]
This paper describes BWB, a large parallel corpus first introduced in Jiang et al. (2022), along with an annotated test set.
The BWB corpus consists of Chinese novels translated by experts into English, and the annotated test set is designed to probe the ability of machine translation systems to model various discourse phenomena.
arXiv Detail & Related papers (2022-10-26T12:33:53Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Leveraging Multilingual News Websites for Building a Kurdish Parallel Corpus [0.6445605125467573]
We present a corpus containing 12,327 translation pairs in the two major dialects of Kurdish, Sorani and Kurmanji.
We also provide 1,797 and 650 translation pairs in English-Kurmanji and English-Sorani, respectively.
arXiv Detail & Related papers (2020-10-04T11:52:50Z)
- Lite Training Strategies for Portuguese-English and English-Portuguese Translation [67.4894325619275]
We investigate the use of pre-trained models, such as T5, for Portuguese-English and English-Portuguese translation tasks.
We propose an adaptation of the English tokenizer to represent Portuguese characters, such as the diaeresis and the acute and grave accents.
Our results show that our models have a competitive performance to state-of-the-art models while being trained on modest hardware.
arXiv Detail & Related papers (2020-08-20T04:31:03Z)
- Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation [37.04364877980479]
We show how to mine a parallel corpus from publicly available lectures at Coursera.
Our approach determines sentence alignments, relying on machine translation and cosine similarity over continuous-space sentence representations.
For Japanese--English lectures translation, we extracted parallel data of approximately 40,000 lines and created development and test sets.
arXiv Detail & Related papers (2019-12-26T01:12:31Z)
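The Coursera lecture-mining entries above align sentences by machine-translating one side and then scoring cosine similarity over sentence representations. The sketch below illustrates that idea with hypothetical data, using plain bag-of-words count vectors in place of learned continuous-space embeddings; the sentences, threshold, and greedy pairing are all illustrative assumptions, not the papers' actual method:

```python
# Hypothetical sketch of cosine-similarity sentence alignment.
# Real systems compare continuous-space sentence embeddings; here
# a bag-of-words Counter stands in so the example is self-contained.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def align(src_sents, tgt_sents, threshold=0.5):
    """Greedily pair each source sentence with its most similar target,
    keeping only pairs whose similarity clears the threshold."""
    pairs = []
    for s in src_sents:
        vs = Counter(s.lower().split())
        best = max(tgt_sents, key=lambda t: cosine(vs, Counter(t.lower().split())))
        if cosine(vs, Counter(best.lower().split())) >= threshold:
            pairs.append((s, best))
    return pairs

# Illustrative data: the target side is imagined as already
# machine-translated into English, as in the mining pipeline.
src = ["neural networks learn representations", "thank you for watching"]
tgt = ["networks learn useful representations", "thank you for watching the video"]
print(align(src, tgt))
```

In the actual pipelines, the threshold and a filtering stage control the noise level of the mined corpus before fine-tuning.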
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.