JParaCrawl v3.0: A Large-scale English-Japanese Parallel Corpus
- URL: http://arxiv.org/abs/2202.12607v2
- Date: Mon, 28 Feb 2022 06:21:03 GMT
- Title: JParaCrawl v3.0: A Large-scale English-Japanese Parallel Corpus
- Authors: Makoto Morishita, Katsuki Chousa, Jun Suzuki, Masaaki Nagata
- Abstract summary: This paper creates a large parallel corpus for English-Japanese, a language pair for which only limited resources are available.
It introduces a new web-based English-Japanese parallel corpus named JParaCrawl v3.0.
Our new corpus contains more than 21 million unique parallel sentence pairs, which is more than twice as many as the previous JParaCrawl v2.0 corpus.
- Score: 31.203776611871863
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most current machine translation models are mainly trained with parallel
corpora, and their translation accuracy largely depends on the quality and
quantity of the corpora. Although there are billions of parallel sentences for
a few language pairs, effectively dealing with most language pairs is difficult
due to a lack of publicly available parallel corpora. This paper creates a
large parallel corpus for English-Japanese, a language pair for which only
limited resources are available, compared to such resource-rich languages as
English-German. It introduces a new web-based English-Japanese parallel corpus
named JParaCrawl v3.0. Our new corpus contains more than 21 million unique
parallel sentence pairs, which is more than twice as many as the previous
JParaCrawl v2.0 corpus. Through experiments, we empirically show how our new
corpus boosts the accuracy of machine translation models on various domains.
The JParaCrawl v3.0 corpus will eventually be publicly available online for
research purposes.
Related papers
- A Japanese-Chinese Parallel Corpus Using Crowdsourcing for Web Mining [20.18032411452028]
We created a Japanese-Chinese parallel corpus of 4.6M sentence pairs from bilingual websites.
We used a Japanese-Chinese bilingual dictionary of 160K word pairs for document and sentence alignment.
We compared the translation accuracy of the model trained on these 4.6M sentence pairs with that of the model trained on Japanese-Chinese sentence pairs from CCMatrix (12.4M), a parallel corpus from global web mining.
arXiv Detail & Related papers (2024-05-15T00:54:40Z)
- Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts [50.00305136008848]
We propose a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera.
For both English--Japanese and English--Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets.
This study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits.
arXiv Detail & Related papers (2023-11-07T03:50:25Z)
- Textless Speech-to-Speech Translation With Limited Parallel Data [51.3588490789084]
PFB is a framework for training textless S2ST models that require just dozens of hours of parallel speech data.
We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains.
arXiv Detail & Related papers (2023-05-24T17:59:05Z)
- A Bilingual Parallel Corpus with Discourse Annotations [82.07304301996562]
This paper describes BWB, a large parallel corpus first introduced in Jiang et al. (2022), along with an annotated test set.
The BWB corpus consists of Chinese novels translated by experts into English, and the annotated test set is designed to probe the ability of machine translation systems to model various discourse phenomena.
arXiv Detail & Related papers (2022-10-26T12:33:53Z)
- Language Agnostic Multilingual Information Retrieval with Contrastive Learning [59.26316111760971]
We present an effective method to train multilingual information retrieval systems.
We leverage parallel and non-parallel corpora to improve the pretrained multilingual language models.
Our model can work well even with a small number of parallel sentences.
arXiv Detail & Related papers (2022-10-12T23:53:50Z)
- EAG: Extract and Generate Multi-way Aligned Corpus for Complete Multi-lingual Neural Machine Translation [63.88541605363555]
"Extract and Generate" (EAG) is a two-step approach for constructing a large-scale, high-quality multi-way aligned corpus from bilingual data.
We first extract candidate aligned examples by pairing the bilingual examples from different language pairs with highly similar source or target sentences.
We then generate the final aligned examples from the candidates with a well-trained generation model.
arXiv Detail & Related papers (2022-03-04T08:21:27Z)
- Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining [38.10950540247151]
We propose a novel unsupervised method to derive multilingual sentence embeddings relying only on monolingual data.
We first produce a synthetic parallel corpus using unsupervised machine translation, and use it to fine-tune a pretrained cross-lingual masked language model (XLM).
The quality of the representations is evaluated on two parallel corpus mining tasks with improvements of up to 22 F1 points over vanilla XLM.
arXiv Detail & Related papers (2021-05-21T15:39:16Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- PMIndia -- A Collection of Parallel Corpora of Languages of India [10.434922903332415]
We describe a new publicly available corpus (PMIndia) consisting of parallel sentences which pair 13 major languages of India with English.
The corpus includes up to 56000 sentences for each language pair.
We explain how the corpus was constructed, including an assessment of two different automatic sentence alignment methods, and present some initial NMT results on the corpus.
arXiv Detail & Related papers (2020-01-27T16:51:39Z)
- Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation [37.04364877980479]
We show how to mine a parallel corpus from publicly available lectures at Coursera.
Our approach determines sentence alignments, relying on machine translation and cosine similarity over continuous-space sentence representations.
For Japanese--English lecture translation, we extracted parallel data of approximately 40,000 lines and created development and test sets.
arXiv Detail & Related papers (2019-12-26T01:12:31Z)