Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine
Translation of Lecture Transcripts
- URL: http://arxiv.org/abs/2311.03696v1
- Date: Tue, 7 Nov 2023 03:50:25 GMT
- Authors: Haiyue Song, Raj Dabre, Chenhui Chu, Atsushi Fujita and Sadao
Kurohashi
- Abstract summary: We propose a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera.
For both English--Japanese and English--Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets.
This study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Lecture transcript translation helps learners understand online courses;
however, building a high-quality lecture machine translation system is hindered
by the lack of publicly available parallel corpora. To address this, we examine a framework
for parallel corpus mining, which provides a quick and effective way to mine a
parallel corpus from publicly available lectures on Coursera. To create the
parallel corpora, we propose a dynamic programming based sentence alignment
algorithm which leverages the cosine similarity of machine-translated
sentences. The sentence alignment F1 score reaches 96%, higher than
that obtained with BERTScore, LASER, or sentBERT. For both English--Japanese and
English--Chinese lecture translations, we extracted parallel corpora of
approximately 50,000 lines and created development and test sets through manual
filtering for benchmarking translation performance. Through machine translation
experiments, we show that the mined corpora enhance the quality of lecture
transcript translation when used in conjunction with out-of-domain parallel
corpora via multistage fine-tuning. Furthermore, this study suggests
guidelines for gathering and cleaning corpora, mining parallel sentences,
cleaning noise in the mined data, and creating high-quality evaluation splits.
For the sake of reproducibility, we have released the corpora as well as the
code to create them. The dataset is available at
https://github.com/shyyhs/CourseraParallelCorpusMining.
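The dynamic programming alignment described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' released code: it assumes sentence embeddings have already been computed (e.g., for machine-translated source sentences and for target sentences), restricts alignment beads to 1-1, 1-2, and 2-1, and omits insertion/deletion beads for brevity; the names `cosine` and `align_dp` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def align_dp(src_vecs, tgt_vecs):
    """Monotonic sentence alignment by dynamic programming.

    Maximizes the cumulative cosine similarity of aligned beads,
    considering 1-1, 1-2, and 2-1 beads (spans are averaged).
    Returns a list of (src_indices, tgt_indices) pairs.
    """
    n, m = len(src_vecs), len(tgt_vecs)

    def mean(vs):
        return [sum(col) / len(vs) for col in zip(*vs)]

    NEG = float("-inf")
    score = [[NEG] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if score[i][j] == NEG:
                continue
            # Try each bead shape: (source span length, target span length).
            for di, dj in ((1, 1), (1, 2), (2, 1)):
                if i + di <= n and j + dj <= m:
                    s = cosine(mean(src_vecs[i:i + di]),
                               mean(tgt_vecs[j:j + dj]))
                    if score[i][j] + s > score[i + di][j + dj]:
                        score[i + di][j + dj] = score[i][j] + s
                        back[i + di][j + dj] = (di, dj)
    # Backtrace from the end of both documents.
    beads, i, j = [], n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    return list(reversed(beads))
```

In practice, one side would first be machine-translated into the other language so that both sides can be embedded in a shared space before computing similarities, as the abstract describes.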
Related papers
- Dual-Alignment Pre-training for Cross-lingual Sentence Embedding [79.98111074307657]
We propose a dual-alignment pre-training (DAP) framework for cross-lingual sentence embedding.
We introduce a novel representation translation learning (RTL) task, where the model learns to use one-side contextualized token representation to reconstruct its translation counterpart.
Our approach can significantly improve sentence embedding.
arXiv Detail & Related papers (2023-05-16T03:53:30Z)
- A Bilingual Parallel Corpus with Discourse Annotations [82.07304301996562]
This paper describes BWB, a large parallel corpus first introduced in Jiang et al. (2022), along with an annotated test set.
The BWB corpus consists of Chinese novels translated by experts into English, and the annotated test set is designed to probe the ability of machine translation systems to model various discourse phenomena.
arXiv Detail & Related papers (2022-10-26T12:33:53Z)
- Word Alignment by Fine-tuning Embeddings on Parallel Corpora [96.28608163701055]
Word alignment over parallel corpora has a wide variety of applications, including learning translation lexicons, cross-lingual transfer of language processing tools, and automatic evaluation or analysis of translation outputs.
Recently, other work has demonstrated that pre-trained contextualized word embeddings derived from multilingually trained language models (LMs) prove an attractive alternative, achieving competitive results on the word alignment task even in the absence of explicit training on parallel data.
In this paper, we examine methods to marry the two approaches: leveraging pre-trained LMs and fine-tuning them on parallel text with objectives designed to improve alignment quality.
arXiv Detail & Related papers (2021-01-20T17:54:47Z)
- A Corpus for English-Japanese Multimodal Neural Machine Translation with Comparable Sentences [21.43163704217968]
We propose a new multimodal English-Japanese corpus with comparable sentences that are compiled from existing image captioning datasets.
Due to low translation scores in our baseline experiments, we believe that current multimodal NMT models are not designed to effectively utilize comparable sentence data.
arXiv Detail & Related papers (2020-10-17T06:12:25Z)
- Unsupervised Parallel Corpus Mining on Web Data [53.74427402568838]
We present a pipeline to mine the parallel corpus from the Internet in an unsupervised manner.
Our system produces new state-of-the-art results, 39.81 and 38.95 BLEU scores, even compared with supervised approaches.
arXiv Detail & Related papers (2020-09-18T02:38:01Z)
- Preparation of Sentiment tagged Parallel Corpus and Testing its effect on Machine Translation [12.447116722795899]
The paper discusses the preparation of a sentiment-tagged English-Bengali parallel corpus in which both sides carry the same sentiment tags.
The output of the translation model has been compared with a base-line translation model using automated metrics such as BLEU and TER.
arXiv Detail & Related papers (2020-07-28T09:04:47Z)
- Parallel Corpus Filtering via Pre-trained Language Models [14.689457985200141]
Web-crawled data provides a good source of parallel corpora for training machine translation models.
Recent work shows that neural machine translation systems are more sensitive to noise than traditional statistical machine translation methods.
We propose a novel approach to filter out noisy sentence pairs from web-crawled corpora via pre-trained language models.
arXiv Detail & Related papers (2020-05-13T06:06:23Z)
- Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation [37.04364877980479]
We show how to mine a parallel corpus from publicly available lectures at Coursera.
Our approach determines sentence alignments, relying on machine translation and cosine similarity over continuous-space sentence representations.
For Japanese--English lectures translation, we extracted parallel data of approximately 40,000 lines and created development and test sets.
arXiv Detail & Related papers (2019-12-26T01:12:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.