Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation
- URL: http://arxiv.org/abs/1912.11739v2
- Date: Tue, 14 Jan 2020 03:16:24 GMT
- Title: Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation
- Authors: Haiyue Song, Raj Dabre, Atsushi Fujita, Sadao Kurohashi
- Abstract summary: We show how to mine a parallel corpus from publicly available lectures at Coursera.
Our approach determines sentence alignments, relying on machine translation and cosine similarity over continuous-space sentence representations.
For Japanese--English lectures translation, we extracted parallel data of approximately 40,000 lines and created development and test sets.
- Score: 37.04364877980479
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Lectures translation is a case of spoken language translation, and there is a lack of publicly available parallel corpora for this purpose. To address this, we examine a language-independent framework for parallel corpus mining, which is a quick and effective way to mine a parallel corpus from publicly available lectures at Coursera. Our approach determines sentence alignments by relying on machine translation and cosine similarity over continuous-space sentence representations. We also show how to use the resulting corpora in multistage fine-tuning-based domain adaptation for high-quality lectures translation. For Japanese--English lectures translation, we extracted parallel data of approximately 40,000 lines and created development and test sets through manual filtering for benchmarking translation performance. We demonstrate that the mined corpus greatly enhances the quality of translation when used in conjunction with out-of-domain parallel corpora via multistage training. This paper also suggests some guidelines to gather and clean corpora, mine parallel sentences, address noise in the mined data, and create high-quality evaluation splits. For the sake of reproducibility, we will release our code for parallel data creation.
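The alignment step described in the abstract can be illustrated compactly. Below is a minimal sketch, assuming a sentence-transformers multilingual encoder and a hypothetical translate() helper standing in for any off-the-shelf MT system; it illustrates the described idea and is not the authors' released code.

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Illustrative multilingual encoder; not necessarily the one used in the paper.
    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    def alignment_scores(ja_sentences, en_sentences, translate):
        """Score every Japanese--English candidate sentence pair.

        `translate` is a hypothetical callable mapping a Japanese sentence
        to an English machine translation; any MT system can be plugged in.
        """
        mt_en = [translate(s) for s in ja_sentences]
        emb_mt = encoder.encode(mt_en, normalize_embeddings=True)
        emb_en = encoder.encode(en_sentences, normalize_embeddings=True)
        # With unit-normalized embeddings, the dot product equals cosine similarity.
        return np.asarray(emb_mt) @ np.asarray(emb_en).T

Each cell of the returned matrix scores one candidate pair; thresholding or dynamic-programming alignment over the matrix yields the mined parallel sentences, which then serve as the in-domain data in the final stage of multistage fine-tuning.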
Related papers
- Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts [50.00305136008848]
We propose a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera.
For both English--Japanese and English--Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets.
This study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits.
arXiv Detail & Related papers (2023-11-07T03:50:25Z)
- Dual-Alignment Pre-training for Cross-lingual Sentence Embedding [79.98111074307657]
We propose a dual-alignment pre-training (DAP) framework for cross-lingual sentence embedding.
We introduce a novel representation translation learning (RTL) task, where the model learns to use one-side contextualized token representation to reconstruct its translation counterpart.
Our approach can significantly improve sentence embedding quality.
arXiv Detail & Related papers (2023-05-16T03:53:30Z)
- Language Agnostic Multilingual Information Retrieval with Contrastive Learning [59.26316111760971]
We present an effective method to train multilingual information retrieval systems.
We leverage parallel and non-parallel corpora to improve the pretrained multilingual language models.
Our model can work well even with a small number of parallel sentences (a sketch of the contrastive objective follows this entry).
arXiv Detail & Related papers (2022-10-12T23:53:50Z)
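As a rough illustration of the contrastive objective commonly used in such systems, the sketch below computes a symmetric InfoNCE loss over a batch of parallel sentence pairs, treating the translation as the positive and all other in-batch sentences as negatives; the temperature value and the use of cosine similarity are illustrative assumptions, not details confirmed by the paper.

    import torch
    import torch.nn.functional as F

    def contrastive_loss(src_emb: torch.Tensor, tgt_emb: torch.Tensor,
                         temperature: float = 0.05) -> torch.Tensor:
        """Symmetric InfoNCE over (batch, dim) embeddings of aligned pairs;
        row i of src_emb is parallel to row i of tgt_emb."""
        src = F.normalize(src_emb, dim=-1)
        tgt = F.normalize(tgt_emb, dim=-1)
        logits = src @ tgt.T / temperature                      # pairwise cosine similarities
        labels = torch.arange(src.size(0), device=src.device)   # positives on the diagonal
        # Retrieval in both directions: target-given-source and source-given-target.
        return (F.cross_entropy(logits, labels) +
                F.cross_entropy(logits.T, labels)) / 2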
- Word Alignment by Fine-tuning Embeddings on Parallel Corpora [96.28608163701055]
Word alignment over parallel corpora has a wide variety of applications, including learning translation lexicons, cross-lingual transfer of language processing tools, and automatic evaluation or analysis of translation outputs.
Recently, other work has demonstrated that pre-trained contextualized word embeddings derived from multilingually trained language models (LMs) prove an attractive alternative, achieving competitive results on the word alignment task even in the absence of explicit training on parallel data.
In this paper, we examine methods to marry the two approaches: leveraging pre-trained LMs but fine-tuning them on parallel text with objectives designed to improve alignment quality, and proposing methods to effectively extract alignments from these fine-tuned models (a simplified extraction sketch follows this entry).
arXiv Detail & Related papers (2021-01-20T17:54:47Z)
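To show how alignments can be read off contextual embeddings, here is a simplified, SimAlign-style mutual-argmax sketch over cosine similarities of mBERT word vectors; the paper goes further by fine-tuning the LM on parallel text, which this sketch omits.

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Illustrative multilingual encoder; the paper fine-tunes such LMs on parallel text.
    tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModel.from_pretrained("bert-base-multilingual-cased").eval()

    @torch.no_grad()
    def embed(words):
        """One unit-normalized vector per word (first subword's hidden state)."""
        enc = tok(words, is_split_into_words=True, return_tensors="pt")
        hidden = model(**enc).last_hidden_state[0]
        first_sub = [enc.word_ids().index(i) for i in range(len(words))]
        return torch.nn.functional.normalize(hidden[first_sub], dim=-1)

    def align(src_words, tgt_words):
        """Keep (i, j) only when i and j pick each other (mutual argmax)."""
        sim = embed(src_words) @ embed(tgt_words).T
        fwd = sim.argmax(dim=1)  # best target index for each source word
        bwd = sim.argmax(dim=0)  # best source index for each target word
        return [(i, j.item()) for i, j in enumerate(fwd) if bwd[j] == i]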
- Context-aware Decoder for Neural Machine Translation using a Target-side Document-Level Language Model [12.543106304662059]
We present a method to turn a sentence-level translation model into a context-aware model by incorporating a document-level language model into the decoder.
Our decoder requires only sentence-level parallel corpora and monolingual corpora.
From a theoretical viewpoint, the core of this work is a novel representation of contextual information using pointwise mutual information (PMI) between the context and the current sentence (formalized after this entry).
arXiv Detail & Related papers (2020-10-24T08:06:18Z)
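The PMI quantity mentioned above admits a standard formulation; the following is a hedged reconstruction from the summary, where y is the current target sentence, c the preceding target-side context, and P_LM a document-level language model (the paper's exact scoring function may differ in detail):

    \mathrm{PMI}(c, y) = \log \frac{P_{\mathrm{LM}}(y \mid c)}{P_{\mathrm{LM}}(y)}
                       = \log P_{\mathrm{LM}}(y \mid c) - \log P_{\mathrm{LM}}(y)

Adding this term to the sentence-level translation score rewards hypotheses that the document context makes more probable, while leaving context-neutral hypotheses unaffected.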
- A Corpus for English-Japanese Multimodal Neural Machine Translation with Comparable Sentences [21.43163704217968]
We propose a new multimodal English-Japanese corpus with comparable sentences that are compiled from existing image captioning datasets.
Due to low translation scores in our baseline experiments, we believe that current multimodal NMT models are not designed to effectively utilize comparable sentence data.
arXiv Detail & Related papers (2020-10-17T06:12:25Z)
- Unsupervised Parallel Corpus Mining on Web Data [53.74427402568838]
We present a pipeline to mine the parallel corpus from the Internet in an unsupervised manner.
Our system produces new state-of-the-art results, 39.81 and 38.95 BLEU, even when compared with supervised approaches.
arXiv Detail & Related papers (2020-09-18T02:38:01Z)
- Parallel Corpus Filtering via Pre-trained Language Models [14.689457985200141]
Web-crawled data provides a good source of parallel corpora for training machine translation models.
Recent work shows that neural machine translation systems are more sensitive to noise than traditional statistical machine translation methods.
We propose a novel approach to filter out noisy sentence pairs from web-crawled corpora via pre-trained language models (a sketch of one such filtering criterion follows this entry).
arXiv Detail & Related papers (2020-05-13T06:06:23Z)
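As one concrete instance of LM-based filtering, the sketch below keeps a sentence pair only if its target side looks fluent under GPT-2 perplexity; this is a minimal, assumption-laden stand-in (the model choice, the threshold, and the fluency-only criterion are all illustrative, and the paper combines pre-trained LMs in more elaborate ways).

    import math
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    # Illustrative LM; the paper's actual filtering models may differ.
    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    @torch.no_grad()
    def perplexity(sentence: str) -> float:
        """Fluency proxy for the English side of a pair."""
        ids = tok(sentence, return_tensors="pt").input_ids
        loss = lm(ids, labels=ids).loss  # mean token-level cross-entropy
        return math.exp(loss.item())

    def keep(pair, max_ppl: float = 200.0) -> bool:
        """Keep a (src, tgt) pair whose target side is fluent.
        The threshold is a hypothetical tuning knob."""
        _, tgt = pair
        return perplexity(tgt) < max_ppl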
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.