A Corpus for English-Japanese Multimodal Neural Machine Translation with
Comparable Sentences
- URL: http://arxiv.org/abs/2010.08725v1
- Date: Sat, 17 Oct 2020 06:12:25 GMT
- Title: A Corpus for English-Japanese Multimodal Neural Machine Translation with
Comparable Sentences
- Authors: Andrew Merritt, Chenhui Chu, Yuki Arase
- Abstract summary: We propose a new multimodal English-Japanese corpus with comparable sentences that are compiled from existing image captioning datasets.
Due to low translation scores in our baseline experiments, we believe that current multimodal NMT models are not designed to effectively utilize comparable sentence data.
- Score: 21.43163704217968
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal neural machine translation (NMT) has become an increasingly
important area of research over the years because additional modalities, such
as image data, can provide more context to textual data. Furthermore, the
viability of training multimodal NMT models without a large parallel corpus
continues to be investigated due to low availability of parallel sentences with
images, particularly for English-Japanese data. However, this void can be
filled with comparable sentences that contain bilingual terms and parallel
phrases, which are naturally created through media such as social network posts
and e-commerce product descriptions. In this paper, we propose a new multimodal
English-Japanese corpus with comparable sentences that are compiled from
existing image captioning datasets. In addition, we supplement our comparable
sentences with a smaller parallel corpus for validation and test purposes. To
test the performance of this comparable sentence translation scenario, we train
several baseline NMT models with our comparable corpus and evaluate their
English-Japanese translation performance. Due to low translation scores in our
baseline experiments, we believe that current multimodal NMT models are not
designed to effectively utilize comparable sentence data. Despite this, we hope
for our corpus to be used to further research into multimodal NMT with
comparable sentences.
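As a rough illustration of how comparable pairs can arise from image captioning datasets: English and Japanese captions of the same image describe the same content without being literal translations. The sketch below pairs captions by image id; the record format and pairing rule are illustrative assumptions, not the authors' actual compilation procedure.

```python
# Hypothetical caption records: captions of the same image are comparable
# (shared content and terms, but not sentence-level translations).
en_captions = [
    {"image_id": 1, "text": "A dog runs on the beach."},
    {"image_id": 2, "text": "Two people ride bicycles."},
]
ja_captions = [
    {"image_id": 1, "text": "犬が砂浜を走っている。"},
    {"image_id": 2, "text": "二人が自転車に乗っている。"},
]

def build_comparable_corpus(en, ja):
    """Group Japanese captions by image id, then emit every
    cross-lingual caption pair for each shared image."""
    ja_by_image = {}
    for c in ja:
        ja_by_image.setdefault(c["image_id"], []).append(c["text"])
    pairs = []
    for c in en:
        for ja_text in ja_by_image.get(c["image_id"], []):
            pairs.append((c["text"], ja_text))
    return pairs

pairs = build_comparable_corpus(en_captions, ja_captions)
```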
Related papers
- Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine
Translation of Lecture Transcripts [50.00305136008848]
We propose a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera.
For both English--Japanese and English--Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets.
This study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits.
arXiv Detail & Related papers (2023-11-07T03:50:25Z)
- Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a "versatile" model, i.e., Unified Model Learning for NMT (UMLNMT), that works with data from different tasks.
UMLNMT achieves substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z)
- Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation [53.342921374639346]
Multimodal machine translation aims to improve translation quality by incorporating information from other modalities, such as vision.
Previous MMT systems mainly focus on better access and use of visual information and tend to validate their methods on image-related datasets.
This paper establishes new methods and new datasets for MMT.
arXiv Detail & Related papers (2022-12-20T15:02:38Z)
- Neural Machine Translation with Contrastive Translation Memories [71.86990102704311]
Retrieval-augmented Neural Machine Translation models have been successful in many translation scenarios.
We propose a new retrieval-augmented NMT to model contrastively retrieved translation memories that are holistically similar to the source sentence.
In the training phase, a multi-TM contrastive learning objective is introduced to learn the salient features of each TM with respect to the target sentence.
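The contrastive objective described above can be sketched as an InfoNCE-style loss over similarity scores between retrieved translation memories and the target sentence. The scoring setup and temperature below are illustrative assumptions, not the paper's actual model.

```python
import math

def contrastive_tm_loss(sims, positive_index, temperature=0.1):
    """InfoNCE-style loss: push the translation memory most relevant
    to the target sentence (the 'positive') above the other retrieved
    TMs. sims holds one similarity score per retrieved TM."""
    logits = [s / temperature for s in sims]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[positive_index] / sum(exps))

# When the positive TM already has the highest similarity, the loss is
# small but still positive; picking a wrong positive makes it large.
loss = contrastive_tm_loss([0.9, 0.2, 0.1], positive_index=0)
```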
arXiv Detail & Related papers (2022-12-06T17:10:17Z)
- Language Agnostic Multilingual Information Retrieval with Contrastive Learning [59.26316111760971]
We present an effective method to train multilingual information retrieval systems.
We leverage parallel and non-parallel corpora to improve the pretrained multilingual language models.
Our model can work well even with a small number of parallel sentences.
arXiv Detail & Related papers (2022-10-12T23:53:50Z)
- Unsupervised Parallel Corpus Mining on Web Data [53.74427402568838]
We present a pipeline to mine the parallel corpus from the Internet in an unsupervised manner.
Our system produces new state-of-the-art results, 39.81 and 38.95 BLEU scores, even compared with supervised approaches.
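The BLEU scores reported above come from standard toolkits; as a minimal sketch of one ingredient of BLEU, the function below computes modified n-gram precision (hypothesis n-gram counts clipped by reference counts). It is a simplified illustration, not full BLEU (no brevity penalty, no multi-reference support).

```python
from collections import Counter

def ngram_precision(hypothesis, reference, n):
    """Modified n-gram precision: count hypothesis n-grams, clipping
    each n-gram's count by its count in the reference."""
    hyp = hypothesis.split()
    ref = reference.split()
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    matches = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
    total = sum(hyp_ngrams.values())
    return matches / total if total else 0.0

# 5 of the 6 hypothesis unigrams appear (with clipping) in the reference.
p1 = ngram_precision("the cat sat on the mat", "the cat is on the mat", 1)
```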
arXiv Detail & Related papers (2020-09-18T02:38:01Z)
- Multiple Segmentations of Thai Sentences for Neural Machine Translation [6.1335228645093265]
We show how to augment a set of English--Thai parallel data by replicating sentence-pairs with different word segmentation methods on Thai.
Experiments show that combining these datasets improves performance for NMT models trained with a dataset that has been split using a supervised splitting tool.
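The augmentation idea above, replicating each sentence pair once per Thai segmentation scheme, can be sketched as follows. The two "segmenters" here (character-level and fixed-size chunks) are stand-ins; a real system would use actual Thai word segmentation tools.

```python
# Placeholder segmenters: real Thai segmentation is dictionary- or
# model-based; these only illustrate producing multiple segmentations.
def char_segment(text):
    return " ".join(text.replace(" ", ""))

def chunk_segment(text, size=3):
    s = text.replace(" ", "")
    return " ".join(s[i:i + size] for i in range(0, len(s), size))

def augment(pairs, segmenters):
    """Emit one (English, segmented-Thai) pair per segmenter
    per original sentence pair."""
    return [(en, seg(th)) for en, th in pairs for seg in segmenters]

data = [("hello", "สวัสดี")]
augmented = augment(data, [char_segment, chunk_segment])
```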
arXiv Detail & Related papers (2020-04-23T21:48:58Z)
- Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation [37.04364877980479]
We show how to mine a parallel corpus from publicly available lectures at Coursera.
Our approach determines sentence alignments, relying on machine translation and cosine similarity over continuous-space sentence representations.
For Japanese--English lectures translation, we extracted parallel data of approximately 40,000 lines and created development and test sets.
arXiv Detail & Related papers (2019-12-26T01:12:31Z)
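The alignment step described in the entry above, cosine similarity over continuous-space sentence representations, can be sketched as a greedy 1-best match. The toy embedding vectors and threshold are illustrative; real pipelines obtain the vectors from a multilingual sentence encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sentence embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def align(src_vecs, tgt_vecs, threshold=0.5):
    """Greedy 1-best alignment: pair each source sentence with its
    most similar target sentence, if it clears the threshold."""
    pairs = []
    for i, u in enumerate(src_vecs):
        best_score, best_j = max((cosine(u, v), j)
                                 for j, v in enumerate(tgt_vecs))
        if best_score >= threshold:
            pairs.append((i, best_j))
    return pairs

# Toy embeddings: source 0 is closest to target 1, source 1 to target 0.
src = [[1.0, 0.0, 0.2], [0.1, 1.0, 0.0]]
tgt = [[0.0, 1.0, 0.1], [1.0, 0.1, 0.3]]
alignment = align(src, tgt)
```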
This list is automatically generated from the titles and abstracts of the papers in this site.