Document-aligned Japanese-English Conversation Parallel Corpus
- URL: http://arxiv.org/abs/2012.06143v1
- Date: Fri, 11 Dec 2020 06:03:33 GMT
- Title: Document-aligned Japanese-English Conversation Parallel Corpus
- Authors: Matīss Rikters, Ryokan Ri, Tong Li, Toshiaki Nakazawa
- Abstract summary: Sentence-level (SL) machine translation (MT) has reached acceptable quality for many high-resourced languages, but not document-level (DL) MT.
We present a document-aligned Japanese-English conversation corpus, including balanced, high-quality business conversation data for tuning and testing.
We train MT models using our corpus to demonstrate how using context leads to improvements.
- Score: 4.793904440030568
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sentence-level (SL) machine translation (MT) has reached acceptable quality
for many high-resourced languages, but not document-level (DL) MT, which is
difficult to 1) train with the small amount of DL data available; and 2) evaluate, as the
main methods and data sets focus on SL evaluation. To address the first issue,
we present a document-aligned Japanese-English conversation corpus, including
balanced, high-quality business conversation data for tuning and testing. As
for the second issue, we manually identify the main areas where SL MT fails to
produce adequate translations for lack of context. We then create an evaluation
set where these phenomena are annotated to facilitate automatic evaluation of DL
systems. We train MT models using our corpus to demonstrate how using context
leads to improvements.
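As a rough illustration of what "using context" can mean with a document-aligned corpus, the sketch below prepends the preceding source sentences of the same document to each source sentence. This is a common concatenation approach and is only an assumption here, not necessarily the method used in the paper; the `<sep>` token, the `add_context` helper, and the toy sentence pairs are all hypothetical.

```python
from typing import List, Tuple

SEP = "<sep>"  # hypothetical separator token, not taken from the paper


def add_context(doc: List[Tuple[str, str]], k: int = 1) -> List[Tuple[str, str]]:
    """Prepend up to k previous source sentences from the same document
    to each source sentence; targets are left unchanged."""
    out = []
    for i, (src, tgt) in enumerate(doc):
        # Sentences i-k .. i-1 of this document (fewer at the document start).
        context = [s for s, _ in doc[max(0, i - k):i]]
        out.append((f" {SEP} ".join(context + [src]), tgt))
    return out


# Toy document of (Japanese, English) pairs, invented for illustration only.
doc = [
    ("こんにちは。", "Hello."),
    ("お元気ですか。", "How are you?"),
    ("はい、元気です。", "Yes, I am fine."),
]
augmented = add_context(doc, k=1)
```

Because context is drawn only from within one document, this kind of preprocessing requires document boundaries to be preserved, which a purely sentence-aligned corpus does not provide.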
Related papers
- Improving Long Context Document-Level Machine Translation [51.359400776242786]
Document-level context for neural machine translation (NMT) is crucial to improve translation consistency and cohesion.
Many works have been published on the topic of document-level NMT, but most restrict the system to just local context.
We propose a constrained attention variant that focuses the attention on the most relevant parts of the sequence, while simultaneously reducing the memory consumption.
arXiv Detail & Related papers (2023-06-08T13:28:48Z)
- On Search Strategies for Document-Level Neural Machine Translation [51.359400776242786]
Document-level neural machine translation (NMT) models produce a more consistent output across a document.
In this work, we aim to answer the question of how best to utilize a context-aware translation model in decoding.
arXiv Detail & Related papers (2023-06-08T11:30:43Z)
- Discourse Centric Evaluation of Machine Translation with a Densely Annotated Parallel Corpus [82.07304301996562]
This paper presents a new dataset with rich discourse annotations, built upon the large-scale parallel corpus BWB introduced in Jiang et al.
We investigate the similarities and differences between the discourse structures of source and target languages.
We discover that MT outputs differ fundamentally from human translations in terms of their latent discourse structures.
arXiv Detail & Related papers (2023-05-18T17:36:41Z)
- A Bilingual Parallel Corpus with Discourse Annotations [82.07304301996562]
This paper describes BWB, a large parallel corpus first introduced in Jiang et al. (2022), along with an annotated test set.
The BWB corpus consists of Chinese novels translated by experts into English, and the annotated test set is designed to probe the ability of machine translation systems to model various discourse phenomena.
arXiv Detail & Related papers (2022-10-26T12:33:53Z)
- When Does Translation Require Context? A Data-driven, Multilingual Exploration [71.43817945875433]
Proper handling of discourse significantly contributes to the quality of machine translation (MT).
Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation.
We develop the Multilingual Discourse-Aware benchmark, a series of taggers that identify and evaluate model performance on discourse phenomena.
arXiv Detail & Related papers (2021-09-15T17:29:30Z)
- ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and Cherokee, an endangered language.
It supports both statistical and neural translation models and provides quality estimation to inform users of reliability.
arXiv Detail & Related papers (2021-07-30T17:58:54Z)
- Auto Correcting in the Process of Translation -- Multi-task Learning Improves Dialogue Machine Translation [31.247920419523066]
We conduct a deep analysis of a dialogue corpus and summarize three major issues on dialogue translation.
We propose a joint learning method to identify omissions and typos, and utilize context to translate dialogue utterances.
Our experiments show that the proposed method improves translation quality by 3.2 BLEU over the baselines.
arXiv Detail & Related papers (2021-03-30T09:12:47Z)
- Majority Voting with Bidirectional Pre-translation For Bitext Retrieval [2.580271290008534]
A popular approach has been to mine so-called "pseudo-parallel" sentences from paired documents in two languages.
In this paper, we outline some problems with current methods, propose computationally economical solutions to those problems, and demonstrate success with novel methods.
We make the code and data used for our experiments publicly available.
arXiv Detail & Related papers (2021-03-10T22:24:01Z)
- Towards Personalised and Document-level Machine Translation of Dialogue [0.0]
This thesis proposal focuses on PersNMT and DocNMT for the domain of dialogue extracted from TV subtitles in five languages.
Three main challenges are addressed: (1) incorporating extra-textual information directly into NMT systems; (2) improving the machine translation of cohesion devices; and (3) reliable evaluation for PersNMT and DocNMT.
arXiv Detail & Related papers (2021-02-11T09:18:20Z)
- Can Your Context-Aware MT System Pass the DiP Benchmark Tests? : Evaluation Benchmarks for Discourse Phenomena in Machine Translation [7.993547048820065]
We introduce first-of-their-kind MT benchmark datasets that aim to track and hail improvements across four main discourse phenomena.
Surprisingly, we find that existing context-aware models do not improve discourse-related translations consistently across languages and phenomena.
arXiv Detail & Related papers (2020-04-30T07:15:36Z)