A Bilingual Parallel Corpus with Discourse Annotations
- URL: http://arxiv.org/abs/2210.14667v1
- Date: Wed, 26 Oct 2022 12:33:53 GMT
- Title: A Bilingual Parallel Corpus with Discourse Annotations
- Authors: Yuchen Eleanor Jiang, Tianyu Liu, Shuming Ma, Dongdong Zhang, Mrinmaya
Sachan, Ryan Cotterell
- Abstract summary: This paper describes BWB, a large parallel corpus first introduced in Jiang et al. (2022), along with an annotated test set.
The BWB corpus consists of Chinese novels translated by experts into English, and the annotated test set is designed to probe the ability of machine translation systems to model various discourse phenomena.
- Score: 82.07304301996562
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Machine translation (MT) has almost achieved human parity at sentence-level
translation. In response, the MT community has, in part, shifted its focus to
document-level translation. However, the development of document-level MT
systems is hampered by the lack of parallel document corpora. This paper
describes BWB, a large parallel corpus first introduced in Jiang et al. (2022),
along with an annotated test set. The BWB corpus consists of Chinese novels
translated by experts into English, and the annotated test set is designed to
probe the ability of machine translation systems to model various discourse
phenomena. Our resource is freely available, and we hope it will serve as a
guide and inspiration for more work in document-level machine translation.
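To make concrete what a parallel document corpus provides beyond isolated sentence pairs, here is a minimal, hypothetical loading sketch; the TSV layout with a document-id column is an assumption for illustration only, not the actual BWB release format.
```python
import csv
from collections import defaultdict

def load_document_pairs(path):
    """Group sentence-aligned pairs into documents.

    Assumes a hypothetical TSV layout with columns
    (doc_id, source_sentence, target_sentence); the real
    BWB release format may differ.
    """
    docs = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for doc_id, src, tgt in csv.reader(f, delimiter="\t"):
            docs[doc_id].append((src, tgt))
    return docs  # {doc_id: [(src_sent, tgt_sent), ...]}

# A document-level MT system sees the whole list for a doc_id at once,
# rather than translating each (src, tgt) pair in isolation.
```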
Related papers
- Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts [50.00305136008848]
We propose a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera.
For both English-Japanese and English-Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets.
This study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits.
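As a hedged illustration of the kind of heuristic noise filters commonly applied to mined bitext (the length-ratio threshold, minimum length, and duplicate check below are assumptions, not the paper's actual cleaning recipe):
```python
def filter_mined_pairs(pairs, max_len_ratio=2.0, min_len=3):
    """Drop likely-noisy sentence pairs from a mined corpus.

    Illustrative heuristics only (length ratio, minimum length,
    exact-duplicate removal); the paper's cleaning pipeline is
    not reproduced here.
    """
    seen, kept = set(), []
    for src, tgt in pairs:
        s, t = src.split(), tgt.split()
        if len(s) < min_len or len(t) < min_len:
            continue  # too short to align reliably
        ratio = max(len(s), len(t)) / max(1, min(len(s), len(t)))
        if ratio > max_len_ratio:
            continue  # suspicious length mismatch
        if (src, tgt) in seen:
            continue  # exact duplicate
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept
```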
arXiv Detail & Related papers (2023-11-07T03:50:25Z)
- Discourse Centric Evaluation of Machine Translation with a Densely Annotated Parallel Corpus [82.07304301996562]
This paper presents a new dataset with rich discourse annotations, built upon the large-scale parallel corpus BWB introduced in Jiang et al. (2022).
We investigate the similarities and differences between the discourse structures of source and target languages.
We discover that MT outputs differ fundamentally from human translations in terms of their latent discourse structures.
arXiv Detail & Related papers (2023-05-18T17:36:41Z)
- Document-aligned Japanese-English Conversation Parallel Corpus [4.793904440030568]
Sentence-level (SL) machine translation (MT) has reached acceptable quality for many high-resource languages, but document-level (DL) MT has not.
We present a document-aligned Japanese-English conversation corpus, including balanced, high-quality business conversation data for tuning and testing.
We train MT models using our corpus to demonstrate how using context leads to improvements.
arXiv Detail & Related papers (2020-12-11T06:03:33Z)
- Diving Deep into Context-Aware Neural Machine Translation [36.17847243492193]
This paper analyzes the performance of document-level NMT models on four diverse domains.
We find that there is no single best approach to document-level NMT, but rather that different architectures come out on top on different tasks.
arXiv Detail & Related papers (2020-10-19T13:23:12Z)
- Unsupervised Bitext Mining and Translation via Self-trained Contextual Embeddings [51.47607125262885]
We describe an unsupervised method to create pseudo-parallel corpora for machine translation (MT) from unaligned text.
We use multilingual BERT to create source and target sentence embeddings for nearest-neighbor search and adapt the model via self-training.
We validate our technique by extracting parallel sentence pairs on the BUCC 2017 bitext mining task and observe up to a 24.5 point increase (absolute) in F1 scores over previous unsupervised methods.
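A minimal sketch of the embedding-and-nearest-neighbor step, assuming mean-pooled multilingual BERT representations and cosine similarity; the margin scoring, BUCC evaluation, and self-training loop from the paper are omitted, and the similarity threshold is an arbitrary placeholder.
```python
import torch
from transformers import AutoModel, AutoTokenizer

# Mine candidate sentence pairs with multilingual BERT embeddings and
# cosine-similarity nearest-neighbor search (sketch only).
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased").eval()

def embed(sentences):
    enc = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state      # (batch, seq, dim)
    mask = enc["attention_mask"].unsqueeze(-1)       # ignore padding tokens
    emb = (hidden * mask).sum(1) / mask.sum(1)       # mean pooling
    return torch.nn.functional.normalize(emb, dim=-1)

def mine_pairs(src_sents, tgt_sents, threshold=0.9):
    src_emb, tgt_emb = embed(src_sents), embed(tgt_sents)
    sims = src_emb @ tgt_emb.T                       # cosine similarities
    best = sims.argmax(dim=1)                        # nearest target per source
    return [(src_sents[i], tgt_sents[j], sims[i, j].item())
            for i, j in enumerate(best) if sims[i, j] >= threshold]
```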
arXiv Detail & Related papers (2020-10-15T14:04:03Z)
- SJTU-NICT's Supervised and Unsupervised Neural Machine Translation Systems for the WMT20 News Translation Task [111.91077204077817]
We participated in four translation directions of three language pairs: English-Chinese, English-Polish, and German-Upper Sorbian.
Based on different conditions of language pairs, we have experimented with diverse neural machine translation (NMT) techniques.
In our submissions, the primary systems won first place in the English-to-Chinese, Polish-to-English, and German-to-Upper-Sorbian translation directions.
arXiv Detail & Related papers (2020-10-11T00:40:05Z)
- Designing the Business Conversation Corpus [20.491255702901288]
We aim to boost the machine translation quality of conversational texts by introducing a newly constructed Japanese-English business conversation parallel corpus.
A detailed analysis of the corpus is provided along with challenging examples for automatic translation.
We also experiment with adding the corpus in a machine translation training scenario and show how the resulting system benefits from its use.
arXiv Detail & Related papers (2020-08-05T05:19:44Z)
- Learning Contextualized Sentence Representations for Document-Level Neural Machine Translation [59.191079800436114]
Document-level machine translation incorporates inter-sentential dependencies into the translation of a source sentence.
We propose a new framework to model cross-sentence dependencies by training neural machine translation (NMT) to predict both the target translation and surrounding sentences of a source sentence.
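As a hedged sketch of what such a joint objective could look like (the formulation and weight below are assumptions, not the paper's specification), a translation loss can be combined with an auxiliary loss for predicting the surrounding sentences:
```python
import torch.nn.functional as F

def joint_objective(trans_logits, trans_ids, ctx_logits, ctx_ids, lam=0.5):
    """Illustrative multi-task loss: translate the current sentence and
    also predict its surrounding sentences. Logits have shape
    (batch, seq_len, vocab); ids have shape (batch, seq_len). The weight
    `lam` is an arbitrary placeholder, not a value from the paper."""
    l_trans = F.cross_entropy(trans_logits.transpose(1, 2), trans_ids)
    l_ctx = F.cross_entropy(ctx_logits.transpose(1, 2), ctx_ids)
    return l_trans + lam * l_ctx
```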
arXiv Detail & Related papers (2020-03-30T03:38:01Z)