Designing the Business Conversation Corpus
- URL: http://arxiv.org/abs/2008.01940v1
- Date: Wed, 5 Aug 2020 05:19:44 GMT
- Title: Designing the Business Conversation Corpus
- Authors: Matīss Rikters, Ryokan Ri, Tong Li, Toshiaki Nakazawa
- Abstract summary: We aim to boost the machine translation quality of conversational texts by introducing a newly constructed Japanese-English business conversation parallel corpus.
A detailed analysis of the corpus is provided along with challenging examples for automatic translation.
We also experiment with adding the corpus in a machine translation training scenario and show how the resulting system benefits from its use.
- Score: 20.491255702901288
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While the progress of machine translation of written text has come far in the
past several years thanks to the increasing availability of parallel corpora
and corpora-based training technologies, automatic translation of spoken text
and dialogues remains challenging even for modern systems. In this paper, we
aim to boost the machine translation quality of conversational texts by
introducing a newly constructed Japanese-English business conversation parallel
corpus. A detailed analysis of the corpus is provided along with challenging
examples for automatic translation. We also experiment with adding the corpus
in a machine translation training scenario and show how the resulting system
benefits from its use.
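As a rough illustration of the training scenario mentioned in the abstract, the sketch below mixes a general-domain parallel corpus with an oversampled in-domain business conversation corpus before fine-tuning an NMT system. The file names and the oversampling factor are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' exact recipe): mix a general-domain
# parallel corpus with the business conversation corpus, oversampling the
# smaller in-domain data before fine-tuning an NMT model on the result.
# File names below are hypothetical placeholders.

def read_parallel(src_path, tgt_path):
    """Read a line-aligned parallel corpus as (source, target) pairs."""
    with open(src_path, encoding="utf-8") as fs, open(tgt_path, encoding="utf-8") as ft:
        return [(s.strip(), t.strip()) for s, t in zip(fs, ft)]

general = read_parallel("general.ja", "general.en")
business = read_parallel("business_conv.ja", "business_conv.en")

# Oversample the in-domain corpus so it is not drowned out by general data.
OVERSAMPLE = 5
mixed = general + business * OVERSAMPLE

with open("train.ja", "w", encoding="utf-8") as fs, open("train.en", "w", encoding="utf-8") as ft:
    for src, tgt in mixed:
        fs.write(src + "\n")
        ft.write(tgt + "\n")
```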
Related papers
- Context-Aware LLM Translation System Using Conversation Summarization and Dialogue History [10.596661157821462]
We propose a context-aware LLM translation system for the English-Korean language pair.
Our approach incorporates the two most recent dialogues as raw data and a summary of earlier conversations to manage context length effectively.
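A minimal sketch of this idea, under simplifying assumptions: summarize older turns, keep the two most recent turns verbatim, and assemble a compact translation prompt. The `summarize` and `translate_with_llm` helpers are hypothetical stand-ins for LLM calls, not the paper's actual pipeline.

```python
# Illustrative sketch: build a translation prompt from a summary of earlier
# turns plus the two most recent dialogue turns, keeping the context short.
# `summarize` is a hypothetical callable mapping a list of turns to a string.

def build_prompt(history, current_turn, summarize, max_raw_turns=2):
    earlier, recent = history[:-max_raw_turns], history[-max_raw_turns:]
    summary = summarize(earlier) if earlier else ""
    context = "\n".join(recent)
    return (
        "Summary of earlier conversation:\n" + summary + "\n\n"
        "Recent dialogue:\n" + context + "\n\n"
        "Translate the next English turn into Korean:\n" + current_turn
    )

# Usage: prompt = build_prompt(turns, new_turn, summarize=my_summarizer)
#        translation = translate_with_llm(prompt)   # hypothetical LLM call
```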
arXiv Detail & Related papers (2024-10-22T07:45:18Z)
- Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts [50.00305136008848]
We propose a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera.
For both English-Japanese and English-Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets.
This study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits.
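A generic sketch of parallel sentence mining by embedding similarity follows; the paper's actual mining framework may differ, and `encode` stands in for any sentence-embedding model.

```python
# Generic sketch of parallel sentence mining: align each source sentence to
# its most similar target sentence by cosine similarity over embeddings and
# keep pairs above a threshold. `encode` is a hypothetical function mapping a
# list of sentences to a 2-D numpy array.
import numpy as np

def mine_pairs(src_sents, tgt_sents, encode, threshold=0.8):
    src_emb = encode(src_sents)
    tgt_emb = encode(tgt_sents)
    # Normalise so the dot product equals cosine similarity.
    src_emb = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt_emb = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src_emb @ tgt_emb.T
    pairs = []
    for i, j in enumerate(sims.argmax(axis=1)):
        if sims[i, j] >= threshold:
            pairs.append((src_sents[i], tgt_sents[j], float(sims[i, j])))
    return pairs
```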
arXiv Detail & Related papers (2023-11-07T03:50:25Z)
- Decomposed Prompting for Machine Translation Between Related Languages using Large Language Models [55.35106713257871]
We introduce DecoMT, a novel approach of few-shot prompting that decomposes the translation process into a sequence of word chunk translations.
We show that DecoMT outperforms the strong few-shot prompting BLOOM model with an average improvement of 8 chrF++ scores across the examined languages.
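A minimal sketch of the decomposed-prompting idea under simplifying assumptions: split the source into fixed-size word chunks and translate each chunk with a few-shot prompt. DecoMT additionally refines chunk translations in context, which this sketch omits; `llm` is a hypothetical completion callable.

```python
# Minimal sketch of decomposed prompting: translate fixed-size word chunks
# independently with a few-shot prompt, then join the chunk translations.
# `llm` is a hypothetical callable returning the model's completion text.

def chunk(words, size=3):
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def translate_decomposed(sentence, llm, few_shot_examples):
    pieces = []
    for c in chunk(sentence.split()):
        prompt = few_shot_examples + f"\nSource: {c}\nTranslation:"
        pieces.append(llm(prompt).strip())
    return " ".join(pieces)
```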
arXiv Detail & Related papers (2023-05-22T14:52:47Z)
- HanoiT: Enhancing Context-aware Translation via Selective Context [95.93730812799798]
Context-aware neural machine translation aims to use the document-level context to improve translation quality.
However, irrelevant or trivial words in the context may introduce noise and distract the model from learning the relationship between the current sentence and the auxiliary context.
We propose a novel end-to-end encoder-decoder model with a layer-wise selection mechanism to sift and refine the long document context.
arXiv Detail & Related papers (2023-01-17T12:07:13Z)
- A Bilingual Parallel Corpus with Discourse Annotations [82.07304301996562]
This paper describes BWB, a large parallel corpus first introduced in Jiang et al. (2022), along with an annotated test set.
The BWB corpus consists of Chinese novels translated by experts into English, and the annotated test set is designed to probe the ability of machine translation systems to model various discourse phenomena.
arXiv Detail & Related papers (2022-10-26T12:33:53Z)
- BSTC: A Large-Scale Chinese-English Speech Translation Dataset [26.633433687767553]
BSTC (Baidu Speech Translation Corpus) is a large-scale Chinese-English speech translation dataset.
This dataset is constructed based on a collection of licensed videos of talks or lectures, including about 68 hours of Mandarin data.
We have asked three experienced interpreters to simultaneously interpret the testing talks in a mock conference setting.
arXiv Detail & Related papers (2021-04-08T07:38:51Z)
- Word Alignment by Fine-tuning Embeddings on Parallel Corpora [96.28608163701055]
Word alignment over parallel corpora has a wide variety of applications, including learning translation lexicons, cross-lingual transfer of language processing tools, and automatic evaluation or analysis of translation outputs.
Recently, other work has demonstrated that pre-trained contextualized word embeddings derived from multilingually trained language models (LMs) prove an attractive alternative, achieving competitive results on the word alignment task even in the absence of explicit training on parallel data.
In this paper, we examine methods to marry the two approaches: leveraging pre-trained LMs but fine-tuning them on parallel text with objectives designed to improve alignment quality, and proposing methods to effectively extract alignments from these fine-tuned models.
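An illustrative sketch in the spirit of this line of work, not its exact procedure: extract alignments by mutual nearest neighbours over cosine similarity of contextual token embeddings, with `embed` as a hypothetical per-token encoder.

```python
# Illustrative sketch: align source token i to target token j when each is
# the other's nearest neighbour under cosine similarity of their contextual
# embeddings. `embed` is a hypothetical function returning one vector per token.
import numpy as np

def align(src_tokens, tgt_tokens, embed):
    s = embed(src_tokens)                       # shape: (len(src), dim)
    t = embed(tgt_tokens)                       # shape: (len(tgt), dim)
    s = s / np.linalg.norm(s, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    sim = s @ t.T
    fwd = sim.argmax(axis=1)                    # best target for each source token
    bwd = sim.argmax(axis=0)                    # best source for each target token
    # Keep only mutually consistent (intersection) alignments.
    return [(i, j) for i, j in enumerate(fwd) if bwd[j] == i]
```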
arXiv Detail & Related papers (2021-01-20T17:54:47Z)
- Preparation of Sentiment tagged Parallel Corpus and Testing its effect on Machine Translation [12.447116722795899]
The paper discusses the preparation of a sentiment-tagged English-Bengali parallel corpus.
The output of the resulting translation model is compared against a baseline translation model using automated metrics such as BLEU and TER.
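A hedged example of the kind of automatic comparison mentioned above, using sacrebleu (>= 2.0); the toy hypotheses and references are placeholders, not data from the paper.

```python
# Score two systems' outputs against references with BLEU and TER via sacrebleu.
from sacrebleu.metrics import BLEU, TER

refs = ["the meeting starts at nine"]            # reference translations (placeholder)
baseline = ["meeting starts nine"]               # baseline system output (placeholder)
sentiment_aware = ["the meeting starts at nine"] # sentiment-tagged system output (placeholder)

bleu, ter = BLEU(), TER()
for name, hyps in [("baseline", baseline), ("sentiment-tagged", sentiment_aware)]:
    print(name, bleu.corpus_score(hyps, [refs]), ter.corpus_score(hyps, [refs]))
```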
arXiv Detail & Related papers (2020-07-28T09:04:47Z)
- Contextual Neural Machine Translation Improves Translation of Cataphoric Pronouns [50.245845110446496]
We investigate the effect of future sentences as context by comparing the performance of a contextual NMT model trained with the future context to the one trained with the past context.
Our experiments and evaluation, using generic and pronoun-focused automatic metrics, show that the use of future context achieves significant improvements over the context-agnostic Transformer.
arXiv Detail & Related papers (2020-04-21T10:45:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.