WeTS: A Benchmark for Translation Suggestion
- URL: http://arxiv.org/abs/2110.05151v1
- Date: Mon, 11 Oct 2021 10:52:17 GMT
- Title: WeTS: A Benchmark for Translation Suggestion
- Authors: Zhen Yang, Yingxue Zhang, Ernan Li, Fandong Meng and Jie Zhou
- Abstract summary: We create a benchmark data set for Translation Suggestion (TS) called WeTS.
We also propose several novel methods to generate a synthetic corpus, which can substantially improve the performance of TS.
Our model achieves State-Of-The-Art (SOTA) results on all four translation directions: English-to-German, German-to-English, Chinese-to-English, and English-to-Chinese.
- Score: 32.10692757420455
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Translation Suggestion (TS), which provides alternatives for specific words
or phrases given the entire document translated by machine translation (MT)
\cite{lee2021intellicat}, has been proven to play a significant role in
post-editing (PE). However, there is still no publicly available data set to
support in-depth research on this problem, and no reproducible experimental
results for researchers in this community to follow. To remove this limitation,
we create a benchmark data set for TS, called \emph{WeTS}, which contains a golden
corpus annotated by expert translators in four translation directions. Apart
from the human-annotated golden corpus, we also propose several novel methods
to generate a synthetic corpus, which can substantially improve the performance of
TS. With the corpus we construct, we introduce a Transformer-based model for
TS, and experimental results show that our model achieves State-Of-The-Art
(SOTA) results on all four translation directions: English-to-German,
German-to-English, Chinese-to-English, and English-to-Chinese. Code and corpus
can be found at \url{https://github.com/ZhenYangIACAS/WeTS.git}.
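The TS task described above can be pictured as a fill-in-the-span problem: the model sees the source sentence and the MT output with the span to be revised masked out, and it generates a replacement. The sketch below illustrates only the idea; the separator token, mask token, and input layout are assumptions for illustration, not the format used in the WeTS repository.

```python
def build_ts_input(source, translation, span_start, span_end, mask_token="<mask>"):
    """Build a Translation Suggestion input: the source sentence plus the
    MT output with the span under revision masked out. The model would be
    trained to generate a suggestion for the masked span."""
    masked = translation[:span_start] + mask_token + translation[span_end:]
    return f"{source} </s> {masked}"

# Mask the awkward span "an sample" (characters 8..16) in the MT output.
example = build_ts_input("Das ist ein Beispiel.", "This is an sample.", 8, 17)
# a TS model would then be expected to suggest something like "a sample"
```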
Related papers
- Contextual Refinement of Translations: Large Language Models for Sentence and Document-Level Post-Editing [12.843274390224853]
Large Language Models (LLMs) have demonstrated considerable success in various Natural Language Processing tasks.
We show that they have yet to attain state-of-the-art performance in Neural Machine Translation.
We propose adapting LLMs as Automatic Post-Editors (APE) rather than direct translators.
arXiv Detail & Related papers (2023-10-23T12:22:15Z)
- ParroT: Translating during Chat using Large Language Models tuned with Human Translation and Feedback [90.20262941911027]
ParroT is a framework to enhance and regulate the translation abilities during chat.
Specifically, ParroT reformulates translation data into the instruction-following style.
We propose three instruction types for finetuning ParroT models, including translation instruction, contrastive instruction, and error-guided instruction.
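The reformulation into instruction-following style mentioned above can be sketched as turning each parallel sentence pair into an instruction/input/output record. The field names and prompt wording below are assumptions chosen for illustration, not ParroT's actual data format.

```python
def to_instruction_example(src, tgt, src_lang="German", tgt_lang="English"):
    """Reformat one parallel sentence pair into instruction-following style,
    as instruction-tuned models expect (a hypothetical sketch of the idea
    described in the ParroT abstract)."""
    return {
        "instruction": f"Translate the following {src_lang} sentence into {tgt_lang}.",
        "input": src,
        "output": tgt,
    }

ex = to_instruction_example("Guten Morgen.", "Good morning.")
```

A contrastive or error-guided instruction would follow the same record shape, with the instruction text pointing at a preferred translation or a marked error.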
arXiv Detail & Related papers (2023-04-05T13:12:00Z)
- Rethink about the Word-level Quality Estimation for Machine Translation from Human Judgement [57.72846454929923]
We create a benchmark dataset, HJQE, in which expert translators directly annotate poorly translated words.
We propose two tag-correcting strategies, namely a tag refinement strategy and a tree-based annotation strategy, to bring the TER-based artificial QE corpus closer to HJQE.
The results show our proposed dataset is more consistent with human judgement and also confirm the effectiveness of the proposed tag correcting strategies.
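Word-level QE of the kind described above reduces to assigning an OK/BAD tag to every token of the MT output. The sketch below illustrates that tagging scheme; the function and the idea of passing annotated token positions directly are illustrative assumptions, not the HJQE dataset's actual schema.

```python
def tag_words(mt_tokens, bad_indices):
    """Assign word-level OK/BAD quality tags to an MT output.
    `bad_indices` stands in for the token positions an expert translator
    marked as poorly translated (hypothetical, mirroring HJQE-style
    direct human annotation rather than TER-derived tags)."""
    return ["BAD" if i in bad_indices else "OK" for i in range(len(mt_tokens))]

# token "are" (index 2) is annotated as poorly translated
tags = tag_words(["The", "weather", "are", "nice", "today"], {2})
```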
arXiv Detail & Related papers (2022-09-13T02:37:12Z)
- GigaST: A 10,000-hour Pseudo Speech Translation Corpus [33.572303016012384]
GigaST is a large-scale pseudo speech translation (ST) corpus.
We create the corpus by translating the text in GigaSpeech, an English ASR corpus, into German and Chinese.
The training set is translated by a strong machine translation system, and the test set is translated by humans.
arXiv Detail & Related papers (2022-04-08T08:59:33Z)
- BitextEdit: Automatic Bitext Editing for Improved Low-Resource Machine Translation [53.55009917938002]
We propose to refine the mined bitexts via automatic editing.
Experiments demonstrate that our approach successfully improves the quality of CCMatrix mined bitext for 5 low-resource language-pairs and 10 translation directions by up to 8 BLEU points.
arXiv Detail & Related papers (2021-11-12T16:00:39Z)
- ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and the endangered language Cherokee.
It supports both statistical and neural translation models and provides quality estimation to inform users of translation reliability.
arXiv Detail & Related papers (2021-07-30T17:58:54Z)
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
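The single-decoder idea above amounts to training on a concatenated target sequence: the decoder first emits the source-language transcript, then continues with the target-language translation. The separator token and concatenation format below are assumptions for illustration, not COSTT's exact implementation.

```python
def build_costt_target(transcript, translation, sep="<sep>"):
    """Build the target sequence for consecutive decoding: a single decoder
    first generates the source transcript, then the target translation,
    so transcription supervision and translation supervision share one pass."""
    return f"{transcript} {sep} {translation}"

target = build_costt_target("how are you", "wie geht es dir")
```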
arXiv Detail & Related papers (2020-09-21T10:10:45Z)
- "Listen, Understand and Translate": Triple Supervision Decouples End-to-end Speech-to-text Translation [49.610188741500274]
An end-to-end speech-to-text translation (ST) system takes audio in a source language and outputs text in a target language.
Existing methods are limited by the amount of parallel corpus.
We build a system to fully utilize signals in a parallel ST corpus.
arXiv Detail & Related papers (2020-09-21T09:19:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.