Identifying Context-Dependent Translations for Evaluation Set Production
- URL: http://arxiv.org/abs/2311.02321v1
- Date: Sat, 4 Nov 2023 04:29:08 GMT
- Title: Identifying Context-Dependent Translations for Evaluation Set Production
- Authors: Rachel Wicks, Matt Post
- Abstract summary: A major impediment to the transition to context-aware machine translation is the absence of good evaluation metrics and test sets.
We produce CTXPRO, a tool that identifies subsets of parallel documents containing sentences that require context to translate five phenomena.
The input to the pipeline is a set of hand-crafted, per-language, linguistically-informed rules that select contextual sentence pairs.
- Score: 11.543673351369183
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A major impediment to the transition to context-aware machine translation is
the absence of good evaluation metrics and test sets. Sentences that require
context to be translated correctly are rare in test sets, reducing the utility
of standard corpus-level metrics such as COMET or BLEU. On the other hand,
datasets that annotate such sentences are also rare, small in scale, and
available for only a few languages. To address this, we modernize, generalize,
and extend previous annotation pipelines to produce CTXPRO, a tool that
identifies subsets of parallel documents containing sentences that require
context to correctly translate five phenomena: gender, formality, and animacy
for pronouns, verb phrase ellipsis, and ambiguous noun inflections. The input
to the pipeline is a set of hand-crafted, per-language, linguistically-informed
rules that select contextual sentence pairs using coreference, part-of-speech,
and morphological features provided by state-of-the-art tools. We apply this
pipeline to seven language pairs (EN into and out of DE, ES, FR, IT, PL, PT,
and RU) and two datasets (OpenSubtitles and WMT test sets), and validate its
performance using both overlap with previous work and its ability to
discriminate a contextual MT system from a sentence-based one. We release the
CTXPRO pipeline and data as open source.
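As a rough illustration of what one of these hand-crafted rules might look like (a minimal sketch, not the actual CTXPRO implementation), the snippet below flags English-German sentence pairs in which the pronoun "it" corefers with an antecedent in an earlier sentence, so the grammatical gender of the German pronoun cannot be chosen from the current sentence alone. The Token structure and its annotations are hypothetical stand-ins for the output of the coreference, POS, and morphology tools the abstract mentions.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical, pre-computed annotations standing in for the coreference,
# POS, and morphology tools the CTXPRO pipeline relies on.
@dataclass
class Token:
    text: str
    pos: str                                     # e.g. "PRON", "NOUN"
    sent_id: int                                 # index of the sentence containing the token
    coref_antecedent_sent: Optional[int] = None  # sentence index of the coreference antecedent

def needs_context_for_pronoun_gender(tokens: List[Token], current_sent: int) -> bool:
    """Toy per-language rule: flag the sentence if an ambiguous English pronoun
    ('it') has its antecedent in an earlier sentence, so the target-side gender
    (e.g. er/sie/es in German) cannot be resolved locally."""
    for tok in tokens:
        if tok.sent_id != current_sent:
            continue
        if tok.pos == "PRON" and tok.text.lower() == "it":
            ant = tok.coref_antecedent_sent
            if ant is not None and ant < current_sent:
                return True
    return False

# Minimal usage example with mocked annotations:
doc = [
    Token("The", "DET", 0), Token("lamp", "NOUN", 0), Token("broke", "VERB", 0),
    Token("It", "PRON", 1, coref_antecedent_sent=0),
    Token("was", "AUX", 1), Token("old", "ADJ", 1),
]
print(needs_context_for_pronoun_gender(doc, current_sent=1))  # True
```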
Related papers
- Pipeline and Dataset Generation for Automated Fact-checking in Almost
Any Language [0.0]
This article presents a pipeline for automated fact-checking leveraging publicly available Language Models and data.
The pipeline consists of two main modules -- the evidence retrieval and the claim veracity evaluation.
We provide open access to all data and fine-tuned models for Czech, English, Polish, and Slovak pipelines.
arXiv Detail & Related papers (2023-12-15T19:43:41Z)
- Discourse Centric Evaluation of Machine Translation with a Densely Annotated Parallel Corpus [82.07304301996562]
This paper presents a new dataset with rich discourse annotations, built upon the large-scale parallel corpus BWB introduced in Jiang et al.
We investigate the similarities and differences between the discourse structures of source and target languages.
We discover that MT outputs differ fundamentally from human translations in terms of their latent discourse structures.
arXiv Detail & Related papers (2023-05-18T17:36:41Z)
- Are the Best Multilingual Document Embeddings simply Based on Sentence Embeddings? [18.968571816913208]
We provide a systematic comparison of methods to produce document-level representations from sentences based on LASER, LaBSE, and Sentence BERT pre-trained multilingual models.
We show that a clever combination of sentence embeddings is usually better than encoding the full document as a single unit.
arXiv Detail & Related papers (2023-04-28T12:11:21Z)
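As a loose sketch of one way to combine sentence embeddings into a document representation (the paper compares several schemes; this length-weighted mean is only an assumed illustration), using LaBSE via sentence-transformers:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# LaBSE is one of the pre-trained multilingual encoders the paper compares.
model = SentenceTransformer("sentence-transformers/LaBSE")

def document_embedding(sentences: list[str]) -> np.ndarray:
    """Length-weighted mean of sentence embeddings (one simple combination;
    the paper evaluates several alternatives)."""
    embs = model.encode(sentences)                                   # (n_sents, dim)
    weights = np.array([len(s.split()) for s in sentences], dtype=float)
    weights /= weights.sum()
    return (weights[:, None] * embs).sum(axis=0)

doc = ["Das ist der erste Satz.", "Hier kommt noch ein zweiter, längerer Satz."]
print(document_embedding(doc).shape)   # (768,)
```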
- Statistical Machine Translation for Indic Languages [1.8899300124593648]
This paper describes the development of bilingual Statistical Machine Translation models.
To build the system, the open-source MOSES SMT toolkit is used.
In our experiment, the quality of the translation is evaluated using standard metrics such as BLEU, METEOR, and RIBES.
arXiv Detail & Related papers (2023-01-02T06:23:12Z)
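For reference, the corpus-level BLEU scores mentioned here (and in the CTXPRO abstract above) can be computed with sacrebleu; the hypothesis and reference below are placeholders.

```python
import sacrebleu

hypotheses = ["the cat sat on the mat"]
references = [["the cat is sitting on the mat"]]   # one list per reference stream

# corpus_bleu takes a list of hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```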
- Understanding Translationese in Cross-Lingual Summarization [106.69566000567598]
Cross-lingual summarization (CLS) aims to generate a concise summary in a different target language.
Existing large-scale CLS datasets typically involve translation in their creation.
In this paper, we first confirm that different approaches of constructing CLS datasets will lead to different degrees of translationese.
arXiv Detail & Related papers (2022-12-14T13:41:49Z)
- When Does Translation Require Context? A Data-driven, Multilingual Exploration [71.43817945875433]
Proper handling of discourse significantly contributes to the quality of machine translation (MT).
Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation.
We develop the Multilingual Discourse-Aware benchmark, a series of taggers that identify discourse phenomena and evaluate model performance on them.
arXiv Detail & Related papers (2021-09-15T17:29:30Z)
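Benchmarks built from such taggers typically score a system by checking whether the word forced by the discourse context actually appears in its output. A simplified, hypothetical sketch of that phenomenon-level accuracy check (field names are made up):

```python
from dataclasses import dataclass

@dataclass
class TaggedExample:
    source: str
    expected_target_word: str   # the word the discourse context makes obligatory, e.g. "sie"

def phenomenon_accuracy(examples: list[TaggedExample], hypotheses: list[str]) -> float:
    """Fraction of tagged examples whose translation contains the expected
    context-dependent word -- a crude surrogate for the word-level matching
    such benchmarks use."""
    hits = 0
    for ex, hyp in zip(examples, hypotheses):
        if ex.expected_target_word.lower() in hyp.lower().split():
            hits += 1
    return hits / len(examples) if examples else 0.0

examples = [TaggedExample("It was broken.", "sie")]
print(phenomenon_accuracy(examples, ["Sie war kaputt."]))  # 1.0
```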
- Sentiment-based Candidate Selection for NMT [2.580271290008534]
We propose a decoder-side approach that incorporates automatic sentiment scoring into the machine translation (MT) candidate selection process.
We train separate English and Spanish sentiment classifiers, then, using n-best candidates generated by a baseline MT model with beam search, select the candidate that minimizes the absolute difference between the sentiment score of the source sentence and that of the translation.
The results of human evaluations show that, in comparison to the open-source MT model on top of which our pipeline is built, our pipeline produces more accurate translations of colloquial, sentiment-heavy source texts.
arXiv Detail & Related papers (2021-04-10T19:01:52Z)
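The selection rule itself is easy to sketch: given n-best candidates and per-language sentiment scorers (the toy lexicon scorer below stands in for the trained English and Spanish classifiers), pick the candidate whose sentiment score is closest to the source's.

```python
from typing import Callable, List

def select_candidate(source: str,
                     candidates: List[str],
                     score_src: Callable[[str], float],
                     score_tgt: Callable[[str], float]) -> str:
    """Return the n-best candidate minimising |sentiment(source) - sentiment(candidate)|."""
    src_sentiment = score_src(source)
    return min(candidates, key=lambda c: abs(src_sentiment - score_tgt(c)))

# Toy usage with a stand-in lexicon scorer (the paper trains real
# English and Spanish sentiment classifiers).
positive = {"great", "genial", "love", "encanta"}
toy_score = lambda text: float(sum(w.strip(".,!").lower() in positive for w in text.split()))

print(select_candidate("I love this!",
                       ["Me encanta esto!", "Esto está bien."],
                       toy_score, toy_score))   # "Me encanta esto!"
```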
- Divide and Rule: Training Context-Aware Multi-Encoder Translation Models with Little Resources [20.057692375546356]
Multi-encoder models aim to improve translation quality by encoding document-level contextual information alongside the current sentence.
We show that training these parameters takes a large amount of data, since the contextual training signal is sparse.
We propose an efficient alternative, based on splitting sentence pairs, that enriches the training signal of a set of parallel sentences.
arXiv Detail & Related papers (2021-03-31T15:15:32Z)
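A very rough sketch of the splitting idea, under the assumption that context and current sentence are joined with a separator token (the paper's exact data format may differ): each pair of consecutive parallel sentences also yields a context-augmented training example.

```python
from typing import Iterable, List, Tuple

SEP = "<sep>"   # assumed context separator token; the paper's format may differ

def enrich_with_context(parallel_doc: List[Tuple[str, str]]) -> Iterable[Tuple[str, str]]:
    """From a document of (source, target) sentence pairs, yield both the plain
    pairs and context-augmented pairs built from consecutive sentences."""
    for i, (src, tgt) in enumerate(parallel_doc):
        yield src, tgt                                     # ordinary sentence pair
        if i > 0:
            prev_src, prev_tgt = parallel_doc[i - 1]
            yield f"{prev_src} {SEP} {src}", f"{prev_tgt} {SEP} {tgt}"

doc = [("The lamp broke.", "Die Lampe ging kaputt."),
       ("It was old.", "Sie war alt.")]
for example in enrich_with_context(doc):
    print(example)
```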
- Unsupervised Bitext Mining and Translation via Self-trained Contextual Embeddings [51.47607125262885]
We describe an unsupervised method to create pseudo-parallel corpora for machine translation (MT) from unaligned text.
We use multilingual BERT to create source and target sentence embeddings for nearest-neighbor search and adapt the model via self-training.
We validate our technique by extracting parallel sentence pairs on the BUCC 2017 bitext mining task and observe up to a 24.5 point increase (absolute) in F1 scores over previous unsupervised methods.
arXiv Detail & Related papers (2020-10-15T14:04:03Z)
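The retrieval step can be sketched as follows (self-training and filtering omitted): mean-pool multilingual BERT hidden states into sentence vectors and pair each source sentence with its cosine nearest neighbour on the target side.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(sentences: list[str]) -> torch.Tensor:
    """Mean-pooled mBERT sentence embeddings (no self-training adaptation here)."""
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # (B, H)

src = ["The lamp broke.", "I like coffee."]
tgt = ["Ich mag Kaffee.", "Die Lampe ging kaputt."]

# Cosine nearest-neighbour pairing between the two sides.
a = torch.nn.functional.normalize(embed(src), dim=-1)
b = torch.nn.functional.normalize(embed(tgt), dim=-1)
for i, j in enumerate((a @ b.T).argmax(dim=1).tolist()):
    print(src[i], "<->", tgt[j])
```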
- On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity.
This paper concerns itself with reference-free machine translation (MT) evaluation, where source texts are directly compared to (sometimes low-quality) system translations.
We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER.
We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
arXiv Detail & Related papers (2020-05-03T22:10:23Z)
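The kind of reference-free score the paper probes can be approximated by a direct cross-lingual cosine similarity between source and system output; LaBSE stands in here for the M-BERT and LASER encoders studied, and the paper's finding is precisely that such raw similarities are weak quality signals.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

def reference_free_score(source: str, hypothesis: str) -> float:
    """Cosine similarity between source and MT output in a shared multilingual
    space -- the naive reference-free metric whose limits the paper analyses."""
    src_emb, hyp_emb = model.encode([source, hypothesis], convert_to_tensor=True)
    return util.cos_sim(src_emb, hyp_emb).item()

print(reference_free_score("The lamp broke.", "Die Lampe ging kaputt."))
```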
- Learning Contextualized Sentence Representations for Document-Level Neural Machine Translation [59.191079800436114]
Document-level machine translation incorporates inter-sentential dependencies into the translation of a source sentence.
We propose a new framework to model cross-sentence dependencies by training neural machine translation (NMT) to predict both the target translation and surrounding sentences of a source sentence.
arXiv Detail & Related papers (2020-03-30T03:38:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.