Controlling Extra-Textual Attributes about Dialogue Participants: A Case
Study of English-to-Polish Neural Machine Translation
- URL: http://arxiv.org/abs/2205.04747v1
- Date: Tue, 10 May 2022 08:45:39 GMT
- Authors: Sebastian T. Vincent, Loïc Barrault, Carolina Scarton
- Abstract summary: Machine translation models need to opt for a certain interpretation of textual context when translating from English to Polish.
We propose a case study where a wide range of approaches for controlling attributes in translation is employed.
The best model achieves an improvement of +5.81 chrF++/+6.03 BLEU, with other models achieving competitive performance.
- Score: 4.348327991071386
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Unlike English, morphologically rich languages can reveal characteristics of
speakers or their conversational partners, such as gender and number, via
pronouns, morphological endings of words and syntax. When translating from
English to such languages, a machine translation model needs to opt for a
certain interpretation of textual context, which may lead to serious
translation errors if extra-textual information is unavailable. We investigate
this challenge in the English-to-Polish language direction. We focus on the
underresearched problem of utilising external metadata in automatic translation
of TV dialogue, proposing a case study where a wide range of approaches for
controlling attributes in translation is employed in a multi-attribute
scenario. The best model achieves an improvement of +5.81 chrF++/+6.03 BLEU,
with other models achieving competitive performance. We additionally contribute
a novel attribute-annotated dataset of Polish TV dialogue and a morphological
analysis script used to evaluate attribute control in models.
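One common way to expose extra-textual attributes to a translation model is to prepend control tokens to the English source sentence. The sketch below illustrates that general idea only; it is not taken from the paper, and the tag strings, the attribute names and the `tag_source` helper are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical tag inventory (an assumption, not the paper's actual vocabulary).
# Keys are attribute names; values map an attribute value to a control token.
TAGS: Dict[str, Dict[str, str]] = {
    "speaker_gender": {"feminine": "<sp:feminine>", "masculine": "<sp:masculine>"},
    "interlocutor_gender": {"feminine": "<il:feminine>", "masculine": "<il:masculine>"},
    "interlocutor_number": {"singular": "<il:singular>", "plural": "<il:plural>"},
}


def tag_source(sentence: str, **attributes: Optional[str]) -> str:
    """Prepend one control token per known attribute to the source sentence.

    Attributes that are missing or set to None are skipped, so the same
    function covers the multi-attribute case and the no-metadata case.
    """
    unknown = set(attributes) - set(TAGS)
    if unknown:
        raise ValueError(f"unknown attribute(s): {sorted(unknown)}")

    tokens: List[str] = []
    for name, values in TAGS.items():  # fixed order keeps tagging deterministic
        value = attributes.get(name)
        if value is None:
            continue
        if value not in values:
            raise ValueError(f"unknown value {value!r} for attribute {name!r}")
        tokens.append(values[value])

    return " ".join(tokens + [sentence])


if __name__ == "__main__":
    print(tag_source("I am tired.", speaker_gender="feminine"))
    print(tag_source("Are you ready?", interlocutor_gender="masculine",
                     interlocutor_number="plural"))
```

A model trained on sources tagged this way can, in principle, learn to pick the Polish forms that agree with the tagged gender and number; whether it does so reliably has to be measured separately.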
Related papers
- Learning Phonotactics from Linguistic Informants [54.086544221761486]
Our model iteratively selects or synthesizes a data-point according to one of a range of information-theoretic policies.
We find that the information-theoretic policies that our model uses to select items to query the informant achieve sample efficiency comparable to, or greater than, fully supervised approaches.
arXiv Detail & Related papers (2024-05-08T00:18:56Z)
- Context-aware Neural Machine Translation for English-Japanese Business Scene Dialogues [14.043741721036543]
This paper explores how context-awareness can improve the performance of the current Neural Machine Translation (NMT) models for English-Japanese business dialogues translation.
We propose novel context tokens encoding extra-sentential information, such as speaker turn and scene type.
We find that models leverage both preceding sentences and extra-sentential context (with CXMI increasing with context size) and we provide a more focused analysis on honorifics translation.
arXiv Detail & Related papers (2023-11-20T18:06:03Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- A Corpus for Sentence-level Subjectivity Detection on English News Articles [49.49218203204942]
We use our guidelines to collect NewsSD-ENG, a corpus of 638 objective and 411 subjective sentences extracted from English news articles on controversial topics.
Our corpus paves the way for subjectivity detection in English without relying on language-specific tools, such as lexicons or machine translation.
arXiv Detail & Related papers (2023-05-29T11:54:50Z)
- Automated stance detection in complex topics and small languages: the challenging case of immigration in polarizing news media [0.0]
This paper explores the applicability of large language models for automated stance detection in a challenging scenario.
It involves a morphologically complex, lower-resource language, and a socio-culturally complex topic, immigration.
If the approach works in this case, it can be expected to perform as well or better in less demanding scenarios.
arXiv Detail & Related papers (2023-05-22T13:56:35Z)
- Reference-less Analysis of Context Specificity in Translation with Personalised Language Models [3.527589066359829]
This work investigates to what extent rich character and film annotations can be leveraged to personalise language models (LMs).
We build LMs which leverage rich contextual information to reduce perplexity by up to 6.5% compared to a non-contextual model.
Our results suggest that the degree to which professional translations in our domain are context-specific can be preserved to a better extent by a contextual machine translation model.
arXiv Detail & Related papers (2023-03-29T12:19:23Z)
- Corpus-Guided Contrast Sets for Morphosyntactic Feature Detection in Low-Resource English Varieties [3.3536302616846734]
We present a human-in-the-loop approach to generate and filter effective contrast sets via corpus-guided edits.
We show that our approach improves feature detection for both Indian English and African American English, demonstrate how it can assist linguistic research, and release our fine-tuned models for use by other researchers.
arXiv Detail & Related papers (2022-09-15T21:19:31Z)
- Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation [70.81596088969378]
Cross-lingual Outline-based Dialogue dataset (termed COD) enables natural language understanding.
COD enables dialogue state tracking, and end-to-end dialogue modelling and evaluation in 4 diverse languages.
arXiv Detail & Related papers (2022-01-31T18:11:21Z)
- ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and Cherokee, an endangered language.
It supports both statistical and neural translation models and provides quality estimation to inform users of reliability.
arXiv Detail & Related papers (2021-07-30T17:58:54Z)
- Modeling Bilingual Conversational Characteristics for Neural Chat Translation [24.94474722693084]
We aim to promote the translation quality of conversational text by modeling the above properties.
We evaluate our approach on the benchmark dataset BConTrasT (English-German) and a self-collected bilingual dialogue corpus named BMELD (English-Chinese).
Our approach notably boosts the performance over strong baselines by a large margin and significantly surpasses some state-of-the-art context-aware NMT models in terms of BLEU and TER.
arXiv Detail & Related papers (2021-07-23T12:23:34Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.