Dynamic Terminology Integration for COVID-19 and other Emerging Domains
- URL: http://arxiv.org/abs/2109.04708v1
- Date: Fri, 10 Sep 2021 07:23:55 GMT
- Title: Dynamic Terminology Integration for COVID-19 and other Emerging Domains
- Authors: Toms Bergmanis and Mārcis Pinnis
- Abstract summary: This work is part of the WMT 2021 Shared Task: Machine Translation using Terminologies, where we describe Tilde MT systems capable of dynamic terminology integration at the time of translation.
Our systems achieve up to 94% COVID-19 term use accuracy on the test set of the EN-FR language pair without having access to any form of in-domain information during system training.
- Score: 4.492630871726495
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The majority of language domains require prudent use of terminology to ensure
clarity and adequacy of information conveyed. While the correct use of
terminology for some languages and domains can be achieved by adapting
general-purpose MT systems on large volumes of in-domain parallel data, such
quantities of domain-specific data are seldom available for less-resourced
languages and niche domains. Furthermore, as exemplified by COVID-19 recently,
no domain-specific parallel data is readily available for emerging domains.
However, the gravity of this recent calamity created a high demand for reliable
translation of critical information regarding the pandemic and infection
prevention. This work is part of the WMT 2021 Shared Task: Machine Translation using
Terminologies, where we describe Tilde MT systems that are capable of dynamic
terminology integration at the time of translation. Our systems achieve up to
94% COVID-19 term use accuracy on the test set of the EN-FR language pair
without having access to any form of in-domain information during system
training. We conclude our work with a broader discussion considering the Shared
Task itself and terminology translation in MT.
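To make the idea of dynamic terminology integration concrete, here is a minimal sketch that forces an approved target term into the output with lexically constrained beam search, via the Hugging Face transformers `force_words_ids` argument. This is a generic technique, not necessarily the mechanism of the Tilde MT systems; the Helsinki-NLP/opus-mt-en-fr model and the French term are assumptions chosen for illustration.

```python
# A minimal sketch of terminology-constrained decoding at translation time.
# This is NOT the Tilde MT method from the paper, only one generic way to
# force an approved term into the output; the model and term are assumed.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-fr"  # stand-in EN-FR model (assumption)
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

source = "Community transmission of the novel coronavirus remains high."
term = "transmission communautaire"  # approved COVID-19 term (assumption)

# Token ids of the phrase that must appear in the decoded output.
force_ids = tokenizer([term], add_special_tokens=False).input_ids

inputs = tokenizer([source], return_tensors="pt")
outputs = model.generate(
    **inputs,
    force_words_ids=force_ids,  # constrained search requires num_beams > 1
    num_beams=5,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```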
Related papers
- Efficient Terminology Integration for LLM-based Translation in Specialized Domains [0.0]
In specialized fields such as patent, finance, or biomedical domains, terminology is crucial for translation.
We introduce a methodology that efficiently trains models with a smaller amount of data while preserving the accuracy of terminology translation.
This methodology enhances the model's ability to handle specialized terminology and ensures high-quality translations.
arXiv Detail & Related papers (2024-10-21T07:01:25Z)
- Large Language Model for Multi-Domain Translation: Benchmarking and Domain CoT Fine-tuning [55.107329995417786]
Large language models (LLMs) have demonstrated impressive general understanding and generation abilities.
We establish a benchmark for multi-domain translation, featuring 25 German↔English and 22 Chinese↔English test sets.
We propose a domain Chain of Thought (CoT) fine-tuning technique that utilizes the intrinsic multi-domain intelligence of LLMs to improve translation performance.
arXiv Detail & Related papers (2024-10-03T16:15:04Z)
- Fine-tuning Large Language Models for Domain-specific Machine Translation [8.439661191792897]
Large language models (LLMs) have made significant progress in machine translation (MT).
However, their potential in domain-specific MT remains under-explored.
This paper proposes a prompt-oriented fine-tuning method, denoted as LlamaIT, to effectively and efficiently fine-tune a general-purpose LLM for domain-specific MT tasks.
arXiv Detail & Related papers (2024-02-23T02:24:15Z)
- Language Modelling Approaches to Adaptive Machine Translation [0.0]
Consistency is a key requirement of high-quality translation.
In-domain data scarcity is common in translation settings.
Can we employ language models to improve the quality of adaptive MT at inference time?
arXiv Detail & Related papers (2024-01-25T23:02:54Z)
- Can Domains Be Transferred Across Languages in Multi-Domain Multilingual Neural Machine Translation? [52.27798071809941]
This paper investigates whether domain information can be transferred across languages in the composition of multi-domain and multilingual NMT.
We find that multi-domain multilingual (MDML) NMT can boost zero-shot translation performance by up to +10 BLEU.
arXiv Detail & Related papers (2022-10-20T23:13:54Z)
- Addressing Issues of Cross-Linguality in Open-Retrieval Question Answering Systems For Emergent Domains [67.99403521976058]
We demonstrate a cross-lingual open-retrieval question answering system for the emergent domain of COVID-19.
Our system adopts a corpus of scientific articles to ensure that retrieved documents are reliable.
We show that a deep semantic retriever greatly benefits from training on our English-to-all data and significantly outperforms a BM25 baseline in the cross-lingual setting.
arXiv Detail & Related papers (2022-01-26T19:27:32Z)
- On the Evaluation of Machine Translation for Terminology Consistency [31.67296249688388]
We propose metrics to measure the consistency of MT output with regard to a domain terminology; a toy version of such a check is sketched after this entry.
We conduct studies on the COVID-19 domain across 5 languages, including terminology-targeted human evaluation.
arXiv Detail & Related papers (2021-06-22T15:59:32Z)
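As a concrete illustration of the kind of measurement involved, the sketch below checks how often MT outputs contain an approved target term; the function name, matching strategy, and example data are assumptions for illustration, not the metrics proposed in the paper.

```python
# A toy terminology-use check: for each approved target term, the fraction
# of MT outputs that contain it. An illustrative stand-in, not the paper's
# actual metric; lowercase substring matching is a simplifying assumption.
from typing import Dict, List

def term_use_rate(outputs: List[str], terminology: Dict[str, str]) -> Dict[str, float]:
    """Map each source term to the share of outputs using its approved translation."""
    rates = {}
    for src_term, tgt_term in terminology.items():
        hits = sum(tgt_term.lower() in out.lower() for out in outputs)
        rates[src_term] = hits / len(outputs) if outputs else 0.0
    return rates

# Hypothetical EN-FR COVID-19 terminology and MT outputs:
terminology = {"community transmission": "transmission communautaire"}
outputs = [
    "La transmission communautaire reste élevée.",
    "La transmission dans la communauté reste élevée.",  # inconsistent rendering
]
print(term_use_rate(outputs, terminology))  # {'community transmission': 0.5}
```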
- FDMT: A Benchmark Dataset for Fine-grained Domain Adaptation in Machine Translation [53.87731008029645]
We present a real-world fine-grained domain adaptation task in machine translation (FDMT).
The FDMT dataset consists of four sub-domains of information technology: autonomous vehicles, AI education, real-time networks, and smartphones.
We conduct quantitative experiments and in-depth analyses in this new setting, which benchmarks the fine-grained domain adaptation task.
arXiv Detail & Related papers (2020-12-31T17:15:09Z)
- CMT in TREC-COVID Round 2: Mitigating the Generalization Gaps from Web to Special Domain Search [89.48123965553098]
This paper presents a search system to alleviate the special-domain adaptation problem.
The system uses domain-adaptive pretraining and few-shot learning to help neural rankers mitigate the domain discrepancy.
Our system performs the best among the non-manual runs in Round 2 of the TREC-COVID task.
arXiv Detail & Related papers (2020-11-03T09:10:48Z)
- Iterative Domain-Repaired Back-Translation [50.32925322697343]
In this paper, we focus on domain-specific translation with low resources, where in-domain parallel corpora are scarce or nonexistent.
We propose a novel iterative domain-repaired back-translation framework, which introduces the Domain-Repair model to refine translations in synthetic bilingual data; a schematic sketch of the loop follows this entry.
Experiments on adapting NMT models between specific domains and from the general domain to specific domains demonstrate the effectiveness of our proposed approach.
arXiv Detail & Related papers (2020-10-06T04:38:09Z)
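The summary above suggests a loop that alternates back-translation, repair, and retraining; the sketch below is a schematic reconstruction under that assumption, with stub callables standing in for the actual MT, Domain-Repair, and training components.

```python
# A schematic sketch of iterative domain-repaired back-translation.
# back_translate / repair / retrain are hypothetical stubs: the summary
# does not specify the framework's real interfaces.
from typing import Callable, List, Tuple

def iterative_domain_repaired_bt(
    mono_target: List[str],                             # in-domain target text
    back_translate: Callable[[List[str]], List[str]],   # target -> synthetic source
    repair: Callable[[List[str]], List[str]],           # Domain-Repair step (assumed)
    retrain: Callable[[List[Tuple[str, str]]], None],   # update the forward model
    rounds: int = 3,
) -> List[Tuple[str, str]]:
    synthetic: List[Tuple[str, str]] = []
    for _ in range(rounds):
        sources = back_translate(mono_target)  # noisy machine-translated side
        sources = repair(sources)              # refine the synthetic side in-domain
        synthetic = list(zip(sources, mono_target))
        retrain(synthetic)                     # a better model improves the next round
    return synthetic

# Dummy stubs so the sketch runs end to end:
pairs = iterative_domain_repaired_bt(
    mono_target=["phrase cible dans le domaine"],
    back_translate=lambda ys: [f"<back-translation of: {y}>" for y in ys],
    repair=lambda xs: xs,
    retrain=lambda data: None,
    rounds=2,
)
print(pairs)
```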
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.