A comparison of data filtering techniques for English-Polish LLM-based machine translation in the biomedical domain
- URL: http://arxiv.org/abs/2501.16533v1
- Date: Mon, 27 Jan 2025 22:12:09 GMT
- Title: A comparison of data filtering techniques for English-Polish LLM-based machine translation in the biomedical domain
- Authors: Jorge del Pozo Lérida, Kamil Kojs, János Máté, Mikołaj Antoni Barański, Christian Hardmeier,
- Abstract summary: This paper evaluates the impact of data filtering techniques on English-Polish translation within the biomedical domain.
We created varying dataset sizes to fine-tune the mBART50 model, which was then evaluated using the SacreBLEU metric.
Our results show that both LASER and MUSE can significantly reduce dataset sizes while maintaining or even enhancing performance.
- Score: 2.490363177524421
- License:
- Abstract: Large Language Models (LLMs) have become state-of-the-art in Machine Translation (MT), often trained on massive bilingual parallel corpora scraped from the web, that contain low-quality entries and redundant information, leading to significant computational challenges. Various data filtering methods exist to reduce dataset sizes, but their effectiveness largely varies based on specific language pairs and domains. This paper evaluates the impact of commonly used data filtering techniques, such as LASER, MUSE, and LaBSE, on English-Polish translation within the biomedical domain. By filtering the UFAL Medical Corpus, we created varying dataset sizes to fine-tune the mBART50 model, which was then evaluated using the SacreBLEU metric on the Khresmoi dataset, having the quality of translations assessed by bilingual speakers. Our results show that both LASER and MUSE can significantly reduce dataset sizes while maintaining or even enhancing performance. We recommend the use of LASER, as it consistently outperforms the other methods and provides the most fluent and natural-sounding translations.
Related papers
- ScalingFilter: Assessing Data Quality through Inverse Utilization of Scaling Laws [67.59263833387536]
ScalingFilter is a novel approach that evaluates text quality based on the perplexity difference between two language models trained on the same data.
To assess the bias introduced by quality filtering, we introduce semantic diversity, a metric of utilizing text embedding models for semantic representations.
arXiv Detail & Related papers (2024-08-15T17:59:30Z) - LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation [67.24113079928668]
We present LexMatcher, a method for data curation driven by the coverage of senses found in bilingual dictionaries.
Our approach outperforms the established baselines on the WMT2022 test sets.
arXiv Detail & Related papers (2024-06-03T15:30:36Z) - An approach for mistranslation removal from popular dataset for Indic MT
Task [5.4755933832880865]
We propose an algorithm to remove mistranslations from the training corpus and evaluate its performance and efficiency.
Two Indic languages (ILs), namely, Hindi (HIN) and Odia (ODI) are chosen for the experiment.
The quality of the translations in the experiment is evaluated using standard metrics such as BLEU, METEOR, and RIBES.
arXiv Detail & Related papers (2024-01-12T06:37:19Z) - There's no Data Like Better Data: Using QE Metrics for MT Data Filtering [25.17221095970304]
We analyze the viability of using QE metrics for filtering out bad quality sentence pairs in the training data of neural machine translation systems(NMT)
We show that by selecting the highest quality sentence pairs in the training data, we can improve translation quality while reducing the training size by half.
arXiv Detail & Related papers (2023-11-09T13:21:34Z) - When Does Monolingual Data Help Multilingual Translation: The Role of Domain and Model Scale [73.69252847606212]
We examine how denoising autoencoding (DAE) and backtranslation (BT) impact machine translation (MMT)
We find that monolingual data generally helps MMT, but models are surprisingly brittle to domain mismatches, especially at smaller model scales.
As scale increases, DAE transitions from underperforming the parallel-only baseline at 90M to converging with BT performance at 1.6B, and even surpassing it in low-resource.
arXiv Detail & Related papers (2023-05-23T14:48:42Z) - Improving Multilingual Translation by Representation and Gradient
Regularization [82.42760103045083]
We propose a joint approach to regularize NMT models at both representation-level and gradient-level.
Our results demonstrate that our approach is highly effective in both reducing off-target translation occurrences and improving zero-shot translation performance.
arXiv Detail & Related papers (2021-09-10T10:52:21Z) - Distributionally Robust Multilingual Machine Translation [94.51866646879337]
We propose a new learning objective for Multilingual neural machine translation (MNMT) based on distributionally robust optimization.
We show how to practically optimize this objective for large translation corpora using an iterated best response scheme.
Our method consistently outperforms strong baseline methods in terms of average and per-language performance under both many-to-one and one-to-many translation settings.
arXiv Detail & Related papers (2021-09-09T03:48:35Z) - On the Language Coverage Bias for Neural Machine Translation [81.81456880770762]
Language coverage bias is important for neural machine translation (NMT) because the target-original training data is not well exploited in current practice.
By carefully designing experiments, we provide comprehensive analyses of the language coverage bias in the training data.
We propose two simple and effective approaches to alleviate the language coverage bias problem.
arXiv Detail & Related papers (2021-06-07T01:55:34Z) - El Departamento de Nosotros: How Machine Translated Corpora Affects
Language Models in MRC Tasks [0.12183405753834563]
Pre-training large-scale language models (LMs) requires huge amounts of text corpora.
We study the caveats of applying directly translated corpora for fine-tuning LMs for downstream natural language processing tasks.
We show that careful curation along with post-processing lead to improved performance and overall LMs robustness.
arXiv Detail & Related papers (2020-07-03T22:22:44Z) - Selecting Backtranslated Data from Multiple Sources for Improved Neural
Machine Translation [8.554761233491236]
We analyse the impact that data translated with rule-based, phrase-based statistical and neural MT systems has on new MT systems.
We exploit different data selection strategies in order to reduce the amount of data used, while at the same time maintaining high-quality MT systems.
arXiv Detail & Related papers (2020-05-01T10:50:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.