Related papers: LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation

URL: http://arxiv.org/abs/2406.01441v2
Date: Tue, 2 Jul 2024 08:00:23 GMT
Title: LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation
Authors: Yongjing Yin, Jiali Zeng, Yafu Li, Fandong Meng, Yue Zhang
Abstract summary: We present LexMatcher, a method for data curation driven by the coverage of senses found in bilingual dictionaries. Our approach outperforms the established baselines on the WMT2022 test sets.
Score: 67.24113079928668
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The fine-tuning of open-source large language models (LLMs) for machine translation has recently received considerable attention, marking a shift towards data-centric research from traditional neural machine translation. However, the area of data collection for instruction fine-tuning in machine translation remains relatively underexplored. In this paper, we present LexMatcher, a simple yet effective method for data curation, the design of which is driven by the coverage of senses found in bilingual dictionaries. The construction process comprises data retrieval from an existing corpus and data augmentation that supplements the infrequent senses of polysemous words. Utilizing LLaMA2 as our base model, our approach outperforms the established baselines on the WMT2022 test sets and also exhibits remarkable performance in tasks related to word sense disambiguation and specialized terminology translation. These results underscore the effectiveness of LexMatcher in enhancing LLM-based machine translation. The code, data, and models are available at https://github.com/ARIES-LM/Lexmatcher-MT.git.
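
The abstract describes data retrieval driven by the coverage of dictionary senses. A minimal sketch of that general idea follows; it is an illustration under stated assumptions, not the authors' implementation (see the linked repository for that). The function names, the whitespace tokenization, the `max_per_sense` cap, and the toy English-French data are all hypothetical.

```python
from collections import Counter


def select_by_sense_coverage(corpus, dictionary, max_per_sense=1):
    """Keep a sentence pair if it contains a (source word, target translation)
    dictionary entry that has been covered fewer than max_per_sense times.

    Assumption: naive lowercase whitespace tokenization stands in for real
    tokenization and lemmatization.
    """
    counts = Counter()
    selected = []
    for src, tgt in corpus:
        src_tokens = set(src.lower().split())
        tgt_tokens = set(tgt.lower().split())
        # Dictionary senses whose word and translation both occur in this pair.
        matched = [
            (word, trans)
            for word in src_tokens
            for trans in dictionary.get(word, ())
            if trans in tgt_tokens
        ]
        # Select the pair only if it adds coverage for an under-covered sense.
        if any(counts[sense] < max_per_sense for sense in matched):
            selected.append((src, tgt))
            for sense in matched:
                counts[sense] += 1
    return selected, counts


def uncovered_senses(dictionary, counts):
    """List dictionary senses that no selected pair covers."""
    return sorted(
        (word, trans)
        for word, translations in dictionary.items()
        for trans in translations
        if counts[(word, trans)] == 0
    )


# Hypothetical toy data: two senses each for "bank" and "spring".
dictionary = {"bank": {"banque", "rive"}, "spring": {"printemps", "ressort"}}
corpus = [
    ("the bank is open", "la banque est ouverte"),
    ("my bank is far", "ma banque est loin"),
    ("we sat on the bank", "nous sommes assis sur la rive"),
    ("spring is here", "le printemps arrive"),
]

selected, counts = select_by_sense_coverage(corpus, dictionary)
missing = uncovered_senses(dictionary, counts)
```

In this toy run the second sentence pair is skipped because the sense ("bank", "banque") is already covered, and `missing` lists the senses that retrieval could not find. Senses left uncovered in this way are the kind of gap that the augmentation step mentioned in the abstract is meant to fill.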

Related papers

Aligning Large Language Models to Low-Resource Languages through LLM-Based Selective Translation: A Systematic Study [1.0470286407954037]
Selective translation is a technique that translates only the translatable parts of a text while preserving non-translatable content and sentence structure. Our experiments focus on the low-resource Indic language Hindi and compare translations generated by Google Cloud Translation (GCP) and Llama-3.1-405B.
arXiv Detail & Related papers (2025-07-18T18:21:52Z)
MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs [54.5729817345543]
MOLE is a framework that automatically extracts metadata attributes from scientific papers covering datasets of languages other than Arabic. Our methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output.
arXiv Detail & Related papers (2025-05-26T10:31:26Z)
English Please: Evaluating Machine Translation for Multilingual Bug Reports [0.0]
This study is the first comprehensive evaluation of machine translation (MT) performance on bug reports. We employ multiple machine translation metrics, including BLEU, BERTScore, COMET, METEOR, and ROUGE. DeepL consistently outperforms the other systems, demonstrating strong lexical and semantic alignment.
arXiv Detail & Related papers (2025-02-20T07:47:03Z)
A comparison of data filtering techniques for English-Polish LLM-based machine translation in the biomedical domain [2.490363177524421]
This paper evaluates the impact of data filtering techniques on English-Polish translation within the biomedical domain. We created varying dataset sizes to fine-tune the mBART50 model, which was then evaluated using the SacreBLEU metric. Our results show that both LASER and MUSE can significantly reduce dataset sizes while maintaining or even enhancing performance.
arXiv Detail & Related papers (2025-01-27T22:12:09Z)
An approach for mistranslation removal from popular dataset for Indic MT Task [5.4755933832880865]
We propose an algorithm to remove mistranslations from the training corpus and evaluate its performance and efficiency. Two Indic languages (ILs), namely, Hindi (HIN) and Odia (ODI) are chosen for the experiment. The quality of the translations in the experiment is evaluated using standard metrics such as BLEU, METEOR, and RIBES.
arXiv Detail & Related papers (2024-01-12T06:37:19Z)
Towards Effective Disambiguation for Machine Translation with Large Language Models [65.80775710657672]
We study the capabilities of large language models to translate "ambiguous sentences". Experiments show that our methods can match or outperform state-of-the-art systems such as DeepL and NLLB in four out of five language directions.
arXiv Detail & Related papers (2023-09-20T22:22:52Z)
Sinhala-English Parallel Word Dictionary Dataset [0.554780083433538]
We introduce three parallel English-Sinhala word dictionaries (En-Si-dict-large, En-Si-dict-filtered, En-Si-dict-FastText) which help in multilingual Natural Language Processing (NLP) tasks related to English and Sinhala languages.
arXiv Detail & Related papers (2023-08-04T10:21:35Z)
Bilex Rx: Lexical Data Augmentation for Massively Multilingual Machine Translation [33.6064740446337]
This work explores a cheap and abundant resource to combat this problem: bilingual lexica. We test the efficacy of bilingual lexica in a real-world set-up, on 200-language translation models trained on web-crawled text. We present several findings: (1) using lexical data augmentation, we demonstrate sizable performance gains for unsupervised translation; (2) we compare several families of data augmentation, demonstrating that they yield similar improvements; and (3) we demonstrate the importance of carefully curated lexica over larger, noisier ones.
arXiv Detail & Related papers (2023-03-27T14:54:43Z)
Modeling Sequential Sentence Relation to Improve Cross-lingual Dense Retrieval [87.11836738011007]
We propose a multilingual language model called the masked sentence model (MSM). MSM consists of a sentence encoder to generate the sentence representations, and a document encoder applied to a sequence of sentence vectors from a document. To train the model, we propose a masked sentence prediction task, which masks and predicts the sentence vector via a hierarchical contrastive loss with sampled negatives.
arXiv Detail & Related papers (2023-02-03T09:54:27Z)
On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks. We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments. We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z)
Mixed Attention Transformer for Leveraging Word-Level Knowledge to Neural Cross-Lingual Information Retrieval [15.902630454568811]
We propose a novel Mixed Attention Transformer (MAT) that incorporates external word level knowledge, such as a dictionary or translation table. By encoding the translation knowledge into an attention matrix, the model with MAT is able to focus on the mutually translated words in the input sequence.
arXiv Detail & Related papers (2021-09-07T00:33:14Z)
XLM-T: Scaling up Multilingual Machine Translation with Pretrained Cross-lingual Transformer Encoders [89.0059978016914]
We present XLM-T, which initializes the model with an off-the-shelf pretrained cross-lingual Transformer and fine-tunes it with multilingual parallel data. This simple method achieves significant improvements on a WMT dataset with 10 language pairs and the OPUS-100 corpus with 94 pairs.
arXiv Detail & Related papers (2020-12-31T11:16:51Z)
Constraint Translation Candidates: A Bridge between Neural Query Translation and Cross-lingual Information Retrieval [45.88734029123836]
We propose a novel approach that limits the open target vocabulary search space of query translation (QT) to a set of important words mined from a search index database. The proposed methods are applied and examined in a real-world CLIR system, the Aliexpress e-Commerce search engine.
arXiv Detail & Related papers (2020-10-26T15:27:51Z)
FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning. During inference, the model makes predictions based on the text input in the target language and its translation in the source language. To tackle this issue, we propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.