SLoW: Select Low-frequency Words! Automatic Dictionary Selection for Translation on Large Language Models
- URL: http://arxiv.org/abs/2507.18902v1
- Date: Fri, 25 Jul 2025 02:51:14 GMT
- Title: SLoW: Select Low-frequency Words! Automatic Dictionary Selection for Translation on Large Language Models
- Authors: Hongyuan Lu, Zixuan Li, Zefan Zhang, Wai Lam
- Abstract summary: This paper proposes a novel task called Automatic Dictionary Selection (ADS). The goal of the task is to automatically select which dictionary to use to enhance translation.
- Score: 47.604473591750605
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There are more than 7,000 languages around the world, and current Large Language Models (LLMs) support only hundreds of them. Dictionary-based prompting methods can enhance translation for the unsupported languages, but most methods use all the available dictionaries, which can be expensive in tokens. It would be more flexible to allow a trade-off between token consumption and translation performance. This paper proposes a novel task called Automatic Dictionary Selection (ADS). The goal of the task is to automatically select which dictionary to use to enhance translation. We propose a novel and effective method, which we call Select Low-frequency Words! (SLoW), that selects the dictionaries for words with a lower frequency. Our method has two distinctive advantages. First, frequency estimation needs no access to the training data (which is usually unavailable). Second, it inherits the advantage of dictionary-based methods, where no additional tuning of the LLM is required. Experimental results on 100 languages from FLORES indicate that SLoW surpasses strong baselines and clearly reduces token usage, with many languages even surpassing the translation performance of the full-dictionary baseline. (Footnote: A striking fact is that there is no need to use the actual training data, which is often unobtainable, for frequency estimation; a frequency estimate obtained from public resources is still effective in improving translation with ChatGPT, Llama, and DeepSeek.) (Footnote: Code and data available upon publication.)
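The selection idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the exact selection rule, so the sketch assumes a fixed budget of dictionary entries per sentence, whitespace tokenization, and word counts taken from a small public corpus. The function names (`estimate_frequencies`, `select_low_frequency_entries`, `build_prompt`), the prompt wording, and the toy English-French data are all hypothetical.

```python
from collections import Counter


def estimate_frequencies(public_corpus):
    """Count word occurrences in a public corpus.

    Assumption: this stands in for the LLM's training data,
    which the abstract notes is usually unavailable.
    """
    counts = Counter()
    for sentence in public_corpus:
        counts.update(sentence.lower().split())
    return counts


def select_low_frequency_entries(sentence, dictionary, freq, budget):
    """Return up to `budget` (word, translation) pairs, rarest words first."""
    seen = set()
    candidates = []
    for word in sentence.lower().split():
        # Keep only words that have a dictionary entry, once each.
        if word in dictionary and word not in seen:
            seen.add(word)
            candidates.append(word)
    # Words missing from the frequency table count as frequency 0.
    # The sort is stable, so ties keep their order in the sentence.
    candidates.sort(key=lambda w: freq.get(w, 0))
    return [(w, dictionary[w]) for w in candidates[:budget]]


def build_prompt(sentence, entries, target_language):
    """Prepend the selected dictionary hints to a translation request."""
    hints = "\n".join(f'"{w}" means "{t}"' for w, t in entries)
    return f"{hints}\nTranslate into {target_language}: {sentence}"


# Toy data, purely illustrative.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a heron stood on the mat",
]
toy_dictionary = {"the": "le", "cat": "chat", "heron": "heron", "mat": "tapis"}

freq = estimate_frequencies(corpus)
entries = select_low_frequency_entries(
    "the heron watched the cat", toy_dictionary, freq, budget=2
)
print(build_prompt("the heron watched the cat", entries, "French"))
```

With a budget of 2, the sketch keeps the entries for "heron" and "cat" and drops the entry for the very frequent "the", which is the token-saving trade-off the abstract describes. Raising `budget` moves toward the full-dictionary baseline.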
Related papers
- Dictionaries to the Rescue: Cross-Lingual Vocabulary Transfer for Low-Resource Languages Using Bilingual Dictionaries [22.562544826766917]
Cross-lingual vocabulary transfer plays a promising role in adapting pre-trained language models to new languages. Existing approaches that utilize monolingual or parallel corpora face challenges when applied to languages with limited resources.
arXiv Detail & Related papers (2025-06-02T10:52:52Z) - Efficient Continual Pre-training of LLMs for Low-resource Languages [45.44796295841526]
We develop a new algorithm to select a subset of texts from a larger corpus. In search of further improvement, we design a new algorithm to select tokens to include in the LLM vocabulary.
arXiv Detail & Related papers (2024-12-13T16:13:35Z) - Dictionary Insertion Prompting for Multilingual Reasoning on Multilingual Large Language Models [52.00446751692225]
We present a novel and simple yet effective method called Dictionary Insertion Prompting (DIP).
When given a non-English prompt, DIP looks up a word dictionary and inserts the words' English counterparts into the prompt for LLMs.
This enables better translation into English and better English reasoning steps, which leads to clearly better results.
arXiv Detail & Related papers (2024-11-02T05:10:50Z) - How Can We Effectively Expand the Vocabulary of LLMs with 0.01GB of Target Language Text? [38.1823640848362]
Large language models (LLMs) have shown remarkable capabilities in many languages beyond English.
LLMs require more inference steps when generating non-English text due to their reliance on English-centric tokenizers and vocabulary.
Vocabulary expansion with target language tokens is a widely used cross-lingual vocabulary adaptation approach to remedy this issue.
arXiv Detail & Related papers (2024-06-17T12:42:34Z) - LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation [67.24113079928668]
We present LexMatcher, a method for data curation driven by the coverage of senses found in bilingual dictionaries.
Our approach outperforms the established baselines on the WMT2022 test sets.
arXiv Detail & Related papers (2024-06-03T15:30:36Z) - Translate to Disambiguate: Zero-shot Multilingual Word Sense Disambiguation with Pretrained Language Models [67.19567060894563]
Pretrained Language Models (PLMs) learn rich cross-lingual knowledge and can be finetuned to perform well on diverse tasks.
We present a new study investigating how well PLMs capture cross-lingual word sense with Contextual Word-Level Translation (C-WLT).
We find that as the model size increases, PLMs encode more cross-lingual word sense knowledge and better use context to improve WLT performance.
arXiv Detail & Related papers (2023-04-26T19:55:52Z) - DICTDIS: Dictionary Constrained Disambiguation for Improved NMT [50.888881348723295]
We present DictDis, a lexically constrained NMT system that disambiguates between multiple candidate translations derived from dictionaries.
We demonstrate the utility of DictDis via extensive experiments on English-Hindi and English-German sentences in a variety of domains, including regulatory, finance, and engineering.
arXiv Detail & Related papers (2022-10-13T13:04:16Z) - Dict-NMT: Bilingual Dictionary based NMT for Extremely Low Resource Languages [1.8787713898828164]
We present a detailed analysis of the effects of the quality of dictionaries, training dataset size, language family, etc., on the translation quality.
Results on multiple low-resource test languages show a clear advantage of our bilingual dictionary-based method over the baselines.
arXiv Detail & Related papers (2022-06-09T12:03:29Z) - Cross-lingual Transfer for Text Classification with Dictionary-based Heterogeneous Graph [10.64488240379972]
In cross-lingual text classification, it is required that task-specific training data in high-resource source languages are available.
Collecting such training data can be infeasible because of the labeling cost, task characteristics, and privacy concerns.
This paper proposes an alternative solution that uses only task-independent word embeddings of high-resource languages and bilingual dictionaries.
arXiv Detail & Related papers (2021-09-09T16:40:40Z) - Active Learning for Massively Parallel Translation of Constrained Text into Low Resource Languages [26.822210580244885]
We translate a closed text that is known in advance and available in many languages into a new and severely low resource language.
We compare the portion-based approach, which optimizes coherence of the text locally, with the random sampling approach, which increases coverage of the text globally.
We propose an algorithm for human and machine to work together seamlessly to translate a closed text into a severely low resource language.
arXiv Detail & Related papers (2021-08-16T14:49:50Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.