Optimizing Large Language Models for Turkish: New Methodologies in Corpus Selection and Training
- URL: http://arxiv.org/abs/2412.02775v1
- Date: Tue, 03 Dec 2024 19:17:18 GMT
- Title: Optimizing Large Language Models for Turkish: New Methodologies in Corpus Selection and Training
- Authors: H. Toprak Kesgin, M. Kaan Yuce, Eren Dogan, M. Egemen Uzun, Atahan Uz, Elif Ince, Yusuf Erdem, Osama Shbib, Ahmed Zeer, M. Fatih Amasyali,
- Abstract summary: We adapt Large Language Model generated datasets and translated English datasets into Turkish.
This approach led to substantial enhancements in model accuracy for both few-shot and zero-shot learning scenarios.
- Score: 0.0
- License:
- Abstract: In this study, we develop and assess new corpus selection and training methodologies to improve the effectiveness of Turkish language models. Specifically, we adapted Large Language Model generated datasets and translated English datasets into Turkish, integrating these resources into the training process. This approach led to substantial enhancements in model accuracy for both few-shot and zero-shot learning scenarios. Furthermore, the merging of these adapted models was found to markedly improve their performance. Human evaluative metrics, including task-specific performance assessments, further demonstrated that these adapted models possess a greater aptitude for comprehending the Turkish language and addressing logic-based queries. This research underscores the importance of refining corpus selection strategies to optimize the performance of multilingual models, particularly for under-resourced languages like Turkish.
Related papers
- MoE-CT: A Novel Approach For Large Language Models Training With Resistance To Catastrophic Forgetting [53.77590764277568]
We introduce a novel MoE-CT architecture that separates the base model's learning from the multilingual expansion process.
Our design freezes the original LLM parameters, thus safeguarding its performance in high-resource languages, while an appended MoE module, trained on diverse language datasets, augments low-resource language proficiency.
arXiv Detail & Related papers (2024-06-25T11:03:45Z) - Bridging the Bosphorus: Advancing Turkish Large Language Models through Strategies for Low-Resource Language Adaptation and Benchmarking [1.3716808114696444]
Large Language Models (LLMs) are becoming crucial across various fields, emphasizing the urgency for high-quality models in underrepresented languages.
This study explores the unique challenges faced by low-resource languages, such as data scarcity, model selection, evaluation, and computational limitations.
arXiv Detail & Related papers (2024-05-07T21:58:45Z) - Introducing cosmosGPT: Monolingual Training for Turkish Language Models [0.0]
This study introduces the cosmosGPT models that we created with this alternative method.
We then introduce new finetune datasets for basic language models to fulfill user requests and new evaluation datasets for measuring the capabilities of Turkish language models.
The results show that the language models we built with the monolingual corpus have promising performance despite being about 10 times smaller than the others.
arXiv Detail & Related papers (2024-04-26T11:34:11Z) - Türkçe Dil Modellerinin Performans Karşılaştırması Performance Comparison of Turkish Language Models [0.0]
A comparison is made among seven selected language models based on their contextual learning and question-answering abilities.
The results show that for question-answering, continuing pretraining before fine-tuning with instructional datasets is more successful in adapting multilingual models to Turkish.
arXiv Detail & Related papers (2024-04-25T20:10:14Z) - Visualizing the Relationship Between Encoded Linguistic Information and
Task Performance [53.223789395577796]
We study the dynamic relationship between the encoded linguistic information and task performance from the viewpoint of Pareto Optimality.
We conduct experiments on two popular NLP tasks, i.e., machine translation and language modeling, and investigate the relationship between several kinds of linguistic information and task performances.
Our empirical findings suggest that some syntactic information is helpful for NLP tasks whereas encoding more syntactic information does not necessarily lead to better performance.
arXiv Detail & Related papers (2022-03-29T19:03:10Z) - Curriculum learning for language modeling [2.2475845406292714]
Language models have proven transformational for the natural language processing community.
These models have proven expensive, energy-intensive, and challenging to train.
Curriculum learning is a method that employs a structured training regime instead.
arXiv Detail & Related papers (2021-08-04T16:53:43Z) - Specializing Multilingual Language Models: An Empirical Study [50.7526245872855]
Contextualized word representations from pretrained multilingual language models have become the de facto standard for addressing natural language tasks.
For languages rarely or never seen by these models, directly using such models often results in suboptimal representation or use of data.
arXiv Detail & Related papers (2021-06-16T18:13:55Z) - Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language
Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z) - Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z) - Dynamic Data Selection and Weighting for Iterative Back-Translation [116.14378571769045]
We propose a curriculum learning strategy for iterative back-translation models.
We evaluate our models on domain adaptation, low-resource, and high-resource MT settings.
Experimental results demonstrate that our methods achieve improvements of up to 1.8 BLEU points over competitive baselines.
arXiv Detail & Related papers (2020-04-07T19:49:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.