Related papers: Introducing cosmosGPT: Monolingual Training for Turkish Language Models

Introducing cosmosGPT: Monolingual Training for Turkish Language Models

URL: http://arxiv.org/abs/2404.17336v1
Date: Fri, 26 Apr 2024 11:34:11 GMT
Title: Introducing cosmosGPT: Monolingual Training for Turkish Language Models
Authors: H. Toprak Kesgin, M. Kaan Yuce, Eren Dogan, M. Egemen Uzun, Atahan Uz, H. Emre Seyrek, Ahmed Zeer, M. Fatih Amasyali,
Abstract summary: This study introduces the cosmosGPT models that we created with this alternative method. We then introduce new finetune datasets for basic language models to fulfill user requests and new evaluation datasets for measuring the capabilities of Turkish language models. The results show that the language models we built with the monolingual corpus have promising performance despite being about 10 times smaller than the others.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The number of open source language models that can produce Turkish is increasing day by day, as in other languages. In order to create the basic versions of such models, the training of multilingual models is usually continued with Turkish corpora. The alternative is to train the model with only Turkish corpora. In this study, we first introduce the cosmosGPT models that we created with this alternative method. Then, we introduce new finetune datasets for basic language models to fulfill user requests and new evaluation datasets for measuring the capabilities of Turkish language models. Finally, a comprehensive comparison of the adapted Turkish language models on different capabilities is presented. The results show that the language models we built with the monolingual corpus have promising performance despite being about 10 times smaller than the others.

Related papers

TurkBench: A Benchmark for Evaluating Turkish Large Language Models [0.0]
TurkBench is a benchmark designed to assess the capabilities of generative large language models in the Turkish language.<n>It involves 8,151 data samples across 21 distinct subtasks.<n>The diverse range of tasks and the culturally relevant data would provide researchers and developers with a valuable tool for evaluating their models.
arXiv Detail & Related papers (2026-01-11T18:28:23Z)
Türkçe Dil Modellerinin Performans Karşılaştırması Performance Comparison of Turkish Language Models [0.0]
A comparison is made among seven selected language models based on their contextual learning and question-answering abilities. The results show that for question-answering, continuing pretraining before fine-tuning with instructional datasets is more successful in adapting multilingual models to Turkish.
arXiv Detail & Related papers (2024-04-25T20:10:14Z)
CMULAB: An Open-Source Framework for Training and Deployment of Natural Language Processing Models [59.91221728187576]
This paper introduces the CMU Linguistic Linguistic Backend, an open-source framework that simplifies model deployment and continuous human-in-the-loop fine-tuning of NLP models. CMULAB enables users to leverage the power of multilingual models to quickly adapt and extend existing tools for speech recognition, OCR, translation, and syntactic analysis to new languages.
arXiv Detail & Related papers (2024-04-03T02:21:46Z)
The Less the Merrier? Investigating Language Representation in Multilingual Models [8.632506864465501]
We investigate the linguistic representation of different languages in multilingual models. We observe from our experiments that community-centered models perform better at distinguishing between languages in the same family for low-resource languages.
arXiv Detail & Related papers (2023-10-20T02:26:34Z)
Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks on evaluation code generation models: MBXP and Multilingual HumanEval, and MathQA-X. These datasets cover over 10 programming languages. We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z)
Continual Learning in Multilingual NMT via Language-Specific Embeddings [92.91823064720232]
It consists in replacing the shared vocabulary with a small language-specific vocabulary and fine-tuning the new embeddings on the new language's parallel data. Because the parameters of the original model are not modified, its performance on the initial languages does not degrade.
arXiv Detail & Related papers (2021-10-20T10:38:57Z)
Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan [0.05277024349608833]
This work focuses on Catalan with the aim of exploring what extent a medium-sized monolingual language model is competitive with state-of-the-art large multilingual models. We build a clean, high-quality textual Catalan corpus (CaText), train a Transformer-based language model for Catalan (BERTa), and devise a thorough evaluation in a diversity of settings. The result is a new benchmark, the Catalan Language Understanding Benchmark (CLUB), which we publish as an open resource.
arXiv Detail & Related papers (2021-07-16T13:52:01Z)
Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict. This work shows a comparison of a neural model and character language models with varying amounts on target language data. Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
WikiBERT models: deep transfer learning for many languages [1.3455090151301572]
We introduce a simple, fully automated pipeline for creating languagespecific BERT models from Wikipedia data. We assess the merits of these models using the state-of-the-art UDify on Universal Dependencies data.
arXiv Detail & Related papers (2020-06-02T11:57:53Z)
ParsBERT: Transformer-based Model for Persian Language Understanding [0.7646713951724012]
This paper proposes a monolingual BERT for the Persian language (ParsBERT) It shows its state-of-the-art performance compared to other architectures and multilingual models. ParsBERT obtains higher scores in all datasets, including existing ones as well as composed ones.
arXiv Detail & Related papers (2020-05-26T05:05:32Z)
Structure-Level Knowledge Distillation For Multilingual Sequence Labeling [73.40368222437912]
We propose to reduce the gap between monolingual models and the unified multilingual model by distilling the structural knowledge of several monolingual models to the unified multilingual model (student) Our experiments on 4 multilingual tasks with 25 datasets show that our approaches outperform several strong baselines and have stronger zero-shot generalizability than both the baseline model and teacher models.
arXiv Detail & Related papers (2020-04-08T07:14:01Z)
XPersona: Evaluating Multilingual Personalized Chatbot [76.00426517401894]
We propose a multi-lingual extension of Persona-Chat, namely XPersona. Our dataset includes persona conversations in six different languages other than English for building and evaluating multilingual personalized agents.
arXiv Detail & Related papers (2020-03-17T07:52:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.