Related papers: Cerbero-7B: A Leap Forward in Language-Specific LLMs Through Enhanced Chat Corpus Generation and Evaluation

Cerbero-7B: A Leap Forward in Language-Specific LLMs Through Enhanced Chat Corpus Generation and Evaluation

URL: http://arxiv.org/abs/2311.15698v1
Date: Mon, 27 Nov 2023 10:34:55 GMT
Title: Cerbero-7B: A Leap Forward in Language-Specific LLMs Through Enhanced Chat Corpus Generation and Evaluation
Authors: Federico A. Galatolo, Mario G.C.A. Cimino
Abstract summary: This study introduces a novel approach for generating high-quality, language-specific chat corpora using a self-chat mechanism. We generate an Italian chat corpus and the Fauno corpus, which is based on English ChatGPT self-chat data. The Italian LLM fine-tuned with these corpora demonstrates significantly enhanced language comprehension and question-answering skills.
Score: 0.5967382410041416
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This study introduces a novel approach for generating high-quality, language-specific chat corpora using a self-chat mechanism. We combine a generator LLM for creating new samples and an embedder LLM to ensure diversity. A new Masked Language Modelling (MLM) model-based quality assessment metric is proposed for evaluating and filtering the corpora. Utilizing the llama2-70b as the generator and a multilingual sentence transformer as embedder, we generate an Italian chat corpus and refine the Fauno corpus, which is based on translated English ChatGPT self-chat data. The refinement uses structural assertions and Natural Language Processing techniques. Both corpora undergo a comprehensive quality evaluation using the proposed MLM model-based quality metric. The Italian LLM fine-tuned with these corpora demonstrates significantly enhanced language comprehension and question-answering skills. The resultant model, cerbero-7b, establishes a new state-of-the-art for Italian LLMs. This approach marks a substantial advancement in the development of language-specific LLMs, with a special emphasis on augmenting corpora for underrepresented languages like Italian.

Related papers

MUG-Eval: A Proxy Evaluation Framework for Multilingual Generation Capabilities in Any Language [16.21019515431378]
We propose MUG-Eval, a novel framework that evaluates large language models' multilingual generation capabilities.<n>We transform existing benchmarks into conversational tasks and measure the LLMs' accuracies on those tasks.<n>We evaluate 8 LLMs across 30 languages spanning high, mid, and low-resource categories, and we find that MUG-Eval correlates strongly with established benchmarks.
arXiv Detail & Related papers (2025-05-20T14:14:00Z)
M-Prometheus: A Suite of Open Multilingual LLM Judges [64.22940792713713]
We introduce M-Prometheus, a suite of open-weight LLM judges that can provide both direct assessment and pairwise comparison feedback on multilingual outputs. M-Prometheus models outperform state-of-the-art open LLM judges on multilingual reward benchmarks spanning more than 20 languages, as well as on literary machine translation (MT) evaluation covering 4 language pairs.
arXiv Detail & Related papers (2025-04-07T11:37:26Z)
The Rise and Down of Babel Tower: Investigating the Evolution Process of Multilingual Code Large Language Model [59.357993924917]
We study the evolution of multilingual capabilities in large language models (LLMs) during the pre-training process. We propose the Babel Tower Hypothesis, which describes the entire process of LLMs acquiring new language capabilities. We propose a novel method to construct an optimized pre-training corpus for multilingual code LLMs.
arXiv Detail & Related papers (2024-12-10T08:28:57Z)
Do Large Language Models Have an English Accent? Evaluating and Improving the Naturalness of Multilingual LLMs [13.558778781305998]
Large Language Models (LLMs) are predominantly designed with English as the primary language. Even the few that are multilingual tend to exhibit strong English-centric biases. This paper introduces novel automatic corpus-level metrics to assess the lexical and syntactic naturalness of multilingual outputs.
arXiv Detail & Related papers (2024-10-21T12:34:17Z)
How Do Multilingual Language Models Remember Facts? [50.13632788453612]
We show that previously identified recall mechanisms in English largely apply to multilingual contexts. We localize the role of language during recall, finding that subject enrichment is language-independent. In decoder-only LLMs, FVs compose these two pieces of information in two separate stages.
arXiv Detail & Related papers (2024-10-18T11:39:34Z)
Adapting LLMs to Hebrew: Unveiling DictaLM 2.0 with Enhanced Vocabulary and Instruction Capabilities [2.047424180164312]
Training large language models (LLMs) in low-resource languages such as Hebrew poses unique challenges. We introduce DictaLM2.0 and DictaLM2.0-Instruct, two LLMs trained on a substantial corpus of approximately 200 billion tokens in both Hebrew and English.
arXiv Detail & Related papers (2024-07-09T17:51:37Z)
Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models [117.20416338476856]
Large language models (LLMs) demonstrate remarkable multilingual capabilities without being pre-trained on specially curated multilingual parallel corpora. We propose a novel detection method, language activation probability entropy (LAPE), to identify language-specific neurons within LLMs. Our findings indicate that LLMs' proficiency in processing a particular language is predominantly due to a small subset of neurons.
arXiv Detail & Related papers (2024-02-26T09:36:05Z)
OMGEval: An Open Multilingual Generative Evaluation Benchmark for Large Language Models [59.54423478596468]
We introduce OMGEval, the first Open-source Multilingual Generative test set that can assess the capability of LLMs in different languages. For each language, OMGEval provides 804 open-ended questions, covering a wide range of important capabilities of LLMs. Specifically, the current version of OMGEval includes 5 languages (i.e., Zh, Ru, Fr, Es, Ar)
arXiv Detail & Related papers (2024-02-21T04:42:41Z)
Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks. LLMs' ability of general-purpose language understanding and generation is acquired by training billions of model's parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z)
LLaMAntino: LLaMA 2 Models for Effective Text Generation in Italian Language [7.214355350362308]
The LLaMA (Large Language Model Meta AI) family represents a novel advancement in the field of natural language processing. This study contributes to Language Adaptation strategies for the Italian language by introducing the novel LLaMA family of Italian LLMs.
arXiv Detail & Related papers (2023-12-15T18:06:22Z)
Establishing Vocabulary Tests as a Benchmark for Evaluating Large Language Models [2.7013338932521416]
We advocate for the revival of vocabulary tests as a valuable tool for assessing Large Language Models (LLMs) performance. We evaluate seven LLMs using two vocabulary test formats across two languages and uncover surprising gaps in their lexical knowledge.
arXiv Detail & Related papers (2023-10-23T08:45:12Z)
Let Models Speak Ciphers: Multiagent Debate through Embeddings [84.20336971784495]
We introduce CIPHER (Communicative Inter-Model Protocol Through Embedding Representation) to address this issue. By deviating from natural language, CIPHER offers an advantage of encoding a broader spectrum of information without any modification to the model weights. This showcases the superiority and robustness of embeddings as an alternative "language" for communication among LLMs.
arXiv Detail & Related papers (2023-10-10T03:06:38Z)
L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs) We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods. In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing few exemplars. We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English. Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.