Cerbero-7B: A Leap Forward in Language-Specific LLMs Through Enhanced
Chat Corpus Generation and Evaluation
- URL: http://arxiv.org/abs/2311.15698v1
- Date: Mon, 27 Nov 2023 10:34:55 GMT
- Title: Cerbero-7B: A Leap Forward in Language-Specific LLMs Through Enhanced
Chat Corpus Generation and Evaluation
- Authors: Federico A. Galatolo, Mario G.C.A. Cimino
- Abstract summary: This study introduces a novel approach for generating high-quality, language-specific chat corpora using a self-chat mechanism.
We generate an Italian chat corpus and refine the Fauno corpus, which is based on translated English ChatGPT self-chat data.
The Italian LLM fine-tuned with these corpora demonstrates significantly enhanced language comprehension and question-answering skills.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study introduces a novel approach for generating high-quality,
language-specific chat corpora using a self-chat mechanism. We combine a
generator LLM for creating new samples and an embedder LLM to ensure diversity.
A new Masked Language Modelling (MLM) model-based quality assessment metric is
proposed for evaluating and filtering the corpora. Utilizing the llama2-70b as
the generator and a multilingual sentence transformer as embedder, we generate
an Italian chat corpus and refine the Fauno corpus, which is based on
translated English ChatGPT self-chat data. The refinement uses structural
assertions and Natural Language Processing techniques. Both corpora undergo a
comprehensive quality evaluation using the proposed MLM model-based quality
metric. The Italian LLM fine-tuned with these corpora demonstrates
significantly enhanced language comprehension and question-answering skills.
The resultant model, cerbero-7b, establishes a new state-of-the-art for Italian
LLMs. This approach marks a substantial advancement in the development of
language-specific LLMs, with a special emphasis on augmenting corpora for
underrepresented languages like Italian.
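The pipeline described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: a toy character-frequency "embedder" stands in for the multilingual sentence transformer, and a caller-supplied per-token scorer stands in for the MLM quality model; function names and the similarity threshold are assumptions.

```python
import math

def embed(text):
    """Toy embedding: L2-normalized character-frequency vector
    (stand-in for a multilingual sentence-transformer embedding)."""
    counts = {}
    for ch in text.lower():
        counts[ch] = counts.get(ch, 0) + 1
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {ch: c / norm for ch, c in counts.items()}

def cosine(a, b):
    """Cosine similarity of two sparse unit vectors."""
    return sum(v * b.get(k, 0.0) for k, v in a.items())

def diversity_filter(candidates, max_sim=0.95):
    """Embedder's role in the pipeline: accept a generated sample only
    if it is not too similar to any sample already accepted."""
    accepted, vectors = [], []
    for text in candidates:
        v = embed(text)
        if all(cosine(v, u) < max_sim for u in vectors):
            accepted.append(text)
            vectors.append(v)
    return accepted

def pseudo_perplexity(text, logprob):
    """MLM-style quality score: exp of the negative mean per-token
    log-probability; lower suggests a more fluent sample. `logprob`
    stands in for a masked language model's per-token score."""
    tokens = text.split()
    return math.exp(-sum(logprob(t) for t in tokens) / len(tokens))
```

In the paper's setup, `embed` would be the multilingual sentence transformer, the candidates would come from llama2-70b self-chat, and `pseudo_perplexity` would use a masked language model to score and filter the corpus.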
Related papers
- Adapting LLMs to Hebrew: Unveiling DictaLM 2.0 with Enhanced Vocabulary and Instruction Capabilities [2.047424180164312]
Training large language models (LLMs) in low-resource languages such as Hebrew poses unique challenges.
We introduce DictaLM2.0 and DictaLM2.0-Instruct, two LLMs trained on a substantial corpus of approximately 200 billion tokens in both Hebrew and English.
arXiv Detail & Related papers (2024-07-09T17:51:37Z)
- Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models [117.20416338476856]
Large language models (LLMs) demonstrate remarkable multilingual capabilities without being pre-trained on specially curated multilingual parallel corpora.
We propose a novel detection method, language activation probability entropy (LAPE), to identify language-specific neurons within LLMs.
Our findings indicate that LLMs' proficiency in processing a particular language is predominantly due to a small subset of neurons.
arXiv Detail & Related papers (2024-02-26T09:36:05Z)
- OMGEval: An Open Multilingual Generative Evaluation Benchmark for Large Language Models [59.54423478596468]
We introduce OMGEval, the first Open-source Multilingual Generative test set that can assess the capability of LLMs in different languages.
For each language, OMGEval provides 804 open-ended questions, covering a wide range of important capabilities of LLMs.
Specifically, the current version of OMGEval includes 5 languages (i.e., Zh, Ru, Fr, Es, Ar).
arXiv Detail & Related papers (2024-02-21T04:42:41Z)
- Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks.
LLMs' ability for general-purpose language understanding and generation is acquired by training billions of model parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z)
- LLaMAntino: LLaMA 2 Models for Effective Text Generation in Italian Language [7.214355350362308]
The LLaMA (Large Language Model Meta AI) family represents a novel advancement in the field of natural language processing.
This study contributes to Language Adaptation strategies for the Italian language by introducing the novel LLaMAntino family of Italian LLMs.
arXiv Detail & Related papers (2023-12-15T18:06:22Z)
- Establishing Vocabulary Tests as a Benchmark for Evaluating Large Language Models [2.7013338932521416]
We advocate for the revival of vocabulary tests as a valuable tool for assessing the performance of Large Language Models (LLMs).
We evaluate seven LLMs using two vocabulary test formats across two languages and uncover surprising gaps in their lexical knowledge.
arXiv Detail & Related papers (2023-10-23T08:45:12Z)
- Let Models Speak Ciphers: Multiagent Debate through Embeddings [84.20336971784495]
We introduce CIPHER (Communicative Inter-Model Protocol Through Embedding Representation) to address this issue.
By deviating from natural language, CIPHER offers an advantage of encoding a broader spectrum of information without any modification to the model weights.
This showcases the superiority and robustness of embeddings as an alternative "language" for communication among LLMs.
arXiv Detail & Related papers (2023-10-10T03:06:38Z)
- L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs)
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
- Several categories of Large Language Models (LLMs): A Short Survey [3.73538163699716]
Large Language Models (LLMs) have become effective tools for natural language processing and have been used in many different fields.
The survey emphasizes recent developments and efforts made for various LLM kinds, including task-based financial LLMs, multilingual language LLMs, biomedical and clinical LLMs, vision language LLMs, and code language models.
arXiv Detail & Related papers (2023-07-05T18:18:23Z)
- Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing a few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z)
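The linguistically-diverse prompting idea in the last entry can be sketched as a simple prompt builder: few-shot exemplars drawn from several high-resource languages are concatenated into one prompt that asks for a translation into English. The template, exemplars, and function name below are illustrative assumptions, not the paper's exact format.

```python
def build_prompt(exemplars, query):
    """exemplars: list of (source_sentence, english_translation) pairs,
    each taken from a different high-resource language; query: the
    sentence (possibly in a low-resource language) to translate."""
    lines = [f"{src} => {en}" for src, en in exemplars]
    lines.append(f"{query} =>")
    return "\n".join(lines)

prompt = build_prompt(
    [("Bonjour le monde", "Hello world"),   # French exemplar
     ("Hola mundo", "Hello world"),         # Spanish exemplar
     ("Hallo Welt", "Hello world")],        # German exemplar
    "Ciao mondo",                           # query to translate
)
```

The resulting string would be sent to the LLM as-is; the paper's finding is that such synthetic, mixed-language exemplars rival supervised few-shot translation prompts.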
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.