"Vorbeşti Româneşte?" A Recipe to Train Powerful Romanian LLMs with English Instructions
- URL: http://arxiv.org/abs/2406.18266v4
- Date: Thu, 24 Oct 2024 08:29:13 GMT
- Title: "Vorbeşti Româneşte?" A Recipe to Train Powerful Romanian LLMs with English Instructions
- Authors: Mihai Masala, Denis C. Ilie-Ablachim, Alexandru Dima, Dragos Corlatescu, Miruna Zavelca, Ovio Olaru, Simina Terian, Andrei Terian, Marius Leordeanu, Horia Velicu, Marius Popescu, Mihai Dascalu, Traian Rebedea
- Abstract summary: We are the first to collect and translate a large collection of texts, instructions, and benchmarks and train, evaluate, and release open-source LLMs tailored for Romanian.
We argue for the usefulness and high performance of RoLLMs by obtaining state-of-the-art results across the board.
- Score: 40.64721381920061
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, Large Language Models (LLMs) have achieved almost human-like performance on various tasks. While some LLMs have been trained on multilingual data, most of the training data is in English; hence, their performance in English greatly exceeds that in other languages. To our knowledge, we are the first to collect and translate a large collection of texts, instructions, and benchmarks and to train, evaluate, and release open-source LLMs tailored for Romanian. We evaluate our methods on four different categories, including academic benchmarks, MT-Bench (manually translated), and a professionally built historical, cultural, and social benchmark adapted to Romanian. We argue for the usefulness and high performance of RoLLMs by obtaining state-of-the-art results across the board. We publicly release all resources (i.e., data, training and evaluation code, models) to support and encourage research on Romanian LLMs while concurrently creating a generalizable recipe, adequate for other low- or less-resourced languages.
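To make the recipe concrete, here is a minimal sketch of the translate-then-fine-tune idea from the abstract: English instruction data is machine-translated into Romanian and written out in a format that a standard supervised fine-tuning script can consume. The Helsinki-NLP/opus-mt-en-ro model and the toy instruction examples are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch of the translate-then-fine-tune recipe described in the abstract.
# Assumptions (not from the paper): the MT model "Helsinki-NLP/opus-mt-en-ro" and
# the toy instruction examples below are illustrative placeholders only.
import json
from transformers import pipeline

# English instruction data in a common {"instruction", "output"} layout (placeholders).
english_instructions = [
    {"instruction": "Explain what a large language model is.",
     "output": "A large language model is a neural network trained on vast text corpora."},
    {"instruction": "Summarize the main idea of curriculum learning.",
     "output": "Curriculum learning presents training examples from easy to hard."},
]

# Off-the-shelf English->Romanian translation model (an assumption; the paper's
# own translation setup may differ).
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ro")

def translate(text: str) -> str:
    """Translate a single English string into Romanian."""
    return translator(text, max_length=512)[0]["translation_text"]

# Build the Romanian instruction set by translating both fields of every example.
romanian_instructions = [
    {"instruction": translate(ex["instruction"]), "output": translate(ex["output"])}
    for ex in english_instructions
]

# Persist as JSONL so a standard supervised fine-tuning script can consume it later.
with open("ro_instructions.jsonl", "w", encoding="utf-8") as f:
    for ex in romanian_instructions:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```

The same translate-and-store pattern would also apply to the evaluation side of the recipe, where benchmarks are translated or adapted into Romanian before scoring.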
Related papers
- LLMic: Romanian Foundation Language Model [76.09455151754062]
We present LLMic, a foundation language model designed specifically for the Romanian Language.
We show that fine-tuning LLMic for language translation after the initial pretraining phase outperforms existing solutions in English-to-Romanian translation tasks.
arXiv Detail & Related papers (2025-01-13T22:14:45Z) - Exploring Large Language Models for Translating Romanian Computational Problems into English [0.0]
This study shows that robust large language models (LLMs) can maintain or even enhance their performance in translating less common languages when given well-structured prompts.
We evaluate several translation methods across multiple LLMs, including OpenRoLLM, Llama 3.1 8B, Llama 3.2 3B, and GPT-4o.
arXiv Detail & Related papers (2025-01-09T22:17:44Z) - Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization [108.6908427615402]
Cross-lingual summarization (CLS) aims to generate a summary for the source text in a different target language.
Currently, instruction-tuned large language models (LLMs) excel at various English tasks.
Recent studies have shown that LLMs' performance on CLS tasks remains unsatisfactory even with few-shot settings.
arXiv Detail & Related papers (2024-10-26T00:39:44Z) - Multilingual Prompts in LLM-Based Recommenders: Performance Across Languages [0.0]
This work explores the impact of non-English prompts on recommendation performance.
Evaluation on three real-world datasets, namely ML1M, LastFM, and Amazon-Beauty, showed that using non-English prompts generally reduces performance.
Retraining with multilingual prompts resulted in more balanced performance across languages, but slightly reduced English performance.
arXiv Detail & Related papers (2024-09-11T20:31:42Z) - OpenLLM-Ro -- Technical Report on Open-source Romanian LLMs [11.689131290480619]
Large Language Models (LLMs) have achieved almost human-like performance on various tasks.
This document presents our approach to training and evaluating the first foundational and chat LLM specialized for Romanian.
arXiv Detail & Related papers (2024-05-13T12:46:11Z) - Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages [0.0]
Large Language Models (LLMs) have shown incredible proficiency at natural language processing tasks.
LLMs often struggle to perform well on low-resource languages because there is so little training data available.
In this work, we explore training LLaMA-2 to speak Amharic, a language which is spoken by over 50 million people worldwide.
arXiv Detail & Related papers (2024-03-11T01:04:36Z) - Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages [60.162717568496355]
Large language models (LLMs) have been pre-trained on multilingual corpora.
However, their performance in most languages still lags behind that of a few resource-rich languages.
arXiv Detail & Related papers (2024-02-19T15:07:32Z) - Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks.
LLMs' ability for general-purpose language understanding and generation is acquired by training billions of model parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z) - TaCo: Enhancing Cross-Lingual Transfer for Low-Resource Languages in LLMs through Translation-Assisted Chain-of-Thought Processes [9.254047358707014]
We introduce the Multilingual Instruction-Tuning dataset (MITS), comprising translations of Alpaca-52K, Dolly-15K, and the Vicuna Benchmark into 132 languages.
Secondly, we propose a new method called TaCo: Translation-Assisted Cross-Linguality, which utilizes translations in a chain-of-thought process to instruction-tune LLMs on new languages through a curriculum-learning process (a minimal prompt sketch in this spirit appears after this list).
Our results indicate that the TaCo method achieves an 82% score from GPT-4 for a low-resource language on the Vicuna Benchmark, doubling the performance compared to instruction tuning.
arXiv Detail & Related papers (2023-11-17T06:55:32Z) - CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed.
We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z) - Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback [61.83548032416181]
We present Okapi, the first system with instruction-tuned LLMs based on RLHF for multiple languages.
Okapi introduces instruction and response-ranked data in 26 diverse languages to facilitate the experiments and development of future multilingual LLM research.
arXiv Detail & Related papers (2023-07-29T18:01:46Z) - Resources and Few-shot Learners for In-context Learning in Slavic Languages [0.22940141855172028]
We collect the infrastructure necessary for training and evaluation of in-context learning (ICL) in Slavic languages.
We evaluate a set of the most recent in-context learners and compare their results to the supervised baselines.
We find that ICL models tuned in English are also able to learn some tasks from non-English contexts.
arXiv Detail & Related papers (2023-04-04T16:16:25Z)
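Because TaCo is described above only at a high level, the following is a rough, hypothetical sketch of a translation-assisted chain-of-thought training example: the target-language instruction is first translated into English, answered in English, and the answer is then translated back into the target language. The template wording, field names, and the build_training_example helper are assumptions for illustration, not the paper's actual format or curriculum.

```python
# Illustrative approximation of a translation-assisted chain-of-thought example in
# the spirit of TaCo; the exact template and curriculum used in the paper are not
# reproduced here, so treat field names and wording as assumptions.

TACO_STYLE_TEMPLATE = """### Instruction ({language}):
{instruction}

### Response:
Step 1 - Translate the instruction into English:
{instruction_en}

Step 2 - Answer in English:
{answer_en}

Step 3 - Translate the answer back into {language}:
{answer_target}"""

def build_training_example(language: str,
                           instruction: str,
                           instruction_en: str,
                           answer_en: str,
                           answer_target: str) -> str:
    """Assemble one supervised example whose target interleaves translation and reasoning."""
    return TACO_STYLE_TEMPLATE.format(
        language=language,
        instruction=instruction,
        instruction_en=instruction_en,
        answer_en=answer_en,
        answer_target=answer_target,
    )

# Toy usage with placeholder Romanian text (illustrative only).
print(build_training_example(
    language="Romanian",
    instruction="Ce este un model lingvistic mare?",
    instruction_en="What is a large language model?",
    answer_en="A large language model is a neural network trained on large text corpora.",
    answer_target="Un model lingvistic mare este o rețea neuronală antrenată pe corpusuri mari de text.",
))
```

A model fine-tuned on examples of this shape is pushed to use translation as an intermediate reasoning step, which is the core idea behind the cross-lingual transfer the TaCo paper reports.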