Related papers: Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code

Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code

URL: http://arxiv.org/abs/2404.00399v3
Date: Fri, 27 Dec 2024 03:53:21 GMT
Title: Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code
Authors: Taishi Nakamura, Mayank Mishra, Simone Tedeschi, Yekun Chai, Jason T Stillerman, Felix Friedrich, Prateek Yadav, Tanmay Laud, Vu Minh Chien, Terry Yue Zhuo, Diganta Misra, Ben Bogin, Xuan-Son Vu, Marzena Karpinska, Arnav Varma Dantuluri, Wojciech Kusa, Tommaso Furlanello, Rio Yokota, Niklas Muennighoff, Suhas Pai, Tosin Adewumi, Veronika Laippala, Xiaozhe Yao, Adalberto Junior, Alpay Ariyak, Aleksandr Drozd, Jordan Clive, Kshitij Gupta, Liangyu Chen, Qi Sun, Ken Tsui, Noah Persaud, Nour Fahmy, Tianlong Chen, Mohit Bansal, Nicolo Monti, Tai Dang, Ziyang Luo, Tien-Tung Bui, Roberto Navigli, Virendra Mehta, Matthew Blumberg, Victor May, Huu Nguyen, Sampo Pyysalo,
Abstract summary: This paper presents Aurora-M, a 15B parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code.<n>It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions.<n>We evaluate Aurora-M across a wide range of tasks and languages, showcasing its robustness against catastrophic forgetting.
Score: 123.7406091753529
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Pretrained language models are an integral part of AI applications, but their high computational cost for training limits accessibility. Initiatives such as Bloom and StarCoder aim to democratize access to pretrained models for collaborative community development. Despite these efforts, such models encounter challenges such as limited multilingual capabilities, risks of catastrophic forgetting during continual pretraining, and the high costs of training models from scratch, alongside the need to align with AI safety standards and regulatory frameworks. This paper presents Aurora-M, a 15B parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually pretrained from StarCoderPlus on 435B additional tokens, Aurora-M surpasses 2T tokens in total training token count. It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions, thus aligning its development not only with conventional red-teaming considerations, but also with the specific concerns articulated in the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. We evaluate Aurora-M across a wide range of tasks and languages, showcasing its robustness against catastrophic forgetting and its superior performance in multilingual settings, particularly in safety evaluations. We open-source Aurora-M and its variants to encourage responsible open-source development of large language models at https://huggingface.co/aurora-m.

Related papers

MR. Guard: Multilingual Reasoning Guardrail using Curriculum Learning [56.79292318645454]
Large Language Models (LLMs) are susceptible to adversarial attacks such as jailbreaking. This vulnerability is exacerbated in multilingual setting, where multilingual safety-aligned data are often limited. We propose an approach to build a multilingual guardrail with reasoning.
arXiv Detail & Related papers (2025-04-21T17:15:06Z)
DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails [12.621656255109546]
We propose a novel two-player Reinforcement Learning framework, where a generator and a guardrail model co-evolve adversarially to produce high-quality synthetic data for multilingual guardrail training. Empirical evaluations show that our model ours outperforms state-of-the-art models, achieving nearly 10% improvement over LlamaGuard3 on English benchmarks.
arXiv Detail & Related papers (2025-02-07T18:45:03Z)
Poro 34B and the Blessing of Multilinguality [3.270981284471548]
Poro 34B is a 34 billion parameter model trained for 1 trillion tokens of Finnish, English, and programming languages. We show that a multilingual training approach can produce a model that substantially advances over the capabilities of existing models for Finnish.
arXiv Detail & Related papers (2024-04-02T11:34:12Z)
TeenyTinyLlama: open-source tiny language models trained in Brazilian Portuguese [0.0]
Large language models (LLMs) have significantly advanced natural language processing, but their progress has yet to be equal across languages. In this study, we document the development of open-foundation models tailored for use in low-resource settings. This is the TeenyTinyLlama pair: two compact models for Brazilian Portuguese text generation.
arXiv Detail & Related papers (2024-01-30T00:25:54Z)
YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters. YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline. The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
Baichuan 2: Open Large-scale Language Models [51.56361715162972]
We present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch, on 2.6 trillion tokens. Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval.
arXiv Detail & Related papers (2023-09-19T04:13:22Z)
PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLMs) trained on 640 billion (B) tokens, avaliable in two model sizes: 1.7B and 13B. To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training. Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
GreenPLM: Cross-Lingual Transfer of Monolingual Pre-Trained Language Models at Almost No Cost [7.510253441699812]
This study proposes a framework called GreenPLM that uses bilingual lexicons to directly "translate" pre-trained language models into another language. We validate this approach in 18 languages' BERT models and show that this framework is comparable to, if not better than, other frameworks with high training costs. In six out of seven tested languages, this framework outperforms the original monolingual language models with up to 200x less pre-training efforts.
arXiv Detail & Related papers (2022-11-13T18:59:15Z)
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model [264.96498474333697]
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. We present BLOOM, a 176B- parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages.
arXiv Detail & Related papers (2022-11-09T18:48:09Z)
Generalizing Multimodal Pre-training into Multilingual via Language Acquisition [54.69707237195554]
English-based Vision-Language Pre-training has achieved great success in various downstream tasks. Some efforts have been taken to generalize this success to non-English languages through Multilingual Vision-Language Pre-training. We propose a textbfMultitextbfLingual textbfAcquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model into multilingual.
arXiv Detail & Related papers (2022-05-29T08:53:22Z)
Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages [12.00637655338665]
We study very low-resource languages and handle 50 African languages, many of which are not covered by any other model. We train sentence encoders, mine bitexts, and validate the bitexts by training NMT systems. For these languages, we train sentence encoders, mine bitexts, and validate the bitexts by training NMT systems.
arXiv Detail & Related papers (2022-05-25T10:53:24Z)
Multilingual Translation with Extensible Multilingual Pretraining and Finetuning [77.33262578776291]
Previous work has demonstrated that machine translation systems can be created by finetuning on bitext. We show that multilingual translation models can be created through multilingual finetuning. We demonstrate that pretrained models can be extended to incorporate additional languages without loss of performance.
arXiv Detail & Related papers (2020-08-02T05:36:55Z)
From English To Foreign Languages: Transferring Pre-trained Language Models [0.12691047660244334]
Pre-trained models have demonstrated their effectiveness in many downstream natural language processing (NLP) tasks. The availability of multilingual pre-trained models enables zero-shot transfer of NLP tasks from high resource languages to low resource ones. We tackle the problem of transferring an existing pre-trained model from English to other languages under a limited computational budget.
arXiv Detail & Related papers (2020-02-18T00:22:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.