BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
- URL: http://arxiv.org/abs/2211.05100v4
- Date: Tue, 27 Jun 2023 09:57:58 GMT
- Title: BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
- Authors: BigScience Workshop: Teven Le Scao, Angela Fan, Christopher Akiki,
Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra
Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander
M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas
Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral,
Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, Iz
Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor
Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell,
Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor Soroa, Alham Fikri Aji, Amit
Alfassy, Anna Rogers, Ariel Kreisberg Nitzav, Canwen Xu, Chenghao Mou, Chris
Emezue, Christopher Klamm, Colin Leong, Daniel van Strien, David Ifeoluwa
Adelani, Dragomir Radev, Eduardo González Ponferrada, Efrat Levkovizh,
Ethan Kim, Eyal Bar Natan, Francesco De Toni, Gérard Dupont, Germán
Kruszewski, Giada Pistilli, Hady Elsahar, Hamza Benyamina, Hieu Tran, Ian Yu,
Idris Abdulmumin, Isaac Johnson, Itziar Gonzalez-Dios, Javier de la Rosa,
Jenny Chim, Jesse Dodge, Jian Zhu, Jonathan Chang, Jörg Frohberg, Joseph
Tobing, Joydeep Bhattacharjee, Khalid Almubarak, Kimbo Chen, Kyle Lo, Leandro
Von Werra, Leon Weber, Long Phan, Loubna Ben allal, Ludovic Tanguy, Manan
Dey, Manuel Romero Muñoz, Maraim Masoud, María Grandury, Mario
Šaško, Max Huang, Maximin Coavoux, Mayank Singh, Mike Tian-Jian
Jiang, Minh Chien Vu, Mohammad A. Jauhar, Mustafa Ghaleb, Nishant Subramani,
Nora Kassner, Nurulaqilla Khamis, Olivier Nguyen, Omar Espejel, Ona de
Gibert, Paulo Villegas, Peter Henderson, Pierre Colombo, Priscilla Amuok,
Quentin Lhoest, Rheza Harliman, Rishi Bommasani, Roberto Luis López, Rui
Ribeiro, Salomey Osei, Sampo Pyysalo, Sebastian Nagel, Shamik Bose,
Shamsuddeen Hassan Muhammad, Shanya Sharma, Shayne Longpre, Somaieh Nikpoor,
Stanislav Silberberg, Suhas Pai, Sydney Zink, Tiago Timponi Torrent, Timo
Schick, Tristan Thrush, Valentin Danchev, Vassilina Nikoulina, Veronika
Laippala, Violette Lepercq, Vrinda Prabhu, Zaid Alyafeai, Zeerak Talat, Arun
Raja, Benjamin Heinzerling, Chenglei Si, Davut Emre Taşar, Elizabeth
Salesky, Sabrina J. Mielke, Wilson Y. Lee, Abheesht Sharma, Andrea Santilli,
Antoine Chaffin, Arnaud Stiegler, Debajyoti Datta, Eliza Szczechla, Gunjan
Chhablani, Han Wang, Harshit Pandey, Hendrik Strobelt, Jason Alan Fries, Jos
Rozen, Leo Gao, Lintang Sutawika, M Saiful Bari, Maged S. Al-shaibani, Matteo
Manica, Nihal Nayak, Ryan Teehan, Samuel Albanie, Sheng Shen, Srulik
Ben-David, Stephen H. Bach, Taewoon Kim, Tali Bers, Thibault Fevry, Trishala
Neeraj, Urmish Thakker, Vikas Raunak, Xiangru Tang, Zheng-Xin Yong, Zhiqing
Sun, Shaked Brody, Yallow Uri, Hadar Tojarieh, Adam Roberts, Hyung Won Chung,
Jaesung Tae, Jason Phang, Ofir Press, Conglong Li, Deepak Narayanan, Hatim
Bourfoune, Jared Casper, Jeff Rasley, Max Ryabinin, Mayank Mishra, Minjia
Zhang, Mohammad Shoeybi, Myriam Peyrounette, Nicolas Patry, Nouamane Tazi,
Omar Sanseviero, Patrick von Platen, Pierre Cornette, Pierre François
Lavallée, Rémi Lacroix, Samyam Rajbhandari, Sanchit Gandhi, Shaden Smith,
Stéphane Requena, Suraj Patil, Tim Dettmers, Ahmed Baruwa, Amanpreet Singh,
Anastasia Cheveleva, Anne-Laure Ligozat, Arjun Subramonian, Aurélie
Névéol, Charles Lovering, Dan Garrette, Deepak Tunuguntla, Ehud Reiter,
Ekaterina Taktasheva, Ekaterina Voloshina, Eli Bogdanov, Genta Indra Winata,
Hailey Schoelkopf, Jan-Christoph Kalo, Jekaterina Novikova, Jessica Zosa
Forde, Jordan Clive, Jungo Kasai, Ken Kawamura, Liam Hazan, Marine Carpuat,
Miruna Clinciu, Najoung Kim, Newton Cheng, Oleg Serikov, Omer Antverg, Oskar
van der Wal, Rui Zhang, Ruochen Zhang, Sebastian Gehrmann, Shachar Mirkin,
Shani Pais, Tatiana Shavrina, Thomas Scialom, Tian Yun, Tomasz Limisiewicz,
Verena Rieser, Vitaly Protasov, Vladislav Mikhailov, Yada Pruksachatkun,
Yonatan Belinkov, Zachary Bamberger, Zdeněk Kasner, Alice Rueda, Amanda
Pestana, Amir Feizpour, Ammar Khan, Amy Faranak, Ana Santos, Anthony Hevia,
Antigona Unldreaj, Arash Aghagol, Arezoo Abdollahi, Aycha Tammour, Azadeh
HajiHosseini, Bahareh Behroozi, Benjamin Ajibade, Bharat Saxena, Carlos
Muñoz Ferrandis, Daniel McDuff, Danish Contractor, David Lansky, Davis
David, Douwe Kiela, Duong A. Nguyen, Edward Tan, Emi Baylor, Ezinwanne
Ozoani, Fatima Mirza, Frankline Ononiwu, Habib Rezanejad, Hessie Jones,
Indrani Bhattacharya, Irene Solaiman, Irina Sedenko, Isar Nejadgholi, Jesse
Passmore, Josh Seltzer, Julio Bonis Sanz, Livia Dutra, Mairon Samagaio,
Maraim Elbadri, Margot Mieskes, Marissa Gerchick, Martha Akinlolu, Michael
McKenna, Mike Qiu, Muhammed Ghauri, Mykola Burynok, Nafis Abrar, Nazneen
Rajani, Nour Elkott, Nour Fahmy, Olanrewaju Samuel, Ran An, Rasmus Kromann,
Ryan Hao, Samira Alizadeh, Sarmad Shubber, Silas Wang, Sourav Roy, Sylvain
Viguier, Thanh Le, Tobi Oyebade, Trieu Le, Yoyo Yang, Zach Nguyen, Abhinav
Ramesh Kashyap, Alfredo Palasciano, Alison Callahan, Anima Shukla, Antonio
Miranda-Escalada, Ayush Singh, Benjamin Beilharz, Bo Wang, Caio Brito, Chenxi
Zhou, Chirag Jain, Chuxin Xu, Clémentine Fourrier, Daniel León
Periñán, Daniel Molano, Dian Yu, Enrique Manjavacas, Fabio Barth, Florian
Fuhrimann, Gabriel Altay, Giyaseddin Bayrak, Gully Burns, Helena U. Vrabec,
Imane Bello, Ishani Dash, Jihyun Kang, John Giorgi, Jonas Golde, Jose David
Posada, Karthik Rangasai Sivaraman, Lokesh Bulchandani, Lu Liu, Luisa
Shinzato, Madeleine Hahn de Bykhovetz, Maiko Takeuchi, Marc Pàmies, Maria A
Castillo, Marianna Nezhurina, Mario Sänger, Matthias Samwald, Michael
Cullan, Michael Weinberg, Michiel De Wolf, Mina Mihaljcic, Minna Liu, Moritz
Freidank, Myungsun Kang, Natasha Seelam, Nathan Dahlberg, Nicholas Michio
Broad, Nikolaus Muellner, Pascale Fung, Patrick Haller, Ramya Chandrasekhar,
Renata Eisenberg, Robert Martin, Rodrigo Canalli, Rosaline Su, Ruisi Su,
Samuel Cahyawijaya, Samuele Garda, Shlok S Deshmukh, Shubhanshu Mishra, Sid
Kiblawi, Simon Ott, Sinee Sang-aroonsiri, Srishti Kumar, Stefan Schweter,
Sushil Bharati, Tanmay Laud, Théo Gigant, Tomoya Kainuma, Wojciech Kusa,
Yanis Labrak, Yash Shailesh Bajaj, Yash Venkatraman, Yifan Xu, Yingxin Xu, Yu
Xu, Zhe Tan, Zhongli Xie, Zifan Ye, Mathilde Bras, Younes Belkada, Thomas
Wolf
- Abstract summary: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions.
We present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers.
BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages.
- Score: 264.96498474333697
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have been shown to be able to perform new tasks
based on a few demonstrations or natural language instructions. While these
capabilities have led to widespread adoption, most LLMs are developed by
resource-rich organizations and are frequently kept from the public. As a step
towards democratizing this powerful technology, we present BLOOM, a
176B-parameter open-access language model designed and built thanks to a
collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer
language model that was trained on the ROOTS corpus, a dataset comprising
hundreds of sources in 46 natural and 13 programming languages (59 in total).
We find that BLOOM achieves competitive performance on a wide variety of
benchmarks, with stronger results after undergoing multitask prompted
finetuning. To facilitate future research and applications using LLMs, we
publicly release our models and code under the Responsible AI License.
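Because the models and code are publicly released, a small usage sketch follows. It assumes the Hugging Face `transformers` library and, so that it runs on a single machine, uses the small 560M-parameter checkpoint from the same BLOOM family rather than the full 176B model (which requires multi-GPU or offloaded inference); the checkpoint ID, prompt, and decoding settings are illustrative assumptions, not prescribed by the paper.

```python
# Minimal sketch (assumption: Hugging Face `transformers` is installed).
# "bigscience/bloom-560m" is a small checkpoint from the BLOOM family;
# swap in "bigscience/bloom" only with hardware that can hold 176B parameters.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# BLOOM is decoder-only, so generation is plain left-to-right decoding.
prompt = "BLOOM est un modèle de langue multilingue qui"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```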
Related papers
- Multilingual Large Language Models and Curse of Multilinguality [4.096453902709292]
Large Language Models (LLMs) have gained large popularity among Natural Language Processing (NLP) researchers and practitioners.
This paper navigates the landscape of multilingual LLMs, providing an introductory overview of their technical aspects.
It explains underlying architectures, objective functions, pre-training data sources, and tokenization methods.
arXiv Detail & Related papers (2024-06-15T11:31:39Z)
- LLMs Beyond English: Scaling the Multilingual Capability of LLMs with Cross-Lingual Feedback [61.23008372927665]
We introduce xLLMs-100, which scales the multilingual capabilities of LLaMA and BLOOM to 100 languages.
We evaluate the multilingual understanding and generating capabilities of xLLMs-100 on five multilingual benchmarks.
arXiv Detail & Related papers (2024-06-03T20:25:12Z)
- Tele-FLM Technical Report [96.19923831660266]
We introduce Tele-FLM (aka FLM-2), a 52B open-sourced multilingual large language model.
It features a stable, efficient pre-training paradigm and enhanced factual judgment capabilities.
It is comparable to strong open-sourced models that involve larger pre-training FLOPs, such as Llama2-70B and DeepSeek-67B.
arXiv Detail & Related papers (2024-04-25T14:34:47Z)
- Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages [0.0]
Large Language Models (LLMs) have shown incredible proficiency at natural language processing tasks.
LLMs often struggle to perform well on low-resource languages because there is so little training data available.
In this work, we explore training LLaMA-2 to speak Amharic, a language spoken by over 50 million people worldwide.
arXiv Detail & Related papers (2024-03-11T01:04:36Z)
- YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
- PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training.
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
- Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing a few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z)
- Investigating the Translation Performance of a Large Multilingual Language Model: the Case of BLOOM [8.858671209228536]
We focus on BLOOM's multilingual ability by evaluating its machine translation performance across several datasets.
We study several aspects including prompt design, model sizes, cross-lingual transfer and the use of discursive context.
arXiv Detail & Related papers (2023-03-03T13:23:42Z)
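The last two entries above study prompt design for translation: one assembles few-shot exemplars from high-resource languages, the other evaluates BLOOM's machine translation behaviour under different prompts. As an illustration only, a hedged sketch of few-shot translation prompting with a BLOOM checkpoint might look like the following; the exemplars, template, and checkpoint are assumptions made for this example, not the prompts used in those papers.

```python
# Hedged sketch of few-shot "X -> English" translation prompting with a small
# BLOOM checkpoint. Exemplars and prompt template are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# A few high-resource-language exemplars, then the sentence to translate.
exemplars = [
    ("French", "Le chat dort sur le canapé.", "The cat is sleeping on the sofa."),
    ("Spanish", "Mañana iremos al mercado.", "Tomorrow we will go to the market."),
]
prompt = ""
for lang, source, target in exemplars:
    prompt += f"{lang}: {source}\nEnglish: {target}\n\n"
prompt += "Portuguese: Gosto de ler livros.\nEnglish:"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
# Decode only the newly generated tokens (generate() returns prompt + continuation).
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```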