BigScience: A Case Study in the Social Construction of a Multilingual
Large Language Model
- URL: http://arxiv.org/abs/2212.04960v1
- Date: Fri, 9 Dec 2022 16:15:35 GMT
- Title: BigScience: A Case Study in the Social Construction of a Multilingual
Large Language Model
- Authors: Christopher Akiki and Giada Pistilli and Margot Mieskes and Matthias
Gallé and Thomas Wolf and Suzana Ilić and Yacine Jernite
- Abstract summary: The BigScience Workshop was a value-driven initiative that spanned one and a half years of interdisciplinary research.
This paper focuses on the collaborative research aspects of BigScience and takes a step back to look at the challenges of large-scale participatory research.
- Score: 11.366450629112459
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The BigScience Workshop was a value-driven initiative that spanned
one and a half years of interdisciplinary research and culminated in the creation of
ROOTS, a 1.6TB multilingual dataset that was used to train BLOOM, one of the
largest multilingual language models to date. In addition to the technical
outcomes and artifacts, the workshop fostered multidisciplinary collaborations
around large models, datasets, and their analysis. This in turn led to a wide
range of research publications spanning topics from ethics to law, data
governance, modeling choices and distributed training. This paper focuses on
the collaborative research aspects of BigScience and takes a step back to look
at the challenges of large-scale participatory research, with respect to
participant diversity and the tasks required to successfully carry out such a
project. Our main goal is to share the lessons we learned from this experience,
what we could have done better and what we did well. We show how the impact of
such a social approach to scientific research goes well beyond the technical
artifacts that were the basis of its inception.
Related papers
- What is the Role of Large Language Models in the Evolution of Astronomy Research? [0.0]
ChatGPT and other state-of-the-art large language models (LLMs) are rapidly transforming multiple fields.
These models, commonly trained on vast datasets, exhibit human-like text generation capabilities.
arXiv Detail & Related papers (2024-09-30T12:42:25Z)
- MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
This dataset includes figures such as schematic diagrams, simulated images, macroscopic/microscopic photos, and experimental visualizations.
We developed benchmarks for scientific figure captioning and multiple-choice questions, evaluating six proprietary and over ten open-source models.
The dataset and benchmarks will be released to support further research.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
- A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery [68.48094108571432]
Large language models (LLMs) have revolutionized the way text and other modalities of data are handled.
We aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs.
arXiv Detail & Related papers (2024-06-16T08:03:24Z)
- MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows [58.56005277371235]
We introduce MASSW, a comprehensive text dataset on Multi-Aspect Summarization of Scientific Workflows.
MASSW includes more than 152,000 peer-reviewed publications from 17 leading computer science conferences spanning the past 50 years.
We demonstrate the utility of MASSW through multiple novel machine-learning tasks that can be benchmarked using this new dataset.
arXiv Detail & Related papers (2024-06-10T15:19:09Z)
- Efficient Large Language Models: A Survey [45.39970635367852]
This survey provides a systematic and comprehensive review of research on efficient Large Language Models.
We organize the literature in a taxonomy consisting of three main categories, covering distinct yet interconnected efficient LLMs topics.
We have also created a GitHub repository where we organize the papers featured in this survey.
arXiv Detail & Related papers (2023-12-06T19:18:42Z)
- A Comprehensive Overview of Large Language Models [68.22178313875618]
Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks.
This article provides an overview of the existing literature on a broad range of LLM-related concepts.
arXiv Detail & Related papers (2023-07-12T20:01:52Z)
- The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset [36.98035382552118]
The BigScience workshop was formed with the goal of researching and training large language models as a values-driven undertaking.
This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus.
arXiv Detail & Related papers (2023-03-07T14:25:44Z)
- Industry-Academia Research Collaboration in Software Engineering: The Certus Model [13.021014899410684]
Building scalable and effective research collaborations in software engineering is known to be challenging.
This paper aims to identify the elements of a successful industry-academia collaboration that enable a culture of participative knowledge creation.
arXiv Detail & Related papers (2022-04-23T10:16:23Z)
- Explaining Relationships Between Scientific Documents [55.23390424044378]
We address the task of explaining relationships between two scientific documents using natural language text.
In this paper we establish a dataset of 622K examples from 154K documents.
arXiv Detail & Related papers (2020-02-02T03:54:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.