Opening up ChatGPT: Tracking openness, transparency, and accountability
in instruction-tuned text generators
- URL: http://arxiv.org/abs/2307.05532v1
- Date: Sat, 8 Jul 2023 07:08:20 GMT
- Title: Opening up ChatGPT: Tracking openness, transparency, and accountability
in instruction-tuned text generators
- Authors: Andreas Liesenfeld, Alianda Lopez, Mark Dingemanse
- Abstract summary: We evaluate projects in terms of openness of code, training data, model weights, RLHF data, licensing, scientific documentation, and access methods.
We find that while there is a fast-growing list of projects billing themselves as 'open source', many inherit undocumented data of dubious legality.
Degrees of openness are relevant to fairness and accountability at all points.
- Score: 0.11470070927586018
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models that exhibit instruction-following behaviour represent
one of the biggest recent upheavals in conversational interfaces, a trend in
large part fuelled by the release of OpenAI's ChatGPT, a proprietary large
language model for text generation fine-tuned through reinforcement learning
from human feedback (LLM+RLHF). We review the risks of relying on proprietary
software and survey the first crop of open-source projects of comparable
architecture and functionality. The main contribution of this paper is to show
that openness is differentiated, and to offer scientific documentation of
degrees of openness in this fast-moving field. We evaluate projects in terms of
openness of code, training data, model weights, RLHF data, licensing,
scientific documentation, and access methods. We find that while there is a
fast-growing list of projects billing themselves as 'open source', many inherit
undocumented data of dubious legality, few share the all-important
instruction-tuning (a key site where human annotation labour is involved), and
careful scientific documentation is exceedingly rare. Degrees of openness are
relevant to fairness and accountability at all points, from data collection and
curation to model architecture, and from training and fine-tuning to release
and deployment.
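The seven evaluation dimensions named in the abstract can be pictured as a per-project record. The following is only a minimal sketch, not the authors' method: the three-level `Openness` scale, the numeric values, the `ProjectAssessment` class, and the summed `score` are all assumptions introduced for illustration. Only the dimension names come from the abstract.

```python
from dataclasses import dataclass, field
from enum import Enum


class Openness(Enum):
    # Hypothetical three-level scale (an assumption); the abstract only
    # states that openness is "differentiated".
    CLOSED = 0.0
    PARTIAL = 0.5
    OPEN = 1.0


# The seven dimensions listed in the abstract.
DIMENSIONS = (
    "code",
    "training_data",
    "model_weights",
    "rlhf_data",
    "licensing",
    "scientific_documentation",
    "access_methods",
)


@dataclass
class ProjectAssessment:
    """Illustrative record of one project's openness per dimension."""
    name: str
    ratings: dict = field(default_factory=dict)  # dimension -> Openness

    def score(self) -> float:
        # Sum over all dimensions; unrated dimensions count as closed.
        # Summation is an illustrative choice, not taken from the paper.
        return sum(self.ratings.get(d, Openness.CLOSED).value for d in DIMENSIONS)


def rank(projects):
    """Order projects from most to least open under the assumed score."""
    return sorted(projects, key=lambda p: p.score(), reverse=True)
```

Keeping each dimension as a separate rating, rather than a single open/closed flag, mirrors the abstract's point that a project can be open in one respect (say, code) and closed in another (say, RLHF data).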
Related papers
- Using Large Language Models to Enrich the Documentation of Datasets for Machine Learning [1.8270184406083445]
We explore using large language models (LLM) and prompting strategies to automatically extract dimensions from documents.
Our approach could aid data publishers and practitioners in creating machine-readable documentation.
We have released an open-source tool implementing our approach and a replication package, including the experiments' code and results.
arXiv Detail & Related papers (2024-04-04T10:09:28Z)
- OLMo: Accelerating the Science of Language Models [165.16277690540363]
Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings.
As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces.
We believe it is essential for the research community to have access to powerful, truly open LMs.
We have built OLMo, a competitive, truly Open Language Model, to enable the scientific study of language models.
arXiv Detail & Related papers (2024-02-01T18:28:55Z)
- Query of CC: Unearthing Large Scale Domain-Specific Knowledge from Public Corpora [104.16648246740543]
We propose an efficient data collection method based on large language models.
The method bootstraps seed information through a large language model and retrieves related data from public corpora.
It not only collects knowledge-related data for specific domains but unearths the data with potential reasoning procedures.
arXiv Detail & Related papers (2024-01-26T03:38:23Z)
- Semi-Structured Chain-of-Thought: Integrating Multiple Sources of Knowledge for Improved Language Model Reasoning [10.839645156881573]
We introduce a novel semi-structured prompting approach that seamlessly integrates the model's parametric memory with unstructured knowledge from text documents and structured knowledge from knowledge graphs.
Experimental results on open-domain multi-hop question answering datasets demonstrate that our prompting method significantly surpasses existing techniques.
arXiv Detail & Related papers (2023-11-14T19:53:53Z)
- SoTaNa: The Open-Source Software Development Assistant [81.86136560157266]
SoTaNa is an open-source software development assistant.
It generates high-quality instruction-based data for the domain of software engineering.
It employs a parameter-efficient fine-tuning approach to enhance the open-source foundation model, LLaMA.
arXiv Detail & Related papers (2023-08-25T14:56:21Z)
- Generate rather than Retrieve: Large Language Models are Strong Context Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer.
arXiv Detail & Related papers (2022-09-21T01:30:59Z)
- A Survey on Open Information Extraction from Rule-based Model to Large Language Model [29.017823043117144]
Open Information Extraction (OpenIE) represents a crucial NLP task aimed at deriving structured information from unstructured text.
This survey paper provides an overview of OpenIE technologies spanning from 2007 to 2024, emphasizing a chronological perspective.
The paper categorizes OpenIE approaches into rule-based, neural, and pre-trained large language models, discussing each within a chronological framework.
arXiv Detail & Related papers (2022-08-18T08:03:45Z)
- OpenFed: A Comprehensive and Versatile Open-Source Federated Learning Framework [5.893286029670115]
We propose OpenFed, an open-source software framework for end-to-end Federated Learning.
For researchers, OpenFed provides a framework wherein new methods can be easily implemented and fairly evaluated.
For downstream users, OpenFed makes Federated Learning plug-and-play within different subject-matter contexts.
arXiv Detail & Related papers (2021-09-16T10:31:59Z)
- KILT: a Benchmark for Knowledge Intensive Language Tasks [102.33046195554886]
We present a benchmark for knowledge-intensive language tasks (KILT).
All tasks in KILT are grounded in the same snapshot of Wikipedia.
We find that a shared dense vector index coupled with a seq2seq model is a strong baseline.
arXiv Detail & Related papers (2020-09-04T15:32:19Z)
- ENT-DESC: Entity Description Generation by Exploring Knowledge Graph [53.03778194567752]
In practice, the input knowledge could be more than enough, since the output description may only cover the most significant knowledge.
We introduce a large-scale and challenging dataset to facilitate the study of such a practical scenario in KG-to-text.
We propose a multi-graph structure that is able to represent the original graph information more comprehensively.
arXiv Detail & Related papers (2020-04-30T14:16:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.