Related papers: Opening up ChatGPT: Tracking openness, transparency, and accountability in instruction-tuned text generators

Opening up ChatGPT: Tracking openness, transparency, and accountability in instruction-tuned text generators

URL: http://arxiv.org/abs/2307.05532v1
Date: Sat, 8 Jul 2023 07:08:20 GMT
Title: Opening up ChatGPT: Tracking openness, transparency, and accountability in instruction-tuned text generators
Authors: Andreas Liesenfeld, Alianda Lopez, Mark Dingemanse
Abstract summary: We evaluate projects in terms of openness of code, training data, model weights, RLHF data, licensing, scientific documentation, and access methods. We find that while there is a fast-growing list of projects billing themselves as 'open source', many inherit undocumented data of dubious legality. Degrees of openness are relevant to fairness and accountability at all points.
Score: 0.11470070927586018
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models that exhibit instruction-following behaviour represent one of the biggest recent upheavals in conversational interfaces, a trend in large part fuelled by the release of OpenAI's ChatGPT, a proprietary large language model for text generation fine-tuned through reinforcement learning from human feedback (LLM+RLHF). We review the risks of relying on proprietary software and survey the first crop of open-source projects of comparable architecture and functionality. The main contribution of this paper is to show that openness is differentiated, and to offer scientific documentation of degrees of openness in this fast-moving field. We evaluate projects in terms of openness of code, training data, model weights, RLHF data, licensing, scientific documentation, and access methods. We find that while there is a fast-growing list of projects billing themselves as 'open source', many inherit undocumented data of dubious legality, few share the all-important instruction-tuning (a key site where human annotation labour is involved), and careful scientific documentation is exceedingly rare. Degrees of openness are relevant to fairness and accountability at all points, from data collection and curation to model architecture, and from training and fine-tuning to release and deployment.

Related papers

Comprehensive Analysis of Transparency and Accessibility of ChatGPT, DeepSeek, And other SoTA Large Language Models [2.6900047294457683]
Despite increasing discussions on open-source Artificial Intelligence (AI), existing research lacks a discussion on the transparency and accessibility of state-of-the-art (SoTA) Large Language Models (LLMs) This study critically analyzes SoTA LLMs from the last five years, including ChatGPT, DeepSeek, LLaMA, and others, to assess their adherence to transparency standards and the implications of partial openness. Our findings reveal that while some models are labeled as open-source, this does not necessarily mean they are fully open-sourced.
arXiv Detail & Related papers (2025-02-21T23:53:13Z)
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [70.72097493954067]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems. While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs remain limited. We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z)
OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models [61.14336781917986]
We introduce OpenR, an open-source framework for enhancing the reasoning capabilities of large language models (LLMs) OpenR unifies data acquisition, reinforcement learning training, and non-autoregressive decoding into a cohesive software platform. Our work is the first to provide an open-source framework that explores the core techniques of OpenAI's o1 model with reinforcement learning.
arXiv Detail & Related papers (2024-10-12T23:42:16Z)
In-Context Code-Text Learning for Bimodal Software Engineering [26.0027882745058]
Bimodal software analysis initially appeared to be within reach with the advent of large language models. We postulate that in-context learning for the code-text bimodality is a promising avenue. We consider a diverse dataset encompassing 23 software engineering tasks, which we transform in an in-context learning format.
arXiv Detail & Related papers (2024-10-08T19:42:00Z)
Building and better understanding vision-language models: insights and future directions [8.230565679484128]
This paper provides a comprehensive overview of the current state-of-the-art approaches to vision-language models. We then walk through the practical steps to build Idefics3-8B, a powerful VLM that significantly outperforms its predecessor Idefics2-8B. We release the model along with the datasets created for its training.
arXiv Detail & Related papers (2024-08-22T17:47:24Z)
Query of CC: Unearthing Large Scale Domain-Specific Knowledge from Public Corpora [104.16648246740543]
We propose an efficient data collection method based on large language models. The method bootstraps seed information through a large language model and retrieves related data from public corpora. It not only collects knowledge-related data for specific domains but unearths the data with potential reasoning procedures.
arXiv Detail & Related papers (2024-01-26T03:38:23Z)
SoTaNa: The Open-Source Software Development Assistant [81.86136560157266]
SoTaNa is an open-source software development assistant. It generates high-quality instruction-based data for the domain of software engineering. It employs a parameter-efficient fine-tuning approach to enhance the open-source foundation model, LLaMA.
arXiv Detail & Related papers (2023-08-25T14:56:21Z)
Generate rather than Retrieve: Large Language Models are Strong Context Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators. We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextutal documents based on a given question, and then reads the generated documents to produce the final answer.
arXiv Detail & Related papers (2022-09-21T01:30:59Z)
A Survey on Open Information Extraction from Rule-based Model to Large Language Model [29.017823043117144]
Open Information Extraction (OpenIE) represents a crucial NLP task aimed at deriving structured information from unstructured text. This survey paper provides an overview of OpenIE technologies spanning from 2007 to 2024, emphasizing a chronological perspective. The paper categorizes OpenIE approaches into rule-based, neural, and pre-trained large language models, discussing each within a chronological framework.
arXiv Detail & Related papers (2022-08-18T08:03:45Z)
ENT-DESC: Entity Description Generation by Exploring Knowledge Graph [53.03778194567752]
In practice, the input knowledge could be more than enough, since the output description may only cover the most significant knowledge. We introduce a large-scale and challenging dataset to facilitate the study of such a practical scenario in KG-to-text. We propose a multi-graph structure that is able to represent the original graph information more comprehensively.
arXiv Detail & Related papers (2020-04-30T14:16:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.