h2oGPT: Democratizing Large Language Models
- URL: http://arxiv.org/abs/2306.08161v2
- Date: Fri, 16 Jun 2023 17:48:22 GMT
- Title: h2oGPT: Democratizing Large Language Models
- Authors: Arno Candel, Jon McKinney, Philipp Singer, Pascal Pfeiffer, Maximilian
Jeblick, Prithvi Prabhu, Jeff Gambera, Mark Landry, Shivam Bansal, Ryan
Chesler, Chun Ming Lee, Marcos V. Conde, Pasha Stetsenko, Olivier Grellier,
SriSatish Ambati
- Abstract summary: We introduce h2oGPT, a suite of open-source code repositories for the creation and use of Large Language Models.
The goal of this project is to create the world's best truly open-source alternative to closed-source approaches.
- Score: 1.8043055303852882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Applications built on top of Large Language Models (LLMs) such as GPT-4
represent a revolution in AI due to their human-level capabilities in natural
language processing. However, they also pose many significant risks such as the
presence of biased, private, or harmful text, and the unauthorized inclusion of
copyrighted material.
We introduce h2oGPT, a suite of open-source code repositories for the
creation and use of LLMs based on Generative Pretrained Transformers (GPTs).
The goal of this project is to create the world's best truly open-source
alternative to closed-source approaches. In collaboration with and as part of
the incredible and unstoppable open-source community, we open-source several
fine-tuned h2oGPT models from 7 to 40 Billion parameters, ready for commercial
use under fully permissive Apache 2.0 licenses. Included in our release is
100% private document search using natural language.
Open-source language models help boost AI development and make it more
accessible and trustworthy. They lower entry hurdles, allowing people and
groups to tailor these models to their needs. This openness increases
innovation, transparency, and fairness. An open-source strategy is needed to
share AI benefits fairly, and H2O.ai will continue to democratize AI and LLMs.
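The abstract above highlights fully local, "100% private" document search with natural-language queries over the released open models. As a rough sketch only (not the pipeline from the paper), the snippet below pairs an off-the-shelf sentence-embedding model with a small open h2oGPT checkpoint from the Hugging Face Hub; the model ids, prompt format, and single-document retrieval are illustrative assumptions.

```python
# Minimal retrieve-then-generate sketch of fully local document search.
# Everything runs on the local machine, so no document text leaves it.
# Model ids below are illustrative assumptions, not the paper's exact setup.
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

documents = [
    "h2oGPT releases fine-tuned models from 7B to 40B parameters under Apache 2.0.",
    "Retrieval-augmented generation grounds an LLM's answers in your own documents.",
]

# 1) Embed the documents once, locally.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

# 2) Embed the question and retrieve the most similar document.
question = "Under what license are the h2oGPT models released?"
q_vector = embedder.encode([question], normalize_embeddings=True)[0]
best_doc = documents[int(np.argmax(doc_vectors @ q_vector))]

# 3) Ask a locally loaded open model to answer from the retrieved context.
generator = pipeline(
    "text-generation",
    model="h2oai/h2ogpt-oig-oasst1-512-6.9b",  # assumed checkpoint id; any local causal LM works
)
prompt = f"Context: {best_doc}\nQuestion: {question}\nAnswer:"
print(generator(prompt, max_new_tokens=64)[0]["generated_text"])
```

The h2oGPT repositories provide a far more complete pipeline (UI, document ingestion, vector stores); the sketch only gestures at the core retrieve-then-generate loop they build on.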
Related papers
- Free to play: UN Trade and Development's experience with developing its own open-source Retrieval Augmented Generation Large Language Model application [0.0]
UNCTAD has explored and developed its own open-source Retrieval Augmented Generation (RAG) LLM application.
RAG makes Large Language Models aware of and more useful for the organization's domain and work.
Three libraries developed to produce the app, nlp_pipeline for document processing and statistical analysis, local_rag_llm for running a local RAG LLM, and streamlit_rag for the user interface, are publicly available on PyPI and GitHub with Dockerfiles.
arXiv Detail & Related papers (2024-06-18T14:23:54Z) - Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order [123.7406091753529]
Aurora-M is a 15B parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code.
It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions.
It is rigorously evaluated across various tasks and languages, demonstrating robustness against catastrophic forgetting.
arXiv Detail & Related papers (2024-03-30T15:38:54Z) - Is open source software culture enough to make AI a common ? [0.0]
Language models (LMs) are increasingly deployed in the field of artificial intelligence (AI).
The question arises as to whether they can be a common resource managed and maintained by a community of users.
We highlight the potential benefits of treating the data and resources needed to create LMs as commons.
arXiv Detail & Related papers (2024-03-19T14:43:52Z) - TeenyTinyLlama: open-source tiny language models trained in Brazilian Portuguese [0.0]
Large language models (LLMs) have significantly advanced natural language processing, but their progress has not been equal across languages.
In this study, we document the development of open-foundation models tailored for use in low-resource settings.
This is the TeenyTinyLlama pair: two compact models for Brazilian Portuguese text generation.
arXiv Detail & Related papers (2024-01-30T00:25:54Z) - YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z) - GlotLID: Language Identification for Low-Resource Languages [51.38634652914054]
GlotLID-M is an LID model that satisfies the desiderata of wide coverage, reliability and efficiency.
It identifies 1665 languages, a large increase in coverage compared to prior work.
arXiv Detail & Related papers (2023-10-24T23:45:57Z) - H2O Open Ecosystem for State-of-the-art Large Language Models [10.04351591653126]
Large Language Models (LLMs) represent a revolution in AI.
They also pose many significant risks, such as the presence of biased, private, copyrighted or harmful text.
We introduce a complete open-source ecosystem for developing and testing LLMs.
arXiv Detail & Related papers (2023-10-17T09:40:58Z) - On the Safety of Open-Sourced Large Language Models: Does Alignment
Really Prevent Them From Being Misused? [49.99955642001019]
We show that open-sourced, aligned large language models can easily be misguided into generating undesired content.
Our key idea is to directly manipulate the generation process of open-sourced LLMs in order to steer them toward such content.
arXiv Detail & Related papers (2023-10-02T19:22:01Z) - Open-Sourcing Highly Capable Foundation Models: An evaluation of risks,
benefits, and alternative methods for pursuing open-source objectives [6.575445633821399]
Recent decisions by leading AI labs to either open-source their models or to restrict access to them have sparked debate.
This paper offers an examination of the risks and benefits of open-sourcing highly capable foundation models.
arXiv Detail & Related papers (2023-09-29T17:03:45Z) - Baichuan 2: Open Large-scale Language Models [51.56361715162972]
We present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch on 2.6 trillion tokens.
Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval.
arXiv Detail & Related papers (2023-09-19T04:13:22Z) - A Systematic Evaluation of Large Language Models of Code [88.34057460577957]
Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions.
The current state-of-the-art code LMs are not publicly available, leaving many questions about their model and data design decisions.
Although Codex is not open-source, we find that existing open-source models do achieve close results in some programming languages.
We release a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture, which was trained on 249GB of code across 12 programming languages on a single machine.
arXiv Detail & Related papers (2022-02-26T15:53:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.