h2oGPT: Democratizing Large Language Models
- URL: http://arxiv.org/abs/2306.08161v2
- Date: Fri, 16 Jun 2023 17:48:22 GMT
- Title: h2oGPT: Democratizing Large Language Models
- Authors: Arno Candel, Jon McKinney, Philipp Singer, Pascal Pfeiffer, Maximilian
Jeblick, Prithvi Prabhu, Jeff Gambera, Mark Landry, Shivam Bansal, Ryan
Chesler, Chun Ming Lee, Marcos V. Conde, Pasha Stetsenko, Olivier Grellier,
SriSatish Ambati
- Abstract summary: We introduce h2oGPT, a suite of open-source code repositories for the creation and use of Large Language Models.
The goal of this project is to create the world's best truly open-source alternative to closed-source approaches.
- Score: 1.8043055303852882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Applications built on top of Large Language Models (LLMs) such as GPT-4
represent a revolution in AI due to their human-level capabilities in natural
language processing. However, they also pose many significant risks such as the
presence of biased, private, or harmful text, and the unauthorized inclusion of
copyrighted material.
We introduce h2oGPT, a suite of open-source code repositories for the
creation and use of LLMs based on Generative Pretrained Transformers (GPTs).
The goal of this project is to create the world's best truly open-source
alternative to closed-source approaches. In collaboration with and as part of
the incredible and unstoppable open-source community, we open-source several
fine-tuned h2oGPT models from 7 to 40 Billion parameters, ready for commercial
use under fully permissive Apache 2.0 licenses. Included in our release is
100% private document search using natural language.
Open-source language models help boost AI development and make it more
accessible and trustworthy. They lower entry hurdles, allowing people and
groups to tailor these models to their needs. This openness increases
innovation, transparency, and fairness. An open-source strategy is needed to
share AI benefits fairly, and H2O.ai will continue to democratize AI and LLMs.
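The abstract above highlights fully local, "100% private" document search with natural-language queries over the released open models. As a rough sketch only (not the pipeline from the paper), the snippet below pairs an off-the-shelf sentence-embedding model with a small open h2oGPT checkpoint from the Hugging Face Hub; the model ids, prompt format, and single-document retrieval are illustrative assumptions.

```python
# Minimal retrieve-then-generate sketch of fully local document search.
# Everything runs on the local machine, so no document text leaves it.
# Model ids below are illustrative assumptions, not the paper's exact setup.
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

documents = [
    "h2oGPT releases fine-tuned models from 7B to 40B parameters under Apache 2.0.",
    "Retrieval-augmented generation grounds an LLM's answers in your own documents.",
]

# 1) Embed the documents once, locally.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

# 2) Embed the question and retrieve the most similar document.
question = "Under what license are the h2oGPT models released?"
q_vector = embedder.encode([question], normalize_embeddings=True)[0]
best_doc = documents[int(np.argmax(doc_vectors @ q_vector))]

# 3) Ask a locally loaded open model to answer from the retrieved context.
generator = pipeline(
    "text-generation",
    model="h2oai/h2ogpt-oig-oasst1-512-6.9b",  # assumed checkpoint id; any local causal LM works
)
prompt = f"Context: {best_doc}\nQuestion: {question}\nAnswer:"
print(generator(prompt, max_new_tokens=64)[0]["generated_text"])
```

The h2oGPT repositories provide a far more complete pipeline (UI, document ingestion, vector stores); the sketch only gestures at the core retrieve-then-generate loop they build on.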
Related papers
- Free to play: UN Trade and Development's experience with developing its own open-source Retrieval Augmented Generation Large Language Model application [0.0]
UNCTAD has explored and developed its own open-source Retrieval Augmented Generation (RAG) LLM application.
RAG makes Large Language Models aware of and more useful for the organization's domain and work.
Three libraries developed to produce the app, nlp_pipeline for document processing and statistical analysis, local_rag_llm for running a local RAG LLM, and streamlit_rag for the user interface, are publicly available on PyPI and GitHub with Dockerfiles.
arXiv Detail & Related papers (2024-06-18T14:23:54Z) - Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order [123.7406091753529]
Aurora-M is a 15B parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code.
It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions.
It is rigorously evaluated across various tasks and languages, demonstrating robustness against catastrophic forgetting.
arXiv Detail & Related papers (2024-03-30T15:38:54Z) - Is open source software culture enough to make AI a common ? [0.0]
Language models (LMs) are increasingly deployed in the field of artificial intelligence (AI).
The question arises as to whether they can be a common resource managed and maintained by a community of users.
We highlight the potential benefits of treating the data and resources needed to create LMs as commons.
arXiv Detail & Related papers (2024-03-19T14:43:52Z) - TeenyTinyLlama: open-source tiny language models trained in Brazilian Portuguese [0.0]
Large language models (LLMs) have significantly advanced natural language processing, but their progress has not been equal across languages.
In this study, we document the development of open-foundation models tailored for use in low-resource settings.
This is the TeenyTinyLlama pair: two compact models for Brazilian Portuguese text generation.
arXiv Detail & Related papers (2024-01-30T00:25:54Z) - YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z) - GlotLID: Language Identification for Low-Resource Languages [51.38634652914054]
GlotLID-M is an LID model that satisfies the desiderata of wide coverage, reliability and efficiency.
It identifies 1665 languages, a large increase in coverage compared to prior work.
arXiv Detail & Related papers (2023-10-24T23:45:57Z) - H2O Open Ecosystem for State-of-the-art Large Language Models [10.04351591653126]
Large Language Models (LLMs) represent a revolution in AI.
They also pose many significant risks, such as the presence of biased, private, copyrighted or harmful text.
We introduce a complete open-source ecosystem for developing and testing LLMs.
arXiv Detail & Related papers (2023-10-17T09:40:58Z) - On the Safety of Open-Sourced Large Language Models: Does Alignment
Really Prevent Them From Being Misused? [49.99955642001019]
We show that open-sourced, aligned large language models can easily be misguided into generating undesired content.
Our key idea is to directly manipulate the generation process of open-sourced LLMs in order to steer them toward such content.
arXiv Detail & Related papers (2023-10-02T19:22:01Z) - Open-Sourcing Highly Capable Foundation Models: An evaluation of risks,
benefits, and alternative methods for pursuing open-source objectives [6.575445633821399]
Recent decisions by leading AI labs to either open-source their models or to restrict access to them have sparked debate.
This paper offers an examination of the risks and benefits of open-sourcing highly capable foundation models.
arXiv Detail & Related papers (2023-09-29T17:03:45Z) - Baichuan 2: Open Large-scale Language Models [51.56361715162972]
We present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch on 2.6 trillion tokens.
Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval.
arXiv Detail & Related papers (2023-09-19T04:13:22Z) - A Systematic Evaluation of Large Language Models of Code [88.34057460577957]
Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions.
The current state-of-the-art code LMs are not publicly available, leaving many questions about their model and data design decisions.
Although Codex is not open-source, we find that existing open-source models do achieve close results in some programming languages.
We release a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture, which was trained on 249GB of code across 12 programming languages on a single machine.
arXiv Detail & Related papers (2022-02-26T15:53:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.