LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
- URL: http://arxiv.org/abs/2309.11998v4
- Date: Sun, 10 Mar 2024 19:34:57 GMT
- Title: LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
- Authors: Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang,
Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P. Xing, Joseph E.
Gonzalez, Ion Stoica, Hao Zhang
- Abstract summary: We introduce LMSYS-Chat-1M, a large-scale dataset containing one million real-world conversations with 25 state-of-the-art large language models (LLMs).
This dataset is collected from 210K unique IP addresses in the wild on our Vicuna demo and Chatbot Arena website.
We demonstrate its versatility through four use cases: developing content moderation models that perform similarly to GPT-4, building a safety benchmark, training instruction-following models that perform similarly to Vicuna, and creating challenging benchmark questions.
- Score: 75.9621305227523
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Studying how people interact with large language models (LLMs) in real-world
scenarios is increasingly important due to their widespread use in various
applications. In this paper, we introduce LMSYS-Chat-1M, a large-scale dataset
containing one million real-world conversations with 25 state-of-the-art LLMs.
This dataset is collected from 210K unique IP addresses in the wild on our
Vicuna demo and Chatbot Arena website. We offer an overview of the dataset's
content, including its curation process, basic statistics, and topic
distribution, highlighting its diversity, originality, and scale. We
demonstrate its versatility through four use cases: developing content
moderation models that perform similarly to GPT-4, building a safety benchmark,
training instruction-following models that perform similarly to Vicuna, and
creating challenging benchmark questions. We believe that this dataset will
serve as a valuable resource for understanding and advancing LLM capabilities.
The dataset is publicly available at
https://huggingface.co/datasets/lmsys/lmsys-chat-1m.
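For readers who want to explore the dataset, here is a minimal sketch of loading it with the Hugging Face `datasets` library. The dataset is gated, so this assumes its terms have been accepted on the Hub and local authentication is set up (e.g. via `huggingface-cli login`); the field names in the comments follow the dataset card but are worth verifying.

```python
# Minimal sketch: loading LMSYS-Chat-1M with the Hugging Face `datasets` library.
# The dataset is gated, so this assumes its terms have been accepted on the Hub
# and you are authenticated locally (e.g. `huggingface-cli login`).
from datasets import load_dataset

ds = load_dataset("lmsys/lmsys-chat-1m", split="train")

# Per the dataset card, each record is one conversation with fields such as
# "model", "conversation" (a list of {"role", "content"} turns), "language",
# and per-turn "openai_moderation" annotations (field names worth verifying).
example = ds[0]
print(example["model"])
for turn in example["conversation"]:
    print(f'{turn["role"]}: {turn["content"][:80]}')

# The moderation annotations underpin the content-moderation and safety-benchmark
# use cases above, e.g. counting conversations with any flagged turn in a sample:
flagged = sum(
    any(m.get("flagged") for m in rec["openai_moderation"])
    for rec in ds.select(range(1000))
)
print(flagged, "of 1000 sampled conversations contain a flagged turn")
```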
Related papers
- InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning [58.7966588457529]
InfiMM-WebMath-40B is a high-quality dataset of interleaved image-text documents.
It comprises 24 million web pages, 85 million associated image URLs, and 40 billion text tokens, all meticulously extracted and filtered from CommonCrawl.
Our evaluations on text-only benchmarks show that, despite utilizing only 40 billion tokens, our dataset significantly enhances the performance of our 1.3B model.
Our models set a new state-of-the-art among open-source models on multi-modal math benchmarks such as MathVerse and We-Math.
arXiv Detail & Related papers (2024-09-19T08:41:21Z)
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models [21.10890310571397]
Large Language Models (LLMs) can be applied to a diverse set of tasks, but the critical issues of data contamination and memorization are often glossed over.
This work introduces a variety of techniques to assess whether a language model has seen a dataset during training (one such test is sketched after this entry).
We then compare the few-shot learning performance of LLMs on datasets that were seen during training to the performance on datasets released after training.
arXiv Detail & Related papers (2024-04-09T10:58:21Z)
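To make the idea concrete, below is a minimal sketch of one membership test in this spirit: a row-completion check that asks whether the model can reproduce unseen rows of a canonically formatted dataset verbatim. The paper's actual procedures differ in detail, and `complete` is a hypothetical stand-in for whatever LLM completion call is available.

```python
# Illustrative sketch of a row-completion membership test; the paper's exact
# tests differ in detail. `complete` is a placeholder for an LLM call
# (hypothetical, not a specific real API).
import csv
import io

def rows_to_csv(rows):
    """Serialize rows into a canonical CSV string."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

def row_completion_test(rows, complete, n_prefix=10):
    """Prompt the model with the first n_prefix rows of a dataset and check
    whether it reproduces the next row verbatim; verbatim reproduction is
    evidence the dataset was seen during training."""
    prompt = rows_to_csv(rows[:n_prefix])
    target = rows_to_csv([rows[n_prefix]]).strip()
    output = complete(prompt).strip().splitlines()
    return bool(output) and output[0] == target
```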
- Are We on the Right Way for Evaluating Large Vision-Language Models? [92.5761176224556]
Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities.
We identify two primary issues: visual content is unnecessary for many samples, and unintentional data leakage exists in LLM and LVLM training.
We present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 samples meticulously selected by humans.
arXiv Detail & Related papers (2024-03-29T17:59:34Z)
- Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages [0.0]
Large Language Models (LLMs) have shown incredible proficiency at natural language processing tasks.
LLMs often struggle to perform well on low-resource languages because so little training data is available.
In this work, we explore training LLaMA-2 to speak Amharic, a language spoken by over 50 million people worldwide.
arXiv Detail & Related papers (2024-03-11T01:04:36Z)
- Datasets for Large Language Models: A Comprehensive Survey [37.153302283062004]
The survey consolidates and categorizes the fundamental aspects of LLM datasets from five perspectives.
The survey sheds light on the prevailing challenges and points out potential avenues for future investigation.
The surveyed data totals more than 774.5 TB for pre-training corpora and over 700M instances for other datasets.
arXiv Detail & Related papers (2024-02-28T04:35:51Z)
- Delving Deeper into Data Scaling in Masked Image Modeling [145.36501330782357]
We conduct an empirical study on the scaling capability of masked image modeling (MIM) methods for visual recognition.
Specifically, we utilize the web-collected Coyo-700M dataset.
Our goal is to investigate how the performance changes on downstream tasks when scaling with different sizes of data and models.
arXiv Detail & Related papers (2023-05-24T15:33:46Z)
- Benchmarking Multimodal AutoML for Tabular Data with Text Fields [83.43249184357053]
We assemble 18 multimodal data tables that each contain some text fields.
Our benchmark enables researchers to evaluate their own methods for supervised learning with numeric, categorical, and text features.
arXiv Detail & Related papers (2021-11-04T09:29:16Z)