Related papers: WildChat: 1M ChatGPT Interaction Logs in the Wild

URL: http://arxiv.org/abs/2405.01470v1
Date: Thu, 2 May 2024 17:00:02 GMT
Title: WildChat: 1M ChatGPT Interaction Logs in the Wild
Authors: Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, Yuntian Deng
Abstract summary: WildChat is a corpus of 1 million user-ChatGPT conversations, which consists of over 2.5 million interaction turns. In addition to timestamped chat transcripts, we enrich the dataset with demographic data, including state, country, and hashed IP addresses.
Score: 88.05964311416717
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Chatbots such as GPT-4 and ChatGPT are now serving millions of users. Despite their widespread use, there remains a lack of public datasets showcasing how these tools are used by a population of users in practice. To bridge this gap, we offered free access to ChatGPT for online users in exchange for their affirmative, consensual opt-in to anonymously collect their chat transcripts and request headers. From this, we compiled WildChat, a corpus of 1 million user-ChatGPT conversations, which consists of over 2.5 million interaction turns. We compare WildChat with other popular user-chatbot interaction datasets, and find that our dataset offers the most diverse user prompts, contains the largest number of languages, and presents the richest variety of potentially toxic use-cases for researchers to study. In addition to timestamped chat transcripts, we enrich the dataset with demographic data, including state, country, and hashed IP addresses, alongside request headers. This augmentation allows for more detailed analysis of user behaviors across different geographical regions and temporal dimensions. Finally, because it captures a broad range of use cases, we demonstrate the dataset's potential utility in fine-tuning instruction-following models. WildChat is released at https://wildchat.allen.ai under AI2 ImpACT Licenses.

Related papers

ShareChat: A Dataset of Chatbot Conversations in the Wild [11.008120181455316]
We present ShareChat, a large-scale, cross-platform corpus comprising 142,808 conversations and over 660,000 turns collected from publicly shared URLs across five major platforms. We show ShareChat offers substantially longer context windows and greater interaction depth than prior datasets.
arXiv Detail & Related papers (2025-12-19T17:47:53Z)
RICoTA: Red-teaming of In-the-wild Conversation with Test Attempts [6.0385743836962025]
RICoTA is a Korean red teaming dataset that consists of 609 prompts challenging large language models (LLMs). We utilize user-chatbot conversations that were self-posted on a Korean Reddit-like community. Our dataset will be made publicly available via GitHub.
arXiv Detail & Related papers (2025-01-29T15:32:27Z)
Bots can Snoop: Uncovering and Mitigating Privacy Risks of Bots in Group Chats [2.835537619294564]
SnoopGuard is a group messaging protocol that ensures user privacy against chatbots while maintaining strong end-to-end security. Our prototype implementation shows that sending a message in a group of 50 users takes about 30 milliseconds when integrated with Message Layer Security (MLS).
arXiv Detail & Related papers (2024-10-09T06:37:41Z)
WildVis: Open Source Visualizer for Million-Scale Chat Logs in the Wild [88.05964311416717]
We introduce WildVis, an interactive tool that enables fast, versatile, and large-scale conversation analysis. WildVis provides search and visualization capabilities in the text and embedding spaces based on a list of criteria. We demonstrate WildVis' utility through three case studies: facilitating misuse research, visualizing and comparing topic distributions across datasets, and characterizing user-specific conversation patterns.
arXiv Detail & Related papers (2024-09-05T17:59:15Z)
Are LLM-based methods good enough for detecting unfair terms of service? [67.49487557224415]
Large language models (LLMs) are good at parsing long text-based documents. We build a dataset consisting of 12 questions applied individually to a set of privacy policies. Some open-source models are able to provide a higher accuracy compared to some commercial models.
arXiv Detail & Related papers (2024-08-24T09:26:59Z)
Can ChatGPT Reproduce Human-Generated Labels? A Study of Social Computing Tasks [9.740764281808588]
ChatGPT has the potential to reproduce human-generated label annotations in social computing tasks. We relabel five datasets covering stance detection (2x), sentiment analysis, hate speech, and bot detection. Our results highlight that ChatGPT does have the potential to handle these data annotation tasks, although a number of challenges remain.
arXiv Detail & Related papers (2023-04-20T08:08:12Z)
Rewarding Chatbots for Real-World Engagement with Millions of Users [1.2583983802175422]
This work investigates the development of social chatbots that prioritize user engagement to enhance retention. The proposed approach uses automatic pseudo-labels collected from user interactions to train a reward model that can be used to reject low-scoring sample responses. A/B testing on groups of 10,000 new daily chatbot users on the Chai Research platform shows that this approach increases the mean conversation length (MCL) by up to 70%. Future work aims to use the reward model to realise a data fly-wheel, where the latest user conversations can be used to alternately fine-tune the language model and the reward model.
arXiv Detail & Related papers (2023-03-10T18:53:52Z)
Knowledge-Grounded Conversational Data Augmentation with Generative Conversational Networks [76.11480953550013]
We take a step towards automatically generating conversational data using Generative Conversational Networks. We evaluate our approach on conversations with and without knowledge on the Topical Chat dataset.
arXiv Detail & Related papers (2022-07-22T22:37:14Z)
Training Conversational Agents with Generative Conversational Networks [74.9941330874663]
We use Generative Conversational Networks to automatically generate data and train social conversational agents. We evaluate our approach on TopicalChat with automatic metrics and human evaluators, showing that with 10% of seed data it performs close to the baseline that uses 100% of the data.
arXiv Detail & Related papers (2021-10-15T21:46:39Z)
Pchatbot: A Large-Scale Dataset for Personalized Chatbot [49.16746174238548]
We introduce Pchatbot, a large-scale dialogue dataset that contains two subsets collected from Weibo and Judicial forums respectively. To adapt the raw dataset to dialogue systems, we elaborately normalize the raw dataset via processes such as anonymization. The scale of Pchatbot is significantly larger than existing Chinese datasets, which might benefit the data-driven models.
arXiv Detail & Related papers (2020-09-28T12:49:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.