Related papers: The Collective Turing Test: Large Language Models Can Generate Realistic Multi-User Discussions

URL: http://arxiv.org/abs/2511.08592v1
Date: Wed, 29 Oct 2025 17:01:20 GMT
Title: The Collective Turing Test: Large Language Models Can Generate Realistic Multi-User Discussions
Authors: Azza Bouleimen, Giordano De Marzo, Taehee Kim, Nicolò Pagan, Hannah Metzler, Silvia Giordano, David Garcia
Abstract summary: Large Language Models (LLMs) offer new avenues to simulate online communities and social media. We evaluated whether LLMs can convincingly mimic human group conversations on social media.
Score: 0.4605116997238364
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) offer new avenues to simulate online communities and social media. Potential applications range from testing the design of content recommendation algorithms to estimating the effects of content policies and interventions. However, the validity of using LLMs to simulate conversations between various users remains largely untested. We evaluated whether LLMs can convincingly mimic human group conversations on social media. We collected authentic human conversations from Reddit and generated artificial conversations on the same topic with two LLMs: Llama 3 70B and GPT-4o. When presented side-by-side to study participants, LLM-generated conversations were mistaken for human-created content 39% of the time. In particular, when evaluating conversations generated by Llama 3, participants correctly identified them as AI-generated only 56% of the time, barely better than random chance. Our study demonstrates that LLMs can generate social media conversations sufficiently realistic to deceive humans when reading them, highlighting both a promising potential for social simulation and a warning message about the potential misuse of LLMs to generate new inauthentic social media content.

Related papers

From Moderation to Mediation: Can LLMs Serve as Mediators in Online Flame Wars? [7.926773786209838]
Large language models (LLMs) have opened new possibilities for AI-for-good applications. This work explores whether LLMs can serve not only as moderators that detect harmful content, but as mediators capable of understanding and de-escalating online conflicts. Our framework decomposes mediation into two subtasks: judgment, where an LLM evaluates the fairness and emotional dynamics of a conversation, and steering, where it generates empathetic, de-escalatory messages.
arXiv Detail & Related papers (2025-12-02T18:31:18Z)
Promoting Online Safety by Simulating Unsafe Conversations with LLMs [1.7243216387069678]
Large language models (LLMs) have the potential -- and already are being used -- to increase the speed, scale, and types of unsafe conversations online. In our current work, we explore ways to promote online safety by teaching people about unsafe conversations that can occur online with and without LLMs.
arXiv Detail & Related papers (2025-07-29T22:38:21Z)
Can LLMs Simulate Social Media Engagement? A Study on Action-Guided Response Generation [51.44040615856536]
This paper analyzes large language models' ability to simulate social media engagement through action-guided response generation. We benchmark GPT-4o-mini, O1-mini, and DeepSeek-R1 in social media engagement simulation regarding a major societal event.
arXiv Detail & Related papers (2025-02-17T17:43:08Z)
GuideLLM: Exploring LLM-Guided Conversation with Applications in Autobiography Interviewing [73.8469700907927]
Large Language Models (LLMs) succeed in human-guided conversations such as instruction following and question answering. In this study, we first characterize LLM-guided conversation into three fundamental components: (i) Goal Navigation; (ii) Context Management; (iii) Empathetic Engagement. We compare GuideLLM with 6 state-of-the-art LLMs such as GPT-4o and Llama-3-70b-Instruct, from the perspective of interviewing quality and autobiography generation quality.
arXiv Detail & Related papers (2025-02-10T14:11:32Z)
NewsInterview: a Dataset and a Playground to Evaluate LLMs' Ground Gap via Informational Interviews [65.35458530702442]
We focus on journalistic interviews, a domain rich in grounding communication and abundant in data. We curate a dataset of 40,000 two-person informational interviews from NPR and CNN. LLMs are significantly less likely than human interviewers to use acknowledgements and to pivot to higher-level questions.
arXiv Detail & Related papers (2024-11-21T01:37:38Z)
Real or Robotic? Assessing Whether LLMs Accurately Simulate Qualities of Human Responses in Dialogue [25.89926022671521]
We generate a large-scale dataset of 100,000 paired LLM-LLM and human-LLM dialogues from the WildChat dataset. We find relatively low alignment between simulations and human interactions, demonstrating a systematic divergence along the multiple textual properties.
arXiv Detail & Related papers (2024-09-12T18:00:18Z)
LLM Roleplay: Simulating Human-Chatbot Interaction [52.03241266241294]
We propose a goal-oriented, persona-based method to automatically generate diverse multi-turn dialogues simulating human-chatbot interaction. Our method can simulate human-chatbot dialogues with a high indistinguishability rate.
arXiv Detail & Related papers (2024-07-04T14:49:46Z)
Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models [66.24055500785657]
Traditional turn-based chat systems prevent users from verbally interacting with the system while it is generating responses. To overcome these limitations, we adapt existing LLMs to listen to users while generating output and provide users with instant feedback. We build a dataset consisting of alternating time slices of queries and responses as well as covering typical feedback types in instantaneous interactions.
arXiv Detail & Related papers (2024-06-22T03:20:10Z)
BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues [72.65163468440434]
This report provides a preliminary evaluation of existing large language models for human-style multi-turn chatting. We prompt large language models (LLMs) to generate a full multi-turn dialogue based on the ChatSEED, utterance by utterance. We find GPT-4 can generate human-style multi-turn dialogues with impressive quality, significantly outperforming its counterparts.
arXiv Detail & Related papers (2023-10-20T16:53:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.