Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)
- URL: http://arxiv.org/abs/2510.22954v1
- Date: Mon, 27 Oct 2025 03:16:21 GMT
- Title: Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)
- Authors: Liwei Jiang, Yuanjun Chai, Margaret Li, Mickel Liu, Raymond Fok, Nouha Dziri, Yulia Tsvetkov, Maarten Sap, Alon Albalak, Yejin Choi
- Abstract summary: Language models (LMs) often struggle to generate diverse, human-like creative content. We introduce Infinity-Chat, a large-scale dataset of 26K diverse, real-world, open-ended user queries. We present a large-scale study of mode collapse in LMs, revealing a pronounced Artificial Hivemind effect.
- Score: 90.45301024940329
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models (LMs) often struggle to generate diverse, human-like creative content, raising concerns about the long-term homogenization of human thought through repeated exposure to similar outputs. Yet scalable methods for evaluating LM output diversity remain limited, especially beyond narrow tasks such as random number or name generation, or beyond repeated sampling from a single model. We introduce Infinity-Chat, a large-scale dataset of 26K diverse, real-world, open-ended user queries that admit a wide range of plausible answers with no single ground truth. We introduce the first comprehensive taxonomy for characterizing the full spectrum of open-ended prompts posed to LMs, comprising 6 top-level categories (e.g., brainstorm & ideation) that further break down into 17 subcategories. Using Infinity-Chat, we present a large-scale study of mode collapse in LMs, revealing a pronounced Artificial Hivemind effect in open-ended generation, characterized by (1) intra-model repetition, where a single model consistently generates similar responses, and, even more so, (2) inter-model homogeneity, where different models produce strikingly similar outputs. Infinity-Chat also includes 31,250 human annotations, across absolute ratings and pairwise preferences, with 25 independent human annotations per example. This enables studying collective and individual-specific human preferences in response to open-ended queries. Our findings show that LMs, reward models, and LM judges are less well calibrated to human ratings on model generations that elicit differing idiosyncratic annotator preferences, despite maintaining comparable overall quality. Overall, Infinity-Chat presents the first large-scale resource for systematically studying real-world open-ended queries to LMs, revealing critical insights to guide future research for mitigating long-term AI safety risks posed by the Artificial Hivemind.
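The repetition and homogeneity effects described in the abstract can be quantified, as a minimal sketch, by the mean pairwise similarity among responses to the same open-ended prompt. The bag-of-words embedding and the `homogeneity` helper below are illustrative assumptions for exposition, not the paper's actual metric:

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def homogeneity(responses: list[str]) -> float:
    # Mean pairwise similarity across responses to one prompt.
    # Higher values indicate stronger mode collapse.
    vecs = [Counter(r.lower().split()) for r in responses]
    pairs = list(combinations(vecs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

# Intra-model: responses sampled repeatedly from one model.
# Inter-model: one response per model, pooled across models.
samples = [
    "a quiet morning walk through the park",
    "a quiet walk through the morning park",
    "skydiving at dawn over the desert",
]
print(round(homogeneity(samples), 3))  # → 0.436
```

In practice one would swap the bag-of-words vectors for sentence embeddings, but the aggregation over pairs is the same.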
Related papers
- Seeing Is Believing? A Benchmark for Multimodal Large Language Models on Visual Illusions and Anomalies [40.03295633717008]
We introduce VIA-Bench, a benchmark designed to probe model performance on visual illusions and anomalies. We construct over 1K high-quality question-answer pairs that require nuanced visual reasoning. Our findings reveal a fundamental divergence between machine and human perception, suggesting that resolving such perceptual bottlenecks is critical for the advancement of artificial general intelligence.
arXiv Detail & Related papers (2026-02-02T08:48:03Z) - Generative Human-Object Interaction Detection via Differentiable Cognitive Steering of Multi-modal LLMs [85.69785384599827]
Human-object interaction (HOI) detection aims to localize human-object pairs and the interactions between them. Existing methods operate under a closed-world assumption, treating the task as a classification problem over a small, predefined verb set. We propose GRASP-HO, a novel Generative Reasoning And Steerable Perception framework that reformulates HOI detection from a closed-set classification task into an open-vocabulary generation problem.
arXiv Detail & Related papers (2025-12-19T14:41:50Z) - Who are you, ChatGPT? Personality and Demographic Style in LLM-Generated Content [5.515596385935823]
Generative large language models (LLMs) have become central to everyday life, producing human-like text across diverse domains. A growing body of research investigates whether these models also exhibit personality- and demographic-like characteristics in their language. We introduce a novel, data-driven methodology for assessing LLM personality without relying on self-report questionnaires, instead applying automatic personality and gender classifiers to model replies to open-ended questions collected from Reddit.
arXiv Detail & Related papers (2025-10-13T14:06:17Z) - Modeling Open-World Cognition as On-Demand Synthesis of Probabilistic Models [93.1043186636177]
We explore the hypothesis that people use a combination of distributed and symbolic representations to construct bespoke mental models tailored to novel situations. We propose a computational implementation of this idea, a "Model Synthesis Architecture" (MSA). We evaluate our MSA as a model of human judgments on a novel reasoning dataset.
arXiv Detail & Related papers (2025-07-16T18:01:03Z) - Empirically evaluating commonsense intelligence in large language models with large-scale human judgments [4.212429064310439]
We propose a method for evaluating common sense in artificial intelligence. We measure the correspondence between a model's judgment and that of a human population. Our framework contributes to the growing call for adapting AI models to human collectivities that possess different, often incompatible, social stocks of knowledge.
arXiv Detail & Related papers (2025-05-15T13:55:27Z) - Deep Generative Model-Based Generation of Synthetic Individual-Specific Brain MRI Segmentations [6.66216112298345]
We propose the first approach capable of generating synthetic brain MRI segmentations for individuals. Our approach features a novel deep generative model, CSeg Synth, which outperforms existing prominent generative models. In assessing the effectiveness of the individual-specific generation, we achieve superior volume prediction, with mean absolute errors of only 36.44 mL, 29.20 mL, and 35.51 mL.
arXiv Detail & Related papers (2025-04-15T21:25:36Z) - Giving AI Personalities Leads to More Human-Like Reasoning [7.124736158080938]
We investigate the potential of AI to mimic diverse reasoning behaviors across a human population. We designed reasoning tasks using a novel generalization of the Natural Language Inference (NLI) format. We used personality-based prompting inspired by the Big Five personality model to elicit AI responses reflecting specific personality traits.
arXiv Detail & Related papers (2025-02-19T23:51:23Z) - Evaluation of Large Language Models via Coupled Token Generation [19.187846871216568]
State-of-the-art large language models rely on randomization to respond to a prompt. We argue that the evaluation and ranking of large language models should control for the randomization underpinning their functioning.
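One way to control for that randomization, sketched here as an assumption rather than the paper's actual procedure, is coupled sampling: drive both models' token draws with the same uniform random number via inverse-CDF sampling, so output differences reflect the models rather than the sampling noise:

```python
def coupled_sample(dist_a, dist_b, u):
    # Inverse-CDF sampling: the same uniform draw u drives both models,
    # so any difference in output reflects the models, not the randomness.
    def pick(dist):
        acc = 0.0
        for token, p in dist:
            acc += p
            if u <= acc:
                return token
        return dist[-1][0]  # guard against floating-point shortfall
    return pick(dist_a), pick(dist_b)

# Hypothetical next-token distributions from two models under comparison.
dist_a = [("yes", 0.6), ("no", 0.4)]
dist_b = [("yes", 0.5), ("no", 0.5)]
print(coupled_sample(dist_a, dist_b, 0.55))  # → ('yes', 'no')
```

Repeating this with a shared stream of uniform draws at every decoding step yields paired generations whose differences are attributable to the models themselves.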
arXiv Detail & Related papers (2025-02-03T19:01:17Z) - MONAL: Model Autophagy Analysis for Modeling Human-AI Interactions [11.972017738888825]
We propose Model Autophagy Analysis (MONAL) for large models' self-consumption explanation.
MONAL employs two distinct autophagous loops to elucidate the suppression of human-generated information in the exchange between human and AI systems.
We evaluate the capacities of generated models as both creators and disseminators of information.
arXiv Detail & Related papers (2024-02-17T13:02:54Z) - Fine-tuning Language Models for Factuality [96.5203774943198]
Large pre-trained language models (LLMs) are now in widespread use, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'.
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z) - Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z) - AvgOut: A Simple Output-Probability Measure to Eliminate Dull Responses [97.50616524350123]
We build dialogue models that are dynamically aware of what utterances or tokens are dull without any feature-engineering.
The first model, MinAvgOut, directly maximizes the diversity score through the output distributions of each batch.
The second model, Label Fine-Tuning (LFT), prepends to the source sequence a label continuously scaled by the diversity score to control the diversity level.
The third model, RL, adopts Reinforcement Learning and treats the diversity score as a reward signal.
arXiv Detail & Related papers (2020-01-15T18:32:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.