$\texttt{BluePrint}$: A Social Media User Dataset for LLM Persona Evaluation and Training
- URL: http://arxiv.org/abs/2510.02343v1
- Date: Sat, 27 Sep 2025 06:02:38 GMT
- Title: $\texttt{BluePrint}$: A Social Media User Dataset for LLM Persona Evaluation and Training
- Authors: Aurélien Bück-Kaeffer, Je Qin Chooi, Dan Zhao, Maximilian Puelma Touzel, Kellin Pelrine, Jean-François Godbout, Reihaneh Rabbany, Zachary Yang
- Abstract summary: Large language models (LLMs) offer promising capabilities for simulating social media dynamics at scale. We introduce SIMPACT, a framework for constructing behaviorally-grounded social media datasets suitable for training agent models. We release BluePrint, a large-scale dataset built from public Bluesky data focused on political discourse.
- Score: 8.563967699751684
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models (LLMs) offer promising capabilities for simulating social media dynamics at scale, enabling studies that would be ethically or logistically challenging with human subjects. However, the field lacks standardized data resources for fine-tuning and evaluating LLMs as realistic social media agents. We address this gap by introducing SIMPACT, the SIMulation-oriented Persona and Action Capture Toolkit, a privacy-respecting framework for constructing behaviorally-grounded social media datasets suitable for training agent models. We formulate next-action prediction as a task for training and evaluating LLM-based agents and introduce metrics at both the cluster and population levels to assess behavioral fidelity and stylistic realism. As a concrete implementation, we release BluePrint, a large-scale dataset built from public Bluesky data focused on political discourse. BluePrint clusters anonymized users into personas of aggregated behaviours, capturing authentic engagement patterns while safeguarding privacy through pseudonymization and removal of personally identifiable information. The dataset includes a sizable action set of 12 social media interaction types (likes, replies, reposts, etc.), with each instance tied to the posting activity preceding it. This supports the development of agents that model social media users with context dependence not only in their language but also in their interaction behaviours. By standardizing data and evaluation protocols, SIMPACT provides a foundation for advancing rigorous, ethically responsible social media simulations. BluePrint serves as both an evaluation benchmark for political discourse modeling and a template for building domain-specific datasets to study challenges such as misinformation and polarization.
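The next-action prediction setup described in the abstract, where each action is tied to the activity preceding it and user identifiers are pseudonymized, could be sketched roughly as follows. This is only an illustrative sketch, not the SIMPACT pipeline or the BluePrint schema: the `Event` record, the `pseudonymize` and `next_action_pairs` helpers, the salted-hash scheme, and the fixed context window are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    """Hypothetical record of one social media action (not the BluePrint schema)."""
    user_id: str
    action: str  # e.g. "post", "like", "reply", "repost"
    text: str = ""


def pseudonymize(user_id: str, salt: str) -> str:
    """Map a user identifier to a salted SHA-256 digest prefix (assumed scheme)."""
    digest = hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()
    return digest[:12]


def next_action_pairs(
    events: List[Event], context_size: int = 2
) -> List[Tuple[List[Event], str]]:
    """Pair each action with up to `context_size` events that precede it."""
    pairs = []
    for i in range(1, len(events)):
        context = events[max(0, i - context_size):i]
        pairs.append((context, events[i].action))
    return pairs


# Toy timeline with pseudonymized authors; the salt value is a placeholder.
SALT = "example-salt"
timeline = [
    Event(pseudonymize("alice", SALT), "post", "Election night thread"),
    Event(pseudonymize("bob", SALT), "like"),
    Event(pseudonymize("alice", SALT), "reply", "Thanks for reading"),
    Event(pseudonymize("carol", SALT), "repost"),
]
training_pairs = next_action_pairs(timeline, context_size=2)
```

Each element of `training_pairs` holds a short context window and the action label that followed it, which is the shape a next-action classifier or a fine-tuned LLM prompt would consume under these assumptions.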
Related papers
- Interpretable Debiasing of Vision-Language Models for Social Fairness [55.85977929985967]
We introduce an interpretable, model-agnostic bias mitigation framework, DeBiasLens, that localizes social attribute neurons in Vision-Language models. We train SAEs on facial image or caption datasets without corresponding social attribute labels to uncover neurons highly responsive to specific demographics. Our research lays the groundwork for future auditing tools, prioritizing social fairness in emerging real-world AI systems.
arXiv Detail & Related papers (2026-02-27T13:37:11Z)
- HumanLLM: Towards Personalized Understanding and Simulation of Human Nature [72.55730315685837]
HumanLLM is a foundation model designed for personalized understanding and simulation of individuals. We first construct the Cognitive Genome, a large-scale corpus curated from real-world user data on platforms like Reddit, Twitter, Blogger, and Amazon. We then formulate diverse learning tasks and perform supervised fine-tuning to empower the model to predict a wide range of individualized human behaviors, thoughts, and experiences.
arXiv Detail & Related papers (2026-01-22T09:27:27Z)
- Agent-based simulation of online social networks and disinformation [35.38015952060615]
This paper presents a simulation framework that models synthetic social networks with agents endowed with demographic-based personality traits and finite-state behavioral automata. A generative module powered by a large language model (LLM) produces context-aware social media posts consistent with each agent's profile and memory. A red module implements DISARM-inspired disinformation campaigns executed by malicious agents targeting simulated audiences.
arXiv Detail & Related papers (2025-12-26T16:56:45Z)
- Social-Media Based Personas Challenge: Hybrid Prediction of Common and Rare User Actions on Bluesky [0.7305019142196582]
This paper presents a hybrid methodology for social media user behavior prediction. It addresses both frequent and infrequent actions across a diverse action vocabulary. Our approach achieved first place in the SocialSim: Social-Media Based Personas challenge.
arXiv Detail & Related papers (2025-11-21T13:40:14Z)
- Simulating and Experimenting with Social Media Mobilization Using LLM Agents [7.262048441360133]
Building on the landmark 61-million-person Facebook experiment (Bond et al., 2012), we develop an agent-based simulation framework. We integrate real U.S. Census demographic distributions, authentic Twitter network topology, and heterogeneous large language model (LLM) agents to examine the effect of mobilization messages on voter turnout.
arXiv Detail & Related papers (2025-10-30T13:43:28Z)
- Population-Aligned Persona Generation for LLM-based Social Simulation [58.8436379542149]
We propose a systematic framework for synthesizing high-quality, population-aligned persona sets for social simulation. Our approach begins by leveraging large language models to generate narrative personas from long-term social media data. To address the needs of specific simulation contexts, we introduce a task-specific module that adapts the globally aligned persona set to targeted subpopulations.
arXiv Detail & Related papers (2025-09-12T10:43:47Z)
- PANORAMA: A synthetic PII-laced dataset for studying sensitive data memorization in LLMs [0.0]
Memorization of sensitive and personally identifiable information poses growing privacy risks. Existing efforts to study sensitive and PII data memorization and develop mitigation strategies are hampered by the absence of realistic datasets. We introduce PANORAMA - Profile-based Assemblage for Naturalistic Online Representation and Attribute Memorization Analysis.
arXiv Detail & Related papers (2025-05-18T05:27:35Z)
- SCRAG: Social Computing-Based Retrieval Augmented Generation for Community Response Forecasting in Social Media Environments [8.743208265682014]
SCRAG is a prediction framework inspired by social computing. It forecasts community responses to real or hypothetical social media posts. It can be used by public relations specialists to craft messaging in ways that avoid unintended misinterpretations.
arXiv Detail & Related papers (2025-04-18T15:02:31Z)
- Agentic Society: Merging skeleton from real world and texture from Large Language Model [4.740886789811429]
This paper explores a novel framework that leverages census data and large language models to generate virtual populations.
We show that our method produces personas with variability essential for simulating diverse human behaviors in social science experiments.
However, the evaluation shows that only weak signs of statistical truthfulness can be produced, owing to the limited capability of current LLMs.
arXiv Detail & Related papers (2024-09-02T08:28:19Z)
- PersLLM: A Personified Training Approach for Large Language Models [66.16513246245401]
We propose PersLLM, a framework for better data construction and model tuning. For insufficient data usage, we incorporate strategies such as Chain-of-Thought prompting and anti-induction. For rigid behavior patterns, we design the tuning process and introduce automated DPO to enhance the specificity and dynamism of the models' personalities.
arXiv Detail & Related papers (2024-07-17T08:13:22Z)
- Modeling Political Orientation of Social Media Posts: An Extended Analysis [0.0]
Developing machine learning models to characterize political polarization on online social media presents significant challenges.
These challenges mainly stem from various factors such as the lack of annotated data, presence of noise in social media datasets, and the sheer volume of data.
We introduce two methods that leverage news media bias and post content to label social media posts.
We demonstrate that current machine learning models can exhibit improved performance in predicting political orientation of social media posts.
arXiv Detail & Related papers (2023-11-21T03:34:20Z)
- Bias and Fairness in Large Language Models: A Survey [73.87651986156006]
We present a comprehensive survey of bias evaluation and mitigation techniques for large language models (LLMs).
We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing.
We then unify the literature by proposing three intuitive taxonomies, two for bias evaluation and one for mitigation.
arXiv Detail & Related papers (2023-09-02T00:32:55Z)
- Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
- Social Processes: Self-Supervised Forecasting of Nonverbal Cues in Social Conversations [22.302509912465077]
We take the first step in the direction of a bottom-up self-supervised approach in the domain of social human interactions.
We formulate the task of Social Cue Forecasting to leverage the larger amount of unlabeled low-level behavior cues.
We propose the Social Process (SP) models--socially aware sequence-to-sequence (Seq2Seq) models within the Neural Process (NP) family.
arXiv Detail & Related papers (2021-07-28T18:01:08Z)