Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction
- URL: http://arxiv.org/abs/2602.22752v1
- Date: Thu, 26 Feb 2026 08:40:21 GMT
- Title: Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction
- Authors: Nils Schwager, Simon Münker, Alistair Plum, Achim Rettinger
- Abstract summary: This study introduces Conditioned Comment Prediction (CCP), a task in which a model predicts how a user would comment on a given stimulus. We evaluate open-weight 8B models (Llama3.1, Qwen3, Ministral) in English, German, and Luxembourgish language scenarios.
- Score: 2.5450067638785945
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The transition of Large Language Models (LLMs) from exploratory tools to active "silicon subjects" in social science lacks extensive validation of operational validity. This study introduces Conditioned Comment Prediction (CCP), a task in which a model predicts how a user would comment on a given stimulus; predictions are evaluated by comparing generated outputs with authentic digital traces. This framework enables a rigorous evaluation of current LLM capabilities with respect to the simulation of social media user behavior. We evaluate open-weight 8B models (Llama3.1, Qwen3, Ministral) in English, German, and Luxembourgish language scenarios. By systematically comparing prompting strategies (explicit vs. implicit) and the impact of Supervised Fine-Tuning (SFT), we identify a critical form vs. content decoupling in low-resource settings: while SFT aligns the surface structure of the text output (length and syntax), it degrades semantic grounding. Furthermore, we demonstrate that explicit conditioning (generated biographies) becomes redundant under fine-tuning, as models successfully perform latent inference directly from behavioral histories. Our findings challenge current "naive prompting" paradigms and offer operational guidelines prioritizing authentic behavioral traces over descriptive personas for high-fidelity simulation.
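To make the task concrete, here is a minimal sketch of a CCP-style evaluation loop, assuming generic `generate` (an LLM call) and `similarity` (e.g., embedding cosine) backends; the prompt templates and function names are illustrative, not the authors' exact protocol.

```python
# Minimal sketch of a Conditioned Comment Prediction (CCP) evaluation.
# `generate` (LLM call) and `similarity` (e.g., embedding cosine) are
# backend placeholders; the prompt templates are illustrative only.
from typing import Callable

def explicit_prompt(biography: str, stimulus: str) -> str:
    # Explicit conditioning: a generated persona biography.
    return (f"User biography: {biography}\n"
            f"Post: {stimulus}\n"
            "Write the comment this user would leave:")

def implicit_prompt(history: list[str], stimulus: str) -> str:
    # Implicit conditioning: raw behavioral traces, no persona summary.
    traces = "\n".join(f"- {c}" for c in history)
    return (f"Previous comments by this user:\n{traces}\n"
            f"Post: {stimulus}\n"
            "Write the comment this user would leave:")

def ccp_score(generate: Callable[[str], str],
              similarity: Callable[[str, str], float],
              prompt: str,
              authentic_comment: str) -> float:
    # Score the generated comment against the authentic digital trace.
    return similarity(generate(prompt), authentic_comment)
```

Running `ccp_score` once with `explicit_prompt` and once with `implicit_prompt` per (user, stimulus) pair yields the explicit-vs-implicit comparison the abstract describes.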
Related papers
- Do LLM Self-Explanations Help Users Predict Model Behavior? Evaluating Counterfactual Simulatability with Pragmatic Perturbations [1.8772057593980798]
Large Language Models (LLMs) can produce verbalized self-explanations. We evaluate how well humans and LLM judges can predict a model's answers to counterfactual follow-up questions.
arXiv Detail & Related papers (2026-01-07T10:13:26Z)
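A minimal sketch of how such a counterfactual simulatability score could be computed, assuming placeholder callables for the model and the judge; exact-match agreement is a simplifying assumption, not the paper's metric.

```python
# Sketch of a counterfactual simulatability score: can a judge, given the
# model's self-explanation, predict what the model will answer on a
# perturbed follow-up question? All three callables are placeholders.
def simulatability(model_answer, judge_predict, items) -> float:
    # items: (question, self_explanation, counterfactual_question) triples.
    hits = 0
    for question, explanation, counterfactual in items:
        actual = model_answer(counterfactual)         # model's real answer
        guess = judge_predict(question, explanation, counterfactual)
        hits += int(guess == actual)                  # exact-match agreement
    return hits / len(items)
```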
- See, Think, Act: Online Shopper Behavior Simulation with VLM Agents [58.92444959954643]
This paper investigates the integration of visual information, specifically webpage screenshots, into behavior simulation via VLMs. We employ SFT for joint action prediction and rationale generation, conditioning on the full interaction context. To further enhance reasoning capabilities, we integrate RL with a hierarchical reward structure, scaled by a difficulty-aware factor.
arXiv Detail & Related papers (2025-10-22T05:07:14Z)
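One way such a reward could look, as a sketch in the spirit of the setup above; the weights and the form of the difficulty term are invented for illustration.

```python
# Sketch of a hierarchical reward with a difficulty-aware scale factor,
# in the spirit of the RL setup described above; the weights and the
# difficulty term are invented for illustration.
def shaped_reward(action_correct: bool, rationale_quality: float,
                  difficulty: float) -> float:
    # Hierarchy: the action outcome dominates; rationale quality (in [0, 1])
    # refines it. Harder episodes (difficulty in [0, 1]) are weighted up.
    base = 1.0 if action_correct else 0.0
    shaped = base + 0.5 * rationale_quality
    return shaped * (1.0 + difficulty)
```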
- When Meaning Stays the Same, but Models Drift: Evaluating Quality of Service under Token-Level Behavioral Instability in LLMs [4.265831047857601]
We investigate how large language models respond to prompts that differ only in their token-level realization but preserve the same semantic intent. We propose Prompt-Based Semantic Shift (PBSS), a diagnostic framework for measuring behavioral drift in LLMs under semantically equivalent prompt rewordings.
arXiv Detail & Related papers (2025-06-11T18:26:32Z)
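A PBSS-style drift probe could be sketched as follows, assuming placeholder `generate` and `embed` backends; mean pairwise cosine distance is one simple choice of spread measure, not necessarily the paper's.

```python
# Sketch of a PBSS-style drift probe: generate a response for each
# semantically equivalent rewording of a prompt and measure how far the
# responses spread in embedding space. `generate` and `embed` are
# backend placeholders; the distance choice is an assumption.
import itertools
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def behavioral_drift(generate, embed, rewordings: list[str]) -> float:
    vecs = [embed(generate(p)) for p in rewordings]
    pairs = list(itertools.combinations(vecs, 2))
    # 0.0 would mean perfectly stable behavior across rewordings.
    return sum(1.0 - cosine(a, b) for a, b in pairs) / len(pairs)
```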
- Sense and Sensitivity: Evaluating the simulation of social dynamics via Large Language Models [27.313165173789233]
Large language models have been proposed as a powerful replacement for classical agent-based models (ABMs) to simulate social dynamics. However, due to the black box nature of LLMs, it is unclear whether LLM agents actually execute the intended semantics. We show that while it is possible to engineer prompts that approximate the intended dynamics, the quality of these simulations is highly sensitive to the particular choice of prompts.
arXiv Detail & Related papers (2024-12-06T14:50:01Z)
- Designing Domain-Specific Large Language Models: The Critical Role of Fine-Tuning in Public Opinion Simulation [0.0]
This paper introduces a novel fine-tuning approach that integrates socio-demographic data from the UK Household Longitudinal Study. The fine-tuned models, which emulate diverse synthetic profiles, significantly outperform their pre-trained counterparts. Its broader implications include deploying LLMs in domains like healthcare and education, fostering inclusive and data-driven decision-making.
arXiv Detail & Related papers (2024-09-28T10:39:23Z)
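A sketch of how socio-demographic records might be turned into SFT pairs, loosely following the profile-conditioning idea above; the field names and prompt template are invented, not the paper's actual format.

```python
# Sketch of turning socio-demographic records into SFT pairs; the field
# names and the prompt template are illustrative assumptions.
def to_sft_example(profile: dict, question: str, answer: str) -> dict:
    persona = ", ".join(f"{k}: {v}" for k, v in profile.items())
    return {
        "prompt": f"You are a survey respondent ({persona}).\nQ: {question}\nA:",
        "completion": f" {answer}",
    }

example = to_sft_example(
    {"age": 42, "region": "Scotland", "education": "degree"},
    "How satisfied are you with local public transport?",
    "Somewhat dissatisfied",
)
```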
- LangSuitE: Planning, Controlling and Interacting with Large Language Models in Embodied Text Environments [70.91258869156353]
We introduce LangSuitE, a versatile and simulation-free testbed featuring 6 representative embodied tasks in textual embodied worlds.
Compared with previous LLM-based testbeds, LangSuitE offers adaptability to diverse environments without multiple simulation engines.
We devise a novel chain-of-thought (CoT) schema, EmMem, which summarizes embodied states w.r.t. history information.
arXiv Detail & Related papers (2024-06-24T03:36:29Z)
- CELA: Cost-Efficient Language Model Alignment for CTR Prediction [70.65910069412944]
Click-Through Rate (CTR) prediction holds a paramount position in recommender systems. Recent efforts have sought to mitigate these challenges by integrating Pre-trained Language Models (PLMs). We propose Cost-Efficient Language Model Alignment (CELA) for CTR prediction.
arXiv Detail & Related papers (2024-05-17T07:43:25Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
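A minimal sketch of what an SES-style judgment could look like, assuming a placeholder `generate` LLM call; the rubric prompt and the 0-to-1 scale are assumptions, not the paper's exact SES formulation.

```python
# Sketch of a Self-Evaluation-Score-style judgment: an LLM is asked
# directly whether the modified argument still upholds the required
# logical relation. `generate` is a placeholder LLM call; the rubric
# prompt is an assumption, not the paper's exact SES formulation.
def self_evaluation_score(generate, original: str, modified: str,
                          relation: str) -> float:
    prompt = (f"Original argument:\n{original}\n\n"
              f"Modified argument:\n{modified}\n\n"
              f"Does the modification uphold the relation '{relation}'? "
              "Reply with a single score from 0 (no) to 1 (fully).")
    try:
        return float(generate(prompt).strip())
    except ValueError:
        return 0.0  # an unparseable judgment counts as failure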
- Rethinking the Evaluation for Conversational Recommendation in the Era of Large Language Models [115.7508325840751]
The recent success of large language models (LLMs) has shown great potential for developing more powerful conversational recommender systems (CRSs).
In this paper, we embark on an investigation into the utilization of ChatGPT for conversational recommendation, revealing the inadequacy of the existing evaluation protocol.
We propose an interactive evaluation approach based on LLMs, named iEvaLM, that harnesses LLM-based user simulators.
arXiv Detail & Related papers (2023-05-22T15:12:43Z)
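An iEvaLM-style interaction could be sketched as the loop below, with placeholder agents; the exact-match success criterion and turn budget are simplifications for illustration.

```python
# Sketch of an iEvaLM-style interactive evaluation: an LLM-based user
# simulator converses with the recommender until the target item is
# recommended or the turn budget runs out. Both agents are placeholders;
# the success criterion is simplified for illustration.
def interactive_eval(recommend, simulate_user, target_item: str,
                     max_turns: int = 5) -> bool:
    dialogue: list[str] = []
    for _ in range(max_turns):
        suggestion = recommend(dialogue)                 # system turn
        dialogue.append(f"System: {suggestion}")
        if suggestion == target_item:
            return True                                  # hit within budget
        feedback = simulate_user(dialogue, target_item)  # simulated user turn
        dialogue.append(f"User: {feedback}")
    return False
```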
- Explaining Language Models' Predictions with High-Impact Concepts [11.47612457613113]
We propose a complete framework for extending concept-based interpretability methods to NLP.
We optimize for features whose existence causes the output predictions to change substantially.
Our method achieves superior results on predictive impact, usability, and faithfulness compared to the baselines.
arXiv Detail & Related papers (2023-05-03T14:48:27Z)
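A sketch of how predictive impact for a concept could be measured, assuming placeholder `predict_proba` and `ablate` operations; mean absolute probability shift is one simple choice, not necessarily the paper's.

```python
# Sketch of scoring a concept by predictive impact: how much the model's
# output probability shifts when the concept is ablated from the input.
# `predict_proba` and `ablate` are placeholders for the model and the
# concept-removal operation.
def concept_impact(predict_proba, ablate, inputs, concept) -> float:
    shifts = [abs(predict_proba(x) - predict_proba(ablate(x, concept)))
              for x in inputs]
    return sum(shifts) / len(shifts)  # high value = high-impact concept
```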
- Did the Cat Drink the Coffee? Challenging Transformers with Generalized Event Knowledge [59.22170796793179]
Transformer Language Models (TLMs) were tested on a benchmark for the dynamic estimation of thematic fit.
Our results show that TLMs can reach performances that are comparable to those achieved by SDM.
However, additional analysis consistently suggests that TLMs do not capture important aspects of event knowledge.
arXiv Detail & Related papers (2021-07-22T20:52:26Z)