Next Reply Prediction X Dataset: Linguistic Discrepancies in Naively Generated Content
- URL: http://arxiv.org/abs/2602.19177v1
- Date: Sun, 22 Feb 2026 13:14:27 GMT
- Title: Next Reply Prediction X Dataset: Linguistic Discrepancies in Naively Generated Content
- Authors: Simon Münker, Nils Schwager, Kai Kugler, Michael Heseltine, Achim Rettinger
- Abstract summary: The increasing use of Large Language Models (LLMs) as proxies for human participants in social science research presents a promising, yet methodologically risky, paradigm shift. This paper addresses these limitations by introducing a novel, history-conditioned reply prediction task on authentic X (formerly Twitter) data. We analyze these discrepancies using stylistic and content-based metrics, providing a quantitative framework for researchers to assess the quality and authenticity of synthetic data.
- Score: 1.215922138351105
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The increasing use of Large Language Models (LLMs) as proxies for human participants in social science research presents a promising, yet methodologically risky, paradigm shift. While LLMs offer scalability and cost-efficiency, their "naive" application, where they are prompted to generate content without explicit behavioral constraints, introduces significant linguistic discrepancies that challenge the validity of research findings. This paper addresses these limitations by introducing a novel, history-conditioned reply prediction task on authentic X (formerly Twitter) data, to create a dataset designed to evaluate the linguistic output of LLMs against human-generated content. We analyze these discrepancies using stylistic and content-based metrics, providing a quantitative framework for researchers to assess the quality and authenticity of synthetic data. Our findings highlight the need for more sophisticated prompting techniques and specialized datasets to ensure that LLM-generated content accurately reflects the complex linguistic patterns of human communication, thereby improving the validity of computational social science studies.
Related papers
- Computational Turing Test Reveals Systematic Differences Between Human and AI Language [0.0]
Large language models (LLMs) are increasingly used in the social sciences to simulate human behavior. Existing validation efforts rely heavily on human-judgment-based evaluations. This paper introduces a computational Turing test to assess how closely LLMs approximate human language.
arXiv Detail & Related papers (2025-11-06T08:56:37Z)
- How Do LLM-Generated Texts Impact Term-Based Retrieval Models? [76.92519309816008]
This paper investigates the influence of large language models (LLMs) on term-based retrieval models. Our linguistic analysis reveals that LLM-generated texts exhibit smoother high-frequency and steeper low-frequency Zipf slopes. Our study further explores whether term-based retrieval models demonstrate source bias, concluding that these models prioritize documents whose term distributions closely correspond to those of the queries.
arXiv Detail & Related papers (2025-08-25T06:43:27Z)
- Privacy-Preserving Synthetic Review Generation with Diverse Writing Styles Using LLMs [6.719863580831653]
Synthetic data generated by Large Language Models (LLMs) provides a cost-effective, scalable alternative to real-world data to facilitate model training. We quantitatively assess the diversity (i.e., linguistic expression, sentiment, and user perspective) of synthetic datasets generated by several state-of-the-art LLMs. Guided by the evaluation results, a prompt-based approach is proposed to enhance the diversity of synthetic reviews while preserving reviewer privacy.
arXiv Detail & Related papers (2025-07-24T03:12:16Z)
- Synthetic Data Generation for Phrase Break Prediction with Large Language Model [5.483546934298434]
Large language models (LLMs) have shown success in addressing data challenges in NLP. We explore leveraging an LLM to generate synthetic phrase break annotations. Our findings suggest that LLM-based synthetic data generation effectively mitigates data challenges in phrase break prediction.
arXiv Detail & Related papers (2025-07-24T02:45:03Z)
- Synthetic Data Generation Using Large Language Models: Advances in Text and Code [0.0]
Large language models (LLMs) are transforming synthetic training data generation in both natural language and code domains. We highlight key techniques such as prompt-based generation, retrieval-augmented pipelines, and iterative self-refinement. We discuss the accompanying challenges, including factual inaccuracies in generated text, insufficient stylistic or distributional realism, and risks of bias amplification.
arXiv Detail & Related papers (2025-03-18T08:34:03Z)
- Systematic Task Exploration with LLMs: A Study in Citation Text Generation [63.50597360948099]
Large language models (LLMs) bring unprecedented flexibility in defining and executing complex, creative natural language generation (NLG) tasks.
We propose a three-component research framework that consists of systematic input manipulation, reference data, and output measurement.
We use this framework to explore citation text generation -- a popular scholarly NLP task that lacks consensus on the task definition and evaluation metric.
arXiv Detail & Related papers (2024-07-04T16:41:08Z)
- LLM-DA: Data Augmentation via Large Language Models for Few-Shot Named Entity Recognition [67.96794382040547]
LLM-DA is a novel data augmentation technique based on large language models (LLMs) for the few-shot NER task.
Our approach involves employing 14 contextual rewriting strategies, designing entity replacements of the same type, and incorporating noise injection to enhance robustness.
arXiv Detail & Related papers (2024-02-22T14:19:56Z)
- Had enough of experts? Quantitative knowledge retrieval from large language models [4.091195951668217]
Large language models (LLMs) have been extensively studied for their abilities to generate convincing natural language sequences. We introduce a framework that leverages LLMs to enhance Bayesian models by eliciting expert-like prior knowledge and imputing missing data.
arXiv Detail & Related papers (2024-02-12T16:32:37Z)
- Evaluating, Understanding, and Improving Constrained Text Generation for Large Language Models [49.74036826946397]
This study investigates constrained text generation for large language models (LLMs).
Our research mainly focuses on mainstream open-source LLMs, categorizing constraints into lexical, structural, and relation-based types.
Results illuminate LLMs' capacity and deficiency to incorporate constraints and provide insights for future developments in constrained text generation.
arXiv Detail & Related papers (2023-10-25T03:58:49Z)
- Bias and Fairness in Large Language Models: A Survey [73.87651986156006]
We present a comprehensive survey of bias evaluation and mitigation techniques for large language models (LLMs).
We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing.
We then unify the literature by proposing three intuitive taxonomies, two for bias evaluation and one for mitigation.
arXiv Detail & Related papers (2023-09-02T00:32:55Z)
- Ground-Truth, Whose Truth? -- Examining the Challenges with Annotating Toxic Text Datasets [26.486492641924226]
This study examines selected toxic text datasets with the goal of shedding light on some of the inherent issues.
We re-annotate samples from three toxic text datasets and find that a multi-label approach to annotating toxic text samples can help to improve dataset quality.
arXiv Detail & Related papers (2021-12-07T06:58:22Z)
- Data Augmentation for Spoken Language Understanding via Pretrained Language Models [113.56329266325902]
Training of spoken language understanding (SLU) models often faces the problem of data scarcity.
We put forward a data augmentation method using pretrained language models to boost the variability and accuracy of generated utterances.
arXiv Detail & Related papers (2020-04-29T04:07:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.