Related papers: How Well Do LLMs Imitate Human Writing Style?

How Well Do LLMs Imitate Human Writing Style?

URL: http://arxiv.org/abs/2509.24930v1
Date: Mon, 29 Sep 2025 15:34:40 GMT
Title: How Well Do LLMs Imitate Human Writing Style?
Authors: Rebira Jemama, Rajesh Kumar
Abstract summary: Large language models (LLMs) can generate fluent text, but their ability to replicate the distinctive style of a specific human author remains unclear. We present a fast, training-free framework for authorship verification and style imitation analysis. It achieves 97.5% accuracy on academic essays and 94.5% in cross-domain evaluation.
Score: 2.3754840025365183
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large language models (LLMs) can generate fluent text, but their ability to replicate the distinctive style of a specific human author remains unclear. We present a fast, training-free framework for authorship verification and style imitation analysis. The method integrates TF-IDF character n-grams with transformer embeddings and classifies text pairs through empirical distance distributions, eliminating the need for supervised training or threshold tuning. It achieves 97.5% accuracy on academic essays and 94.5% in cross-domain evaluation, while reducing training time by 91.8% and memory usage by 59% relative to parameter-based baselines. Using this framework, we evaluate five LLMs from three separate families (Llama, Qwen, Mixtral) across four prompting strategies: zero-shot, one-shot, few-shot, and text completion. Results show that the prompting strategy has a more substantial influence on style fidelity than model size: few-shot prompting yields up to 23.5x higher style-matching accuracy than zero-shot, and completion prompting reaches 99.9% agreement with the original author's style. Crucially, high-fidelity imitation does not imply human-like unpredictability: human essays average a perplexity of 29.5, whereas matched LLM outputs average only 15.2. These findings demonstrate that stylistic fidelity and statistical detectability are separable, establishing a reproducible basis for future work in authorship modeling, detection, and identity-conditioned generation.
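
Illustrative sketch (not the authors' code): the abstract describes comparing text pairs by distance over TF-IDF character n-grams and classifying them against empirical distance distributions. The snippet below shows one way such a pipeline could look using only the Python standard library. The transformer-embedding component is omitted, and the function names, the trigram size, the smoothed IDF formula, and the decision rule in `same_author` are all assumptions made for illustration; the abstract does not specify them.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Build sparse TF-IDF vectors (dicts) over character n-grams.

    Uses a smoothed IDF, log((1 + N) / (1 + df)) + 1; this weighting
    is an assumption, not taken from the paper.
    """
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    num_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1
        vectors.append({
            g: (k / total) * (math.log((1 + num_docs) / (1 + doc_freq[g])) + 1)
            for g, k in c.items()
        })
    return vectors


def cosine_distance(u, v):
    """Cosine distance between two sparse vectors; 1.0 if either is empty."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def same_author(distance, same_dists, diff_dists):
    """Hypothetical threshold-free rule based on empirical distributions.

    p_same: share of known same-author distances at least as large as
            the observed one (how plausible it is as a same-author pair).
    p_diff: share of known different-author distances at most as large
            as the observed one.
    """
    p_same = sum(d >= distance for d in same_dists) / len(same_dists)
    p_diff = sum(d <= distance for d in diff_dists) / len(diff_dists)
    return p_same >= p_diff
```

In use, `same_dists` and `diff_dists` would be distances collected from reference pairs with known authorship, and a new pair is labeled by whichever empirical distribution it fits better, so no threshold has to be tuned by hand.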

Related papers

StyleDecipher: Robust and Explainable Detection of LLM-Generated Texts with Stylistic Analysis [18.44456241158174]
StyleDecipher is a robust and explainable detection framework. It revisits text detection using combined feature extractors to quantify stylistic differences. It consistently achieves state-of-the-art in-domain accuracy.
arXiv Detail & Related papers (2025-10-14T15:07:27Z)
Catch Me If You Can? Not Yet: LLMs Still Struggle to Imitate the Implicit Writing Styles of Everyday Authors [9.921537507947473]
This work presents a comprehensive evaluation of large language models' ability to mimic personal writing styles. We introduce an ensemble of complementary metrics (including authorship attribution, authorship verification, style matching, and AI detection) to robustly assess style imitation. Results show that while LLMs can approximate user styles in structured formats like news and email, they struggle with nuanced, informal writing in blogs and forums.
arXiv Detail & Related papers (2025-09-18T02:18:49Z)
SCOPE: A Self-supervised Framework for Improving Faithfulness in Conditional Text Generation [55.61004653386632]
Large Language Models (LLMs) often produce hallucinations, i.e., information that is unfaithful or not grounded in the input context. This paper introduces a novel self-supervised method for generating a training set of unfaithful samples. We then refine the model using a training process that encourages the generation of grounded outputs over unfaithful ones.
arXiv Detail & Related papers (2025-02-19T12:31:58Z)
Boosting Semi-Supervised Scene Text Recognition via Viewing and Summarizing [71.29488677105127]
Existing scene text recognition (STR) methods struggle to recognize challenging texts, especially artistic and severely distorted characters. We propose a contrastive learning-based STR framework that leverages synthetic and real unlabeled data without any human cost. Our method achieves SOTA performance (94.7% and 70.9% average accuracy on common benchmarks and Union14M-Benchmark).
arXiv Detail & Related papers (2024-11-23T15:24:47Z)
A Bayesian Approach to Harnessing the Power of LLMs in Authorship Attribution [57.309390098903]
Authorship attribution aims to identify the origin or author of a document. Large Language Models (LLMs) with their deep reasoning capabilities and ability to maintain long-range textual associations offer a promising alternative. Our results on the IMDb and blog datasets show an impressive 85% accuracy in one-shot authorship classification across ten authors.
arXiv Detail & Related papers (2024-10-29T04:14:23Z)
Better Zero-Shot Reasoning with Role-Play Prompting [10.90357246745529]
Role-play prompting consistently surpasses the standard zero-shot approach across most datasets. This highlights its potential to augment the reasoning capabilities of large language models.
arXiv Detail & Related papers (2023-08-15T11:08:30Z)
Text Classification via Large Language Models [63.1874290788797]
We introduce Clue And Reasoning Prompting (CARP) to address complex linguistic phenomena involved in text classification. Remarkably, CARP yields new SOTA performances on 4 out of 5 widely-used text-classification benchmarks. More importantly, we find that CARP delivers impressive abilities on low-resource and domain-adaptation setups.
arXiv Detail & Related papers (2023-05-15T06:24:45Z)
PART: Pre-trained Authorship Representation Transformer [52.623051272843426]
Authors writing documents imprint identifying information within their texts. Previous works use hand-crafted features or classification tasks to train their authorship models. We propose a contrastively trained model fit to learn authorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z)
Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information and have proven useful for few-shot learning of language models. In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
arXiv Detail & Related papers (2021-10-04T08:51:36Z)
MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics [55.85042753772513]
We introduce a benchmark for training and evaluating generative reading comprehension metrics: MOdeling Correctness with Human Annotations (MOCHA). Using MOCHA, we train a Learned Evaluation metric for Reading Comprehension, LERC, to mimic human judgement scores. LERC outperforms baseline metrics by 10 to 36 absolute points on held-out annotations. When we evaluate on minimal pairs, LERC achieves 80% accuracy, outperforming baselines by 14 to 26 absolute percentage points while leaving significant room for improvement.
arXiv Detail & Related papers (2020-10-07T20:22:54Z)

This list is automatically generated from the titles and abstracts of the papers on this site.