Related papers: Linguistic Characteristics of AI-Generated Text: A Survey

URL: http://arxiv.org/abs/2510.05136v1
Date: Wed, 01 Oct 2025 05:44:28 GMT
Title: Linguistic Characteristics of AI-Generated Text: A Survey
Authors: Luka Terčon, Kaja Dobrovoljc
Abstract summary: Large language models (LLMs) are solidifying their position in the modern world as effective tools for the automatic generation of text. There is a growing need to study the linguistic features present in AI-generated text.
Score: 0.3007949058551534
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large language models (LLMs) are solidifying their position in the modern world as effective tools for the automatic generation of text. Their use is quickly becoming commonplace in fields such as education, healthcare, and scientific research. There is a growing need to study the linguistic features present in AI-generated text, as the increasing presence of such texts has profound implications in various disciplines such as corpus linguistics, computational linguistics, and natural language processing. Many observations have already been made, however a broader synthesis of the findings made so far is required to provide a better understanding of the topic. The present survey paper aims to provide such a synthesis of extant research. We categorize the existing works along several dimensions, including the levels of linguistic description, the models included, the genres analyzed, the languages analyzed, and the approach to prompting. Additionally, the same scheme is used to present the findings made so far and expose the current trends followed by researchers. Among the most-often reported findings is the observation that AI-generated text is more likely to contain a more formal and impersonal style, signaled by the increased presence of nouns, determiners, and adpositions and the lower reliance on adjectives and adverbs. AI-generated text is also more likely to feature a lower lexical diversity, a smaller vocabulary size, and repetitive text. Current research, however, remains heavily concentrated on English data and mostly on text generated by the GPT model family, highlighting the need for broader cross-linguistic and cross-model investigation. In most cases authors also fail to address the issue of prompt sensitivity, leaving much room for future studies that employ multiple prompt wordings in the text generation phase.

Related papers

Beyond checkmate: exploring the creative chokepoints in AI text [9.65404451340112]
We study the nuanced distinctions between human and AI texts across text segments (introduction, body, and conclusion). Our findings provide fresh insights into human-AI text differences and pave the way for more effective and interpretable detection strategies.
arXiv Detail & Related papers (2025-01-31T16:57:01Z)
Analysis of Plan-based Retrieval for Grounded Text Generation [78.89478272104739]
Hallucinations occur when a language model is given a generation task outside its parametric knowledge. A common strategy to address this limitation is to infuse the language models with retrieval mechanisms. We analyze how planning can be used to guide retrieval to further reduce the frequency of hallucinations.
arXiv Detail & Related papers (2024-08-20T02:19:35Z)
Differentiating between human-written and AI-generated texts using linguistic features automatically extracted from an online computational tool [0.0]
This study aims to investigate how various linguistic components are represented in both types of texts, assessing the ability of AI to emulate human writing. Despite AI-generated texts appearing to mimic human speech, the results revealed significant differences across multiple linguistic features.
arXiv Detail & Related papers (2024-07-04T05:37:09Z)
Deep dive into language traits of AI-generated Abstracts [5.209583971923267]
Generative language models, such as ChatGPT, have garnered attention for their ability to generate human-like writing. In this work, we attempt to detect the Abstracts generated by ChatGPT, which are much shorter in length and bounded. We extract the texts' semantic and lexical properties and observe that traditional machine learning models can confidently detect these Abstracts.
arXiv Detail & Related papers (2023-12-17T06:03:33Z)
Towards Possibilities & Impossibilities of AI-generated Text Detection: A Survey [97.33926242130732]
Large Language Models (LLMs) have revolutionized the domain of natural language processing (NLP) with remarkable capabilities of generating human-like text responses. Despite these advancements, several works in the existing literature have raised serious concerns about the potential misuse of LLMs. To address these concerns, a consensus among the research community is to develop algorithmic solutions to detect AI-generated text.
arXiv Detail & Related papers (2023-10-23T18:11:32Z)
Automatic and Human-AI Interactive Text Generation [27.05024520190722]
This tutorial aims to provide an overview of the state-of-the-art natural language generation research. Text-to-text generation tasks are more constrained in terms of semantic consistency and targeted language styles.
arXiv Detail & Related papers (2023-10-05T20:26:15Z)
The Imitation Game: Detecting Human and AI-Generated Texts in the Era of ChatGPT and BARD [3.2228025627337864]
We introduce a novel dataset of human-written and AI-generated texts in different genres. We employ several machine learning models to classify the texts. Results demonstrate the efficacy of these models in discerning between human and AI-generated text.
arXiv Detail & Related papers (2023-07-22T21:00:14Z)
Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing such interactions. This suggests LMs may potentially serve as more useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z)
An Inclusive Notion of Text [69.36678873492373]
We argue that clarity on the notion of text is crucial for reproducible and generalizable NLP. We introduce a two-tier taxonomy of linguistic and non-linguistic elements that are available in textual sources and can be used in NLP modeling.
arXiv Detail & Related papers (2022-11-10T14:26:43Z)
Survey of Hallucination in Natural Language Generation [69.9926849848132]
Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies. Deep learning based generation is prone to hallucinate unintended text, which degrades the system performance. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.
arXiv Detail & Related papers (2022-02-08T03:55:01Z)
A Survey on Retrieval-Augmented Text Generation [53.04991859796971]
Retrieval-augmented text generation has remarkable advantages and has achieved state-of-the-art performance in many NLP tasks. This survey first highlights the generic paradigm of retrieval-augmented generation, and then reviews notable approaches according to different tasks.
arXiv Detail & Related papers (2022-02-02T16:18:41Z)
How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN [63.79300884115027]
Current language models can generate high-quality text. Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions? We introduce RAVEN, a suite of analyses for assessing the novelty of generated text.
arXiv Detail & Related papers (2021-11-18T04:07:09Z)
Positioning yourself in the maze of Neural Text Generation: A Task-Agnostic Survey [54.34370423151014]
This paper surveys the components of modeling approaches relaying task impacts across various generation tasks such as storytelling, summarization, translation etc. We present an abstraction of the imperative techniques with respect to learning paradigms, pretraining, modeling approaches, decoding and the key challenges outstanding in the field in each of them.
arXiv Detail & Related papers (2020-10-14T17:54:42Z)

This list is automatically generated from the titles and abstracts of the papers on this site.