Does a Large Language Model Really Speak in Human-Like Language?
- URL: http://arxiv.org/abs/2501.01273v1
- Date: Thu, 02 Jan 2025 14:13:44 GMT
- Title: Does a Large Language Model Really Speak in Human-Like Language?
- Authors: Mose Park, Yunjin Choi, Jong-June Jeon,
- Abstract summary: Large Language Models (LLMs) have recently emerged, attracting considerable attention due to their ability to generate highly natural, human-like text. This study compares the latent community structures of LLM-generated text and human-written text. Our results indicate that GPT-generated text remains distinct from human-authored text.
- Score: 0.5735035463793009
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have recently emerged, attracting considerable attention due to their ability to generate highly natural, human-like text. This study compares the latent community structures of LLM-generated text and human-written text within a hypothesis testing procedure. Specifically, we analyze three text sets: original human-written texts ($\mathcal{O}$), their LLM-paraphrased versions ($\mathcal{G}$), and a twice-paraphrased set ($\mathcal{S}$) derived from $\mathcal{G}$. Our analysis addresses two key questions: (1) Is the difference in latent community structures between $\mathcal{O}$ and $\mathcal{G}$ the same as that between $\mathcal{G}$ and $\mathcal{S}$? (2) Does $\mathcal{G}$ become more similar to $\mathcal{O}$ as the LLM parameter controlling text variability is adjusted? The first question is based on the assumption that if LLM-generated text truly resembles human language, then the gap between the pair ($\mathcal{O}$, $\mathcal{G}$) should be similar to that between the pair ($\mathcal{G}$, $\mathcal{S}$), as both pairs consist of an original text and its paraphrase. The second question examines whether the degree of similarity between LLM-generated and human text varies with changes in the breadth of text generation. To address these questions, we propose a statistical hypothesis testing framework that leverages the fact that each text has corresponding parts across all datasets due to their paraphrasing relationship. This relationship enables the mapping of one dataset's relative position to another, allowing two datasets to be mapped to a third dataset. As a result, both mapped datasets can be quantified with respect to the space characterized by the third dataset, facilitating a direct comparison between them. Our results indicate that GPT-generated text remains distinct from human-authored text.
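To make the mapping idea in the abstract concrete, here is a minimal sketch of the comparison logic, not the paper's actual test statistic: three row-aligned embedding matrices stand in for $\mathcal{O}$, $\mathcal{G}$, and $\mathcal{S}$, each of $\mathcal{O}$ and $\mathcal{S}$ is quantified by its distances to $\mathcal{G}$ (the "space characterized by the third dataset"), and a simple permutation test asks whether the two mapped datasets differ. The synthetic data, the distance-based mapping, and the mean-gap statistic are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for sentence embeddings of the three aligned corpora:
# O (human originals), G (LLM paraphrases of O), S (LLM paraphrases of G).
# Row i of each matrix corresponds to the same underlying text.
n, d = 200, 32
O = rng.normal(size=(n, d))
G = O + 0.3 * rng.normal(size=(n, d))   # paraphrase of O (toy model)
S = G + 0.3 * rng.normal(size=(n, d))   # paraphrase of G (toy model)

def map_into(reference, X):
    """Represent each row of X by its distances to every row of `reference`,
    i.e. quantify X with respect to the space characterized by `reference`."""
    diffs = X[:, None, :] - reference[None, :, :]
    return np.linalg.norm(diffs, axis=-1)   # shape (n, n)

# Map O and S into the space characterized by G, then compare the two maps.
O_in_G = map_into(G, O)
S_in_G = map_into(G, S)
observed = np.abs(O_in_G.mean() - S_in_G.mean())

# Simple permutation test: under the null that O and S sit in the same
# relation to G, swapping the two mapped rows for any text should not matter.
perm_stats = []
for _ in range(1000):
    swap = rng.random(n) < 0.5
    A = np.where(swap[:, None], S_in_G, O_in_G)
    B = np.where(swap[:, None], O_in_G, S_in_G)
    perm_stats.append(np.abs(A.mean() - B.mean()))
p_value = np.mean(np.array(perm_stats) >= observed)
print(f"observed gap = {observed:.4f}, permutation p-value = {p_value:.3f}")
```

The same comparison could be repeated for the pair ($\mathcal{O}$, $\mathcal{G}$) versus ($\mathcal{G}$, $\mathcal{S}$) to address the paper's first question; the embeddings, mapping, and statistic would need to be replaced by the paper's own choices.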
Related papers
- QUDsim: Quantifying Discourse Similarities in LLM-Generated Text [70.22275200293964]
We introduce an abstraction based on linguistic theories in Questions Under Discussion (QUD) and question semantics to help quantify differences in discourse progression.
We then use this framework to build $\textbf{QUDsim}$, a similarity metric that can detect discursive parallels between documents.
Using QUDsim, we find that LLMs often reuse discourse structures (more so than humans) across samples, even when content differs.
arXiv Detail & Related papers (2025-04-12T23:46:09Z) - Language Models May Verbatim Complete Text They Were Not Explicitly Trained On [97.3414396208613]
We show that an $n$-gram based membership definition can be effectively gamed.
We show that it is difficult to find a single viable choice of $n$ for membership definitions.
Our findings highlight the inadequacy of $n$-gram membership, suggesting membership definitions fail to account for auxiliary information.
arXiv Detail & Related papers (2025-03-21T19:57:04Z) - The Magnitude of Categories of Texts Enriched by Language Models [1.8416014644193064]
We use the next-token probabilities given by a language model to define a $[0,1]$-enrichment of a category of texts in natural language.
We compute the M"obius function and the magnitude of an associated generalized space $mathcalM$ of texts.
arXiv Detail & Related papers (2025-01-11T23:28:50Z) - Zero-Shot Statistical Tests for LLM-Generated Text Detection using Finite Sample Concentration Inequalities [13.657259851747126]
Verifying the provenance of content is crucial to the function of many organizations, e.g., educational institutions, social media platforms, firms, etc.
This problem is becoming increasingly challenging as text generated by Large Language Models (LLMs) becomes almost indistinguishable from human-generated content.
We show that our tests' type I and type II errors decrease exponentially as text length increases.
Practically, our work enables the origin of harmful or false LLM-generated text to be identified with guarantees, which is useful for combating misinformation and for complying with emerging AI regulations.
arXiv Detail & Related papers (2025-01-04T23:51:43Z) - Reasoning to Attend: Try to Understand How <SEG> Token Works [44.33848900059659]
We show that the $\texttt{<SEG>}$ token contributes to semantic similarity within image-text pairs. We present READ, which facilitates LMMs' resilient $\textbf{REA}$soning capability of where to atten$\textbf{D}$ under the guidance of highly activated points.
arXiv Detail & Related papers (2024-12-23T17:44:05Z) - Federated UCBVI: Communication-Efficient Federated Regret Minimization with Heterogeneous Agents [13.391318494060975]
We present the Federated Upper Confidence Bound Value Iteration algorithm ($\texttt{Fed-UCBVI}$).
We prove that the regret of $\texttt{Fed-UCBVI}$ scales as $\tilde{\mathcal{O}}(\sqrt{H^3 |\mathcal{S}| |\mathcal{A}| T / M})$.
We show that, unlike existing federated reinforcement learning approaches, $\texttt{Fed-UCBVI}$'s communication complexity only marginally increases with the number of agents.
arXiv Detail & Related papers (2024-10-30T11:05:50Z) - Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG [57.14250086701313]
We investigate the extent to which modern LMs generate $n$-grams from their training data.
We develop Rusty-DAWG, a novel search tool inspired by indexing of genomic data.
arXiv Detail & Related papers (2024-06-18T21:31:19Z) - Creating an AI Observer: Generative Semantic Workspaces [4.031100721019478]
We introduce the $\textbf{G}$enerative $\textbf{S}$emantic $\textbf{W}$orkspace (GSW).
GSW creates a generative-style Semantic framework, as opposed to a traditionally predefined set of lexicon labels.
arXiv Detail & Related papers (2024-06-07T00:09:13Z) - Transformer In-Context Learning for Categorical Data [51.23121284812406]
We extend research on understanding Transformers through the lens of in-context learning with functional data by considering categorical outcomes, nonlinear underlying models, and nonlinear attention.
We present what is believed to be the first real-world demonstration of this few-shot-learning methodology, using the ImageNet dataset.
arXiv Detail & Related papers (2024-05-27T15:03:21Z) - Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens [138.36729703589512]
We show that $n$-gram language models remain relevant in this era of neural large language models (LLMs).
We do so by modernizing $n$-gram LMs in two aspects. First, we train them at the same data scale as neural LLMs -- 5 trillion tokens.
Second, existing $n$-gram LMs use small $n$ which hinders their performance; we instead allow $n$ to be arbitrarily large, by introducing a new $\infty$-gram LM with backoff.
arXiv Detail & Related papers (2024-01-30T19:03:49Z) - $\textit{Swap and Predict}$ -- Predicting the Semantic Changes in Words
across Corpora by Context Swapping [36.10628959436778]
We consider the problem of predicting whether a given target word, $w$, changes its meaning between two different text corpora.
We propose an unsupervised method that randomly swaps contexts between the two corpora, $\mathcal{C}_1$ and $\mathcal{C}_2$.
Our method achieves significant performance improvements compared to strong baselines for the English semantic change prediction task.
arXiv Detail & Related papers (2023-10-16T13:39:44Z) - Understanding Dataset Difficulty with $\mathcal{V}$-Usable Information [67.25713071340518]
Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans.
We frame dataset difficulty as the lack of $\mathcal{V}$-$\textit{usable information}$.
We also introduce $\textit{pointwise $\mathcal{V}$-information}$ (PVI) for measuring the difficulty of individual instances.
arXiv Detail & Related papers (2021-10-16T00:21:42Z)