Does a Large Language Model Really Speak in Human-Like Language?
- URL: http://arxiv.org/abs/2501.01273v1
- Date: Thu, 02 Jan 2025 14:13:44 GMT
- Title: Does a Large Language Model Really Speak in Human-Like Language?
- Authors: Mose Park, Yunjin Choi, Jong-June Jeon
- Abstract summary: Large Language Models (LLMs) have recently emerged, attracting considerable attention due to their ability to generate highly natural, human-like text.
This study compares the latent community structures of LLM-generated text and human-written text.
Our results indicate that GPT-generated text remains distinct from human-authored text.
- Score: 0.5735035463793009
- License:
- Abstract: Large Language Models (LLMs) have recently emerged, attracting considerable attention due to their ability to generate highly natural, human-like text. This study compares the latent community structures of LLM-generated text and human-written text within a hypothesis testing procedure. Specifically, we analyze three text sets: original human-written texts ($\mathcal{O}$), their LLM-paraphrased versions ($\mathcal{G}$), and a twice-paraphrased set ($\mathcal{S}$) derived from $\mathcal{G}$. Our analysis addresses two key questions: (1) Is the difference in latent community structures between $\mathcal{O}$ and $\mathcal{G}$ the same as that between $\mathcal{G}$ and $\mathcal{S}$? (2) Does $\mathcal{G}$ become more similar to $\mathcal{O}$ as the LLM parameter controlling text variability is adjusted? The first question is based on the assumption that if LLM-generated text truly resembles human language, then the gap between the pair ($\mathcal{O}$, $\mathcal{G}$) should be similar to that between the pair ($\mathcal{G}$, $\mathcal{S}$), as both pairs consist of an original text and its paraphrase. The second question examines whether the degree of similarity between LLM-generated and human text varies with changes in the breadth of text generation. To address these questions, we propose a statistical hypothesis testing framework that leverages the fact that each text has corresponding parts across all datasets due to their paraphrasing relationship. This relationship enables the mapping of one dataset's relative position to another, allowing two datasets to be mapped to a third dataset. As a result, both mapped datasets can be quantified with respect to the space characterized by the third dataset, facilitating a direct comparison between them. Our results indicate that GPT-generated text remains distinct from human-authored text.
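The mapping idea in the abstract can be illustrated with a minimal numerical sketch. The snippet below is not the authors' test: it assumes placeholder random vectors standing in for embeddings of the three aligned sets $\mathcal{O}$, $\mathcal{G}$, $\mathcal{S}$, and only shows how the paraphrase correspondence lets both $\mathcal{O}$ and $\mathcal{S}$ be expressed relative to $\mathcal{G}$ and then compared in that common space.

```python
import numpy as np

# Placeholder embeddings for the three aligned text sets: row i of O, G, and S
# corresponds to one original text, its paraphrase, and its second paraphrase.
rng = np.random.default_rng(0)
n, d = 200, 32
G = rng.normal(size=(n, d))                  # LLM paraphrases (reference set)
O = G + rng.normal(scale=0.8, size=(n, d))   # human-written originals
S = G + rng.normal(scale=0.5, size=(n, d))   # twice-paraphrased texts

def relative_position(X, ref):
    """Describe each row of X by its distances to every point of the reference
    set, i.e. map X into the space characterized by `ref`."""
    return np.linalg.norm(X[:, None, :] - ref[None, :, :], axis=-1)

O_in_G = relative_position(O, G)   # O mapped relative to G
S_in_G = relative_position(S, G)   # S mapped relative to G

# Crude summary statistic comparing the two mapped sets; the paper builds a
# formal hypothesis test on top of this kind of common representation.
print(abs(O_in_G.mean() - S_in_G.mean()))
```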
Related papers
- The Magnitude of Categories of Texts Enriched by Language Models [1.8416014644193064]
We use the next-token probabilities given by a language model to define a $[0,1]$-enrichment of a category of texts in natural language.
We compute the Möbius function and the magnitude of an associated generalized space $\mathcal{M}$ of texts.
arXiv Detail & Related papers (2025-01-11T23:28:50Z) - Zero-Shot Statistical Tests for LLM-Generated Text Detection using Finite Sample Concentration Inequalities [13.657259851747126]
We show that type I and type II errors for our tests decrease exponentially in the text length.
We show that if the string is generated by $A$, the log-perplexity of the string under $A$ converges to the average entropy of the string under $A$, except with an exponentially small probability in string length.
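A toy check of this concentration claim, under the simplifying assumption that the generator $A$ is an i.i.d. categorical source (the paper treats general LLMs): the per-token log-perplexity of a string sampled from $A$ approaches the entropy of $A$ as the string grows.

```python
import numpy as np

# Illustrative i.i.d. source only; a real test would score a string under an LLM.
rng = np.random.default_rng(1)
p = np.array([0.5, 0.3, 0.2])          # next-token distribution of "model A"
entropy = float(-(p * np.log(p)).sum())

for length in (100, 10_000, 1_000_000):
    tokens = rng.choice(len(p), size=length, p=p)
    log_perplexity = float(-np.log(p[tokens]).mean())   # average NLL per token
    print(f"n={length:>9}  log-perplexity={log_perplexity:.4f}  entropy={entropy:.4f}")
```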
arXiv Detail & Related papers (2025-01-04T23:51:43Z) - Reasoning to Attend: Try to Understand How <SEG> Token Works [44.33848900059659]
We show that the $\texttt{<SEG>}$ token contributes to semantic similarity within image-text pairs.
We present READ, which facilitates LMMs' resilient $\textbf{REA}$soning capability of where to atten$\textbf{D}$ under the guidance of highly activated points.
arXiv Detail & Related papers (2024-12-23T17:44:05Z) - Reasoning Robustness of LLMs to Adversarial Typographical Errors [49.99118660264703]
Large Language Models (LLMs) have demonstrated impressive capabilities in reasoning using Chain-of-Thought (CoT) prompting.
We study the reasoning robustness of LLMs to typographical errors, which can naturally occur in users' queries.
We design an Adversarial Typo Attack ($\texttt{ATA}$) algorithm that iteratively samples typos for words that are important to the query and selects the edit that is most likely to succeed in attacking.
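A greatly simplified, hypothetical version of such an iterative typo attack is sketched below; `model_score` is a dummy stand-in for querying the victim model, and the greedy loop keeps whichever single adjacent-character-swap typo lowers the score most at each step.

```python
import random

VOCAB = {"what", "is", "the", "capital", "of", "france"}

def model_score(text: str) -> float:
    """Dummy stand-in for the victim model's confidence in the correct answer.
    A real attack would query the LLM; this score just rewards in-vocabulary
    words, so typos visibly lower it."""
    words = [w.strip("?,.").lower() for w in text.split()]
    return sum(w in VOCAB for w in words) / max(len(words), 1)

def typo_variants(word: str, k: int = 5):
    """Sample k simple typos by swapping two adjacent characters."""
    variants = []
    for _ in range(k):
        if len(word) < 2:
            continue
        i = random.randrange(len(word) - 1)
        chars = list(word)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        variants.append("".join(chars))
    return variants

def typo_attack(query: str, budget: int = 2) -> str:
    """Greedy sketch: at each step, try typo variants of every word and keep
    the single edit that lowers the model score the most."""
    words = query.split()
    for _ in range(budget):
        best_score, best_edit = model_score(" ".join(words)), None
        for idx, word in enumerate(words):
            for variant in typo_variants(word):
                candidate = words[:idx] + [variant] + words[idx + 1:]
                score = model_score(" ".join(candidate))
                if score < best_score:
                    best_score, best_edit = score, (idx, variant)
        if best_edit is None:
            break
        words[best_edit[0]] = best_edit[1]
    return " ".join(words)

random.seed(0)
print(typo_attack("What is the capital of France?"))
```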
arXiv Detail & Related papers (2024-11-08T05:54:05Z) - Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG [57.14250086701313]
We investigate the extent to which modern LMs generate $n$-grams from their training data.
We develop Rusty-DAWG, a novel search tool inspired by indexing of genomic data.
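To illustrate the underlying novelty measurement, a plain hash set of training $n$-grams can stand in for the Rusty-DAWG index, which answers the same membership queries efficiently against corpus-scale data; the sketch below uses hypothetical toy token lists.

```python
def ngram_set(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novelty(generated, training, n):
    """Fraction of n-grams in the generated text that never occur in the
    training corpus."""
    seen = ngram_set(training, n)
    grams = [tuple(generated[i:i + n]) for i in range(len(generated) - n + 1)]
    return sum(g not in seen for g in grams) / len(grams) if grams else 0.0

# Hypothetical toy corpora.
training_tokens = "the cat sat on the mat and the dog sat".split()
generated_tokens = "the cat sat on the rug".split()
for n in (2, 3, 4):
    print(n, round(novelty(generated_tokens, training_tokens, n), 3))
```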
arXiv Detail & Related papers (2024-06-18T21:31:19Z) - Creating an AI Observer: Generative Semantic Workspaces [4.031100721019478]
We introduce the $\textbf{[G]}$enerative $\textbf{[S]}$emantic $\textbf{[W]}$orkspace (GSW).
GSW creates a generative-style Semantic framework, as opposed to a traditionally predefined set of lexicon labels.
arXiv Detail & Related papers (2024-06-07T00:09:13Z) - Transformer In-Context Learning for Categorical Data [51.23121284812406]
We extend research on understanding Transformers through the lens of in-context learning with functional data by considering categorical outcomes, nonlinear underlying models, and nonlinear attention.
We present what is believed to be the first real-world demonstration of this few-shot-learning methodology, using the ImageNet dataset.
arXiv Detail & Related papers (2024-05-27T15:03:21Z) - Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens [138.36729703589512]
We show that $n$-gram language models are still relevant in this era of neural large language models (LLMs).
We do this by modernizing $n$-gram LMs in two aspects. First, we train them at the same data scale as neural LLMs -- 5 trillion tokens.
Second, existing $n$-gram LMs use a small $n$, which hinders their performance; we instead allow $n$ to be arbitrarily large by introducing a new $\infty$-gram LM with backoff.
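The backoff rule can be sketched in a few lines: find the longest suffix of the context that occurs in the training tokens and return the empirical next-token distribution after it. The linear scan below is only an illustrative toy over a hypothetical token list; the actual system relies on a suffix-array index over trillions of tokens.

```python
from collections import Counter

def infty_gram_next(train_tokens, context):
    """Toy infinity-gram backoff: use the longest suffix of `context` that
    occurs in the training tokens, and return the empirical distribution of
    the token that follows it."""
    for start in range(len(context)):          # longest suffix first
        suffix = context[start:]
        counts = Counter(
            train_tokens[i + len(suffix)]
            for i in range(len(train_tokens) - len(suffix))
            if train_tokens[i:i + len(suffix)] == suffix
        )
        if counts:
            total = sum(counts.values())
            return {tok: c / total for tok, c in counts.items()}, len(suffix)
    return {}, 0

train = "a b c a b d a b c".split()
dist, used_n = infty_gram_next(train, "x a b".split())
print(used_n, dist)   # backs off to the suffix "a b", which occurs in training
```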
arXiv Detail & Related papers (2024-01-30T19:03:49Z) - $\textit{Swap and Predict}$ -- Predicting the Semantic Changes in Words across Corpora by Context Swapping [36.10628959436778]
We consider the problem of predicting whether a given target word, $w$, changes its meaning between two different text corpora.
We propose an unsupervised method that randomly swaps contexts between the two corpora $\mathcal{C}_1$ and $\mathcal{C}_2$.
Our method achieves significant performance improvements compared to strong baselines for the English semantic change prediction task.
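A rough, hypothetical illustration of context swapping: represent the target word in each corpus by a bag-of-context-words distribution (a simplified stand-in for the contextualised representations used in the paper), swap a random fraction of its contexts between the corpora, and measure how much the cross-corpus distance moves.

```python
import random
from collections import Counter

def bow_repr(contexts):
    """Bag-of-words representation of a target word from its contexts."""
    c = Counter(w for ctx in contexts for w in ctx)
    total = sum(c.values())
    return {w: n / total for w, n in c.items()}

def distance(p, q):
    return sum(abs(p.get(k, 0) - q.get(k, 0)) for k in set(p) | set(q))

def swap_and_compare(ctx1, ctx2, swap_frac=0.5, seed=0):
    """Swap a random fraction of the target word's contexts between the two
    corpora and report the cross-corpus distance before and after; a large
    change under swapping suggests the word is used differently in the two."""
    rng = random.Random(seed)
    before = distance(bow_repr(ctx1), bow_repr(ctx2))
    a, b = list(ctx1), list(ctx2)
    k = int(min(len(a), len(b)) * swap_frac)
    for i, j in zip(rng.sample(range(len(a)), k), rng.sample(range(len(b)), k)):
        a[i], b[j] = b[j], a[i]
    after = distance(bow_repr(a), bow_repr(b))
    return before, after

# Contexts of a hypothetical target word "cell" in two corpora.
c1 = [["prison", "block"], ["jail", "guard"], ["prison", "door"]]
c2 = [["biology", "membrane"], ["stem", "research"], ["membrane", "protein"]]
print(swap_and_compare(c1, c2))
```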
arXiv Detail & Related papers (2023-10-16T13:39:44Z) - Supervised Training of Conditional Monge Maps [107.78770597815242]
Optimal transport (OT) theory describes general principles to define and select, among many possible choices, the most efficient way to map a probability measure onto another.
We introduce CondOT, a multi-task approach to estimate a family of OT maps conditioned on a context variable.
We demonstrate the ability of CondOT to infer the effect of an arbitrary combination of genetic or therapeutic perturbations on single cells.
arXiv Detail & Related papers (2022-06-28T19:34:44Z) - Taking a hint: How to leverage loss predictors in contextual bandits? [63.546913998407405]
We study learning in contextual bandits with the help of loss predictors.
We show that the optimal regret is $\mathcal{O}(\min\{\sqrt{T}, \sqrt{\mathcal{E}}\,T^{\frac{1}{3}}\})$ when $\mathcal{E}$ is known.
arXiv Detail & Related papers (2020-03-04T07:36:38Z)