Playing with Words: Comparing the Vocabulary and Lexical Richness of
ChatGPT and Humans
- URL: http://arxiv.org/abs/2308.07462v2
- Date: Thu, 31 Aug 2023 11:09:16 GMT
- Title: Playing with Words: Comparing the Vocabulary and Lexical Richness of
ChatGPT and Humans
- Authors: Pedro Reviriego, Javier Conde, Elena Merino-Gómez, Gonzalo
Martínez, José Alberto Hernández
- Abstract summary: Generative language models such as ChatGPT have triggered a revolution that can transform how text is generated.
Will the use of tools such as ChatGPT increase or reduce the vocabulary used or the lexical richness?
This has implications for words, as those not included in AI-generated content will tend to be less and less popular and may eventually be lost.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The introduction of Artificial Intelligence (AI) generative language models
such as GPT (Generative Pre-trained Transformer) and tools such as ChatGPT has
triggered a revolution that can transform how text is generated. This has many
implications, for example, as AI-generated text becomes a significant fraction
of the text, would this have an effect on the language capabilities of readers
and also on the training of newer AI tools? Would it affect the evolution of
languages? Focusing on one specific aspect of the language: words; will the use
of tools such as ChatGPT increase or reduce the vocabulary used or the lexical
richness? This has implications for words, as those not included in
AI-generated content will tend to be less and less popular and may eventually
be lost. In this work, we perform an initial comparison of the vocabulary and
lexical richness of ChatGPT and humans when performing the same tasks.
Specifically, we use two datasets containing answers given by ChatGPT and
humans to different types of questions, and a third dataset in which ChatGPT
paraphrases sentences and questions. The analysis shows that ChatGPT
tends to use fewer distinct words and lower lexical richness than humans. These
results are very preliminary and additional datasets and ChatGPT configurations
have to be evaluated to extract more general conclusions. Therefore, further
research is needed to understand how the use of ChatGPT and more broadly
generative AI tools will affect the vocabulary and lexical richness in
different types of text and languages.
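The comparison described in the abstract rests on simple vocabulary counts: the number of distinct words (types) relative to the total number of words (tokens), known as the type-token ratio. A minimal sketch of such a measurement follows; the whitespace-and-apostrophe tokenizer and the sample sentences are illustrative assumptions, not the paper's actual datasets or metrics.

```python
import re

def lexical_stats(text):
    """Tokenize a text and report simple lexical-richness measures."""
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    types = set(tokens)  # distinct words
    ttr = len(types) / len(tokens) if tokens else 0.0  # type-token ratio
    return {"tokens": len(tokens), "types": len(types), "ttr": ttr}

# Hypothetical sample texts: a more varied one and a more repetitive one.
human = "The quick brown fox jumps over the lazy dog near the quiet river bank"
model = "The fox jumps over the dog and the fox jumps over the dog again"

print(lexical_stats(human))  # higher type count and TTR
print(lexical_stats(model))  # lower type count and TTR
```

Note that the raw type-token ratio is sensitive to text length, which is why comparisons of this kind are typically made over samples of equal size.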
Related papers
- Towards Human Understanding of Paraphrase Types in ChatGPT [7.662751948664846]
Atomic paraphrase types (APT) decompose paraphrases into different linguistic changes.
We introduce APTY (Atomic Paraphrase TYpes), a dataset of 500 sentence-level and word-level annotations by 15 annotators.
Our results reveal that ChatGPT can generate simple APTs but struggles with complex structures.
arXiv Detail & Related papers (2024-07-02T14:35:10Z)
- Primacy Effect of ChatGPT [69.49920102917598]
We study the primacy effect of ChatGPT: the tendency of selecting the labels at earlier positions as the answer.
We hope that our experiments and analyses provide additional insights into building more reliable ChatGPT-based solutions.
arXiv Detail & Related papers (2023-10-20T00:37:28Z)
- How many words does ChatGPT know? The answer is ChatWords [5.906689377130112]
Evaluating the performance of ChatGPT and similar AI tools is a complex issue that is being explored from different perspectives.
We contribute to those efforts with ChatWords, an automated test system to evaluate ChatGPT knowledge of an arbitrary set of words.
Results show that ChatGPT recognizes only approximately 80% of the words in the dictionary and 90% of the words in the Quixote, in some cases with an incorrect meaning.
arXiv Detail & Related papers (2023-07-21T06:38:37Z)
- Is ChatGPT Involved in Texts? Measure the Polish Ratio to Detect ChatGPT-Generated Text [48.36706154871577]
We introduce a novel dataset termed HPPT (ChatGPT-polished academic abstracts).
It diverges from extant corpora by comprising pairs of human-written and ChatGPT-polished abstracts instead of purely ChatGPT-generated texts.
We also propose the "Polish Ratio" method, an innovative measure of the degree of modification made by ChatGPT compared to the original human-written text.
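The idea of quantifying how much ChatGPT modified a human-written text can be illustrated with a normalized edit distance; this is a generic sketch, and the actual "Polish Ratio" defined in that paper may be computed differently.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            # insertion, deletion, or substitution (free if characters match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def modification_ratio(original, polished):
    """Edit distance normalized by the longer string, so the result is in [0, 1]."""
    if not original and not polished:
        return 0.0
    return levenshtein(original, polished) / max(len(original), len(polished))
```

A ratio near 0 indicates light polishing; a ratio near 1 indicates the polished text shares little surface form with the original.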
arXiv Detail & Related papers (2023-06-18T12:20:42Z)
- Leveraging ChatGPT As Text Annotation Tool For Sentiment Analysis [6.596002578395151]
ChatGPT is a new product of OpenAI and has emerged as the most popular AI product.
This study explores the use of ChatGPT as a tool for data labeling for different sentiment analysis tasks.
arXiv Detail & Related papers (2023-05-15T07:14:41Z)
- Uncovering the Potential of ChatGPT for Discourse Analysis in Dialogue: An Empirical Study [51.079100495163736]
This paper systematically inspects ChatGPT's performance in two discourse analysis tasks: topic segmentation and discourse parsing.
ChatGPT demonstrates proficiency in identifying topic structures in general-domain conversations yet struggles considerably in specific-domain conversations.
Our deeper investigation indicates that ChatGPT can give more reasonable topic structures than human annotations but only linearly parses the hierarchical rhetorical structures.
arXiv Detail & Related papers (2023-04-23T07:38:07Z)
- Differentiate ChatGPT-generated and Human-written Medical Texts [8.53416950968806]
This research is among the first studies on responsible and ethical AIGC (Artificial Intelligence Generated Content) in medicine.
We focus on analyzing the differences between medical texts written by human experts and those generated by ChatGPT.
In the next step, we analyze the linguistic features of these two types of content and uncover differences in vocabulary, part-of-speech, dependency, sentiment, perplexity, etc.
arXiv Detail & Related papers (2023-04-12T05:08:52Z)
- ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning [70.57126720079971]
Large language models (LLMs) have emerged as among the most important breakthroughs in natural language processing (NLP).
This paper evaluates ChatGPT on 7 different tasks, covering 37 diverse languages with high, medium, low, and extremely low resources.
Our extensive experimental results demonstrate that ChatGPT performs worse than previous models across different NLP tasks and languages.
arXiv Detail & Related papers (2023-04-12T05:08:52Z)
- To ChatGPT, or not to ChatGPT: That is the question! [78.407861566006]
This study provides a comprehensive and contemporary assessment of the most recent techniques in ChatGPT detection.
We have curated a benchmark dataset consisting of prompts from ChatGPT and humans, including diverse questions from medical, open Q&A, and finance domains.
Our evaluation results demonstrate that none of the existing methods can effectively detect ChatGPT-generated content.
arXiv Detail & Related papers (2023-04-04T03:04:28Z)
- Is ChatGPT A Good Keyphrase Generator? A Preliminary Study [51.863368917344864]
ChatGPT has recently garnered significant attention from the computational linguistics community.
We evaluate its performance in various aspects, including keyphrase generation prompts, keyphrase generation diversity, and long document understanding.
We find that ChatGPT performs exceptionally well on all six candidate prompts, with minor performance differences observed across the datasets.
arXiv Detail & Related papers (2023-03-23T02:50:38Z)
- ChatGPT or Human? Detect and Explain. Explaining Decisions of Machine Learning Model for Detecting Short ChatGPT-generated Text [2.0378492681344493]
We study whether a machine learning model can be effectively trained to accurately distinguish between original human and seemingly human (that is, ChatGPT-generated) text.
We employ an explainable artificial intelligence framework to gain insight into the reasoning behind the model trained to differentiate between ChatGPT-generated and human-generated text.
Our study focuses on short online reviews, conducting two experiments comparing human-generated and ChatGPT-generated text.
arXiv Detail & Related papers (2023-01-30T08:06:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.