Delving into ChatGPT usage in academic writing through excess vocabulary
- URL: http://arxiv.org/abs/2406.07016v4
- Date: Wed, 19 Feb 2025 22:15:24 GMT
- Title: Delving into ChatGPT usage in academic writing through excess vocabulary
- Authors: Dmitry Kobak, Rita González-Márquez, Emőke-Ágnes Horvát, Jan Lause
- Abstract summary: Large language models (LLMs) like ChatGPT can generate and revise text with human-level performance. We study vocabulary changes in 14 million PubMed abstracts from 2010--2024, and show how the appearance of LLMs led to an abrupt increase in the frequency of certain style words. We show that LLMs have had an unprecedented impact on the scientific literature, surpassing the effect of major world events such as the Covid pandemic.
- Score: 4.58733012283457
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models (LLMs) like ChatGPT can generate and revise text with human-level performance. These models come with clear limitations: they can produce inaccurate information, reinforce existing biases, and be easily misused. Yet, many scientists use them for their scholarly writing. But how widespread is such LLM usage in the academic literature? To answer this question, we present an unbiased, large-scale approach: we study vocabulary changes in 14 million PubMed abstracts from 2010--2024, and show how the appearance of LLMs led to an abrupt increase in the frequency of certain style words. This excess word analysis suggests that at least 10% of 2024 abstracts were processed with LLMs. This lower bound differed across disciplines, countries, and journals, reaching 30% for some sub-corpora. We show that LLMs have had an unprecedented impact on the scientific literature, surpassing the effect of major world events such as the Covid pandemic.
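The excess-word analysis mirrors excess-mortality studies: compare a word's observed frequency in 2024 with a counterfactual frequency projected from pre-LLM years, and flag words where both the gap and the ratio are large. Below is a minimal Python sketch of that idea; the per-year input structure and the simple two-point linear extrapolation are illustrative assumptions, not the paper's exact counterfactual procedure.

```python
def yearly_frequency(abstracts_by_year, word):
    """Fraction of abstracts per year containing `word` at least once.

    `abstracts_by_year` (hypothetical input) maps a year to a list of
    lowercased abstract strings.
    """
    return {
        year: sum(word in a.split() for a in abstracts) / len(abstracts)
        for year, abstracts in abstracts_by_year.items()
    }

def excess_usage(freq, target_year=2024, baseline_years=(2021, 2022)):
    """Excess frequency gap (p - q) and ratio (p / q) for one word.

    q is a counterfactual frequency for `target_year`, extrapolated
    linearly from two pre-LLM baseline years; p is the observed value.
    """
    y1, y2 = baseline_years
    slope = (freq[y2] - freq[y1]) / (y2 - y1)
    q = max(freq[y2] + slope * (target_year - y2), 1e-9)  # counterfactual
    p = freq[target_year]                                  # observed
    return p - q, p / q
```

Style words with a large gap and ratio in 2024 (the paper's examples include "delves") mark excess vocabulary, which is what permits a lower-bound estimate of LLM-processed abstracts without classifying any individual text.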
Related papers
- ChatGPT as Linguistic Equalizer? Quantifying LLM-Driven Lexical Shifts in Academic Writing [2.0117661599862164]
This study investigates whether ChatGPT mitigates barriers and fosters equity by analyzing lexical complexity shifts across 2.8 million articles from OpenAlex (2020-2024).
We demonstrate that ChatGPT significantly enhances lexical complexity in NNES-authored abstracts, even after controlling for article-level characteristics, authorship patterns, and venue norms.
These findings provide causal evidence that ChatGPT reduces linguistic disparities and promotes equity in global academia.
arXiv Detail & Related papers (2025-04-10T14:11:24Z) - Human-LLM Coevolution: Evidence from Academic Writing [0.0]
We report a marked drop in the frequency of several words previously identified as overused by ChatGPT, such as "delve".
The frequency of certain other words favored by ChatGPT, such as "significant", has instead kept increasing.
arXiv Detail & Related papers (2025-02-13T18:55:56Z) - Why Does ChatGPT "Delve" So Much? Exploring the Sources of Lexical Overrepresentation in Large Language Models [0.0]
It is widely assumed that scientists' use of large language models (LLMs) is responsible for language change.
We develop a formal, transferable method to characterize these linguistic changes.
We find 21 focal words whose increased occurrence in scientific abstracts is likely the result of LLM usage.
We assess whether reinforcement learning from human feedback contributes to the overuse of focal words.
arXiv Detail & Related papers (2024-12-16T02:27:59Z) - Do LLMs write like humans? Variation in grammatical and rhetorical styles [0.7852714805965528]
We study the rhetorical styles of large language models (LLMs).
Using Douglas Biber's set of lexical, grammatical, and rhetorical features, we identify systematic differences between LLMs and humans.
This demonstrates that despite their advanced abilities, LLMs struggle to match human styles.
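Biber's framework scores texts on dozens of lexical, grammatical, and rhetorical features. As a rough illustration only, the sketch below computes two such features with regex heuristics; a real implementation would rely on a part-of-speech tagger, and the specific patterns and per-1,000-words normalization are assumptions, not this paper's pipeline.

```python
import re

# Illustrative stand-ins for two Biber-style features: nominalizations
# (-tion/-ment/-ness/-ity words) and agentless passives (a form of "be"
# plus a past participle not followed by "by"). Regexes are heuristic.
NOMINALIZATION = re.compile(r"\b\w+(?:tion|ment|ness|ity)s?\b", re.IGNORECASE)
PASSIVE = re.compile(
    r"\b(?:is|are|was|were|been|being|be)\s+\w+(?:ed|en)\b(?!\s+by\b)",
    re.IGNORECASE,
)

def feature_rates(text):
    """Occurrences per 1,000 words for each illustrative feature."""
    scale = 1000 / max(len(text.split()), 1)
    return {
        "nominalizations": len(NOMINALIZATION.findall(text)) * scale,
        "agentless_passives": len(PASSIVE.findall(text)) * scale,
    }

print(feature_rates(
    "The evaluation was conducted carefully, and its effectiveness is demonstrated."
))
```

Aggregating many such feature rates per document yields the style profiles on which LLM-generated and human-written text can be compared.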
arXiv Detail & Related papers (2024-10-21T15:35:44Z) - Transforming Scholarly Landscapes: Influence of Large Language Models on Academic Fields beyond Computer Science [77.31665252336157]
Large Language Models (LLMs) have ushered in a transformative era in Natural Language Processing (NLP).
This work empirically examines the influence and use of LLMs in fields beyond NLP.
arXiv Detail & Related papers (2024-09-29T01:32:35Z) - The Impact of Large Language Models in Academia: from Writing to Speaking [42.1505375956748]
We examined and compared the words used in writing and speaking based on more than 30,000 papers and 1,000 presentations from machine learning conferences.
Our results show that LLM-style words such as "significant" have been used more frequently in abstracts and oral presentations.
The impact on speaking is beginning to emerge and is likely to grow in the future, calling attention to the implicit influence and ripple effect of LLMs on human society.
arXiv Detail & Related papers (2024-09-20T17:54:16Z) - LFED: A Literary Fiction Evaluation Dataset for Large Language Models [58.85989777743013]
We collect 95 literary fictions that are either originally written in Chinese or translated into Chinese, covering a wide range of topics across several centuries.
We define a question taxonomy with 8 question categories to guide the creation of 1,304 questions.
We conduct an in-depth analysis to ascertain how specific attributes of literary fiction (e.g., novel type, number of characters, year of publication) impact LLM performance in evaluations.
arXiv Detail & Related papers (2024-05-16T15:02:24Z) - Is ChatGPT Transforming Academics' Writing Style? [0.0]
Based on one million arXiv papers submitted from May 2018 to January 2024, we assess the textual density of ChatGPT's writing style in paper abstracts.
We find that large language models (LLMs), represented by ChatGPT, are having an increasing impact on arXiv abstracts.
arXiv Detail & Related papers (2024-04-12T17:41:05Z) - Mapping the Increasing Use of LLMs in Scientific Papers [99.67983375899719]
We conduct the first systematic, large-scale analysis across 950,965 papers published between January 2020 and February 2024 on the arXiv, bioRxiv, and Nature portfolio journals.
Our findings reveal a steady increase in LLM usage, with the largest and fastest growth observed in Computer Science papers.
arXiv Detail & Related papers (2024-04-01T17:45:15Z) - Beware of Words: Evaluating the Lexical Diversity of Conversational LLMs using ChatGPT as Case Study [3.0059120458540383]
We evaluate the lexical richness of text generated by conversational Large Language Models (LLMs) and how it depends on model parameters.
The results show how lexical richness depends on the version of ChatGPT and some of its parameters, such as the presence penalty, or on the role assigned to the model.
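Lexical richness is commonly quantified with type-token-style measures. The sketch below shows two standard ones, plain type-token ratio (TTR) and its moving-average variant (MATTR), as an assumed stand-in for whichever metrics the paper uses; plain TTR shrinks as texts get longer, which is why windowed variants exist.

```python
def type_token_ratio(tokens):
    """Distinct tokens divided by total tokens; shrinks with text length."""
    return len(set(tokens)) / len(tokens)

def mattr(tokens, window=100):
    """Moving-average TTR: mean TTR over fixed-size sliding windows,
    making texts of different lengths comparable."""
    if len(tokens) <= window:
        return type_token_ratio(tokens)
    ratios = [
        type_token_ratio(tokens[i:i + window])
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)

tokens = "the model generates text and the model repeats the text".split()
print(type_token_ratio(tokens), mattr(tokens, window=5))
```

Comparing such scores across ChatGPT versions, sampling parameters like the presence penalty, or assigned roles then reduces to scoring each generated corpus the same way.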
arXiv Detail & Related papers (2024-02-11T13:41:17Z) - Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks.
LLMs' ability to understand and generate general-purpose language is acquired by training billions of model parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z) - A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models [7.705767540805267]
Large Language Models (LLMs) continue to advance in their ability to write human-like text.
A key challenge remains their tendency to hallucinate: generating content that appears factual but is ungrounded.
This paper presents a survey of over 32 techniques developed to mitigate hallucination in LLMs.
arXiv Detail & Related papers (2024-01-02T17:56:30Z) - How should the advent of large language models affect the practice of science? [51.62881233954798]
How should the advent of large language models affect the practice of science?
We have invited four diverse groups of scientists to reflect on this query, sharing their perspectives and engaging in debate.
arXiv Detail & Related papers (2023-12-05T10:45:12Z) - The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics [74.99898531299148]
This research examines vocabulary trimming (VT) inspired by restricting embedding entries to the language of interest to bolster time and memory efficiency.
We apply two language heuristics to trim the full vocabulary - Unicode-based script filtering and corpus-based selection - across different language families and model sizes.
It is found that VT reduces the memory usage of small models by nearly 50% and improves generation speed by up to 25%.
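Of the two heuristics, Unicode-based script filtering is the easier to sketch: keep only vocabulary entries whose alphabetic characters belong to the target script, shrinking the embedding and output layers. A minimal Python sketch follows; the toy dict-based vocabulary and the character-name test are illustrative assumptions, and real subword vocabularies (byte-level BPE artifacts, mixed-script tokens) need more careful handling.

```python
import unicodedata

def in_script(token, allowed_prefixes=("LATIN",)):
    """True if every alphabetic character's Unicode name starts with an
    allowed script prefix (digits and punctuation always pass)."""
    return all(
        unicodedata.name(ch, "").startswith(allowed_prefixes)
        for ch in token if ch.isalpha()
    )

def trim_vocabulary(vocab, allowed_prefixes=("LATIN",)):
    """Drop entries outside the target script, shrinking the embedding
    and softmax matrices that scale with vocabulary size."""
    return {t: i for t, i in vocab.items() if in_script(t, allowed_prefixes)}

vocab = {"hello": 0, "мир": 1, "123": 2, "café": 3}
print(trim_vocabulary(vocab))  # the Cyrillic entry is dropped
```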
arXiv Detail & Related papers (2023-11-16T09:35:50Z) - "Kelly is a Warm Person, Joseph is a Role Model": Gender Biases in
LLM-Generated Reference Letters [97.11173801187816]
Large Language Models (LLMs) have recently emerged as an effective tool to assist individuals in writing various types of content.
This paper critically examines gender biases in LLM-generated reference letters.
arXiv Detail & Related papers (2023-10-13T16:12:57Z) - Chain-of-Dictionary Prompting Elicits Translation in Large Language Models [100.47154959254937]
Large language models (LLMs) have shown surprisingly good performance in multilingual neural machine translation (MNMT).
We present a novel method, CoD, which augments LLMs with prior knowledge in the form of chained multilingual dictionary entries for a subset of input words, eliciting their translation abilities.
arXiv Detail & Related papers (2023-05-11T05:19:47Z)
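The core of the method is a prompt that chains dictionary translations of selected source words across several languages before the translation request. The sketch below is a rough reconstruction of that pattern from the abstract; the exact template and how CoD selects words and languages are not specified here, so treat the format as an assumption.

```python
def chain_of_dictionary_prompt(sentence, src_lang, tgt_lang, chains):
    """Build a CoD-style prompt: chained multilingual dictionary hints
    for selected source words, followed by the translation request.

    `chains` maps a source word to translations in a chain of languages,
    e.g. {"ambiguous": {"French": "ambigu", "German": "mehrdeutig"}}.
    """
    hints = "\n".join(
        " means ".join([f'"{w}"'] + [f'"{t}" in {lang}'
                                     for lang, t in trans.items()]) + "."
        for w, trans in chains.items()
    )
    return (f"{hints}\n"
            f"Translate the following {src_lang} sentence into {tgt_lang}:\n"
            f"{sentence}")

print(chain_of_dictionary_prompt(
    "The ruling was ambiguous.", "English", "German",
    {"ambiguous": {"French": "ambigu", "German": "mehrdeutig"}},
))
```

The dictionary hints ground rare or ambiguous source words before the model translates, which is where the reported gains over plain prompting would come from.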