Delving into ChatGPT usage in academic writing through excess vocabulary
- URL: http://arxiv.org/abs/2406.07016v4
- Date: Wed, 19 Feb 2025 22:15:24 GMT
- Title: Delving into ChatGPT usage in academic writing through excess vocabulary
- Authors: Dmitry Kobak, Rita González-Márquez, Emőke-Ágnes Horvát, Jan Lause
- Abstract summary: Large language models (LLMs) like ChatGPT can generate and revise text with human-level performance.
We study vocabulary changes in 14 million PubMed abstracts from 2010–2024, and show how the appearance of LLMs led to an abrupt increase in the frequency of certain style words.
We show that LLMs have had an unprecedented impact on the scientific literature, surpassing the effect of major world events such as the Covid pandemic.
- Score: 4.58733012283457
- Abstract: Large language models (LLMs) like ChatGPT can generate and revise text with human-level performance. These models come with clear limitations: they can produce inaccurate information, reinforce existing biases, and be easily misused. Yet, many scientists use them for their scholarly writing. But how widespread is such LLM usage in the academic literature? To answer this question, we present an unbiased, large-scale approach: we study vocabulary changes in 14 million PubMed abstracts from 2010–2024, and show how the appearance of LLMs led to an abrupt increase in the frequency of certain style words. This excess word analysis suggests that at least 10% of 2024 abstracts were processed with LLMs. This lower bound differed across disciplines, countries, and journals, reaching 30% for some sub-corpora. We show that LLMs have had an unprecedented impact on the scientific literature, surpassing the effect of major world events such as the Covid pandemic.
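The excess-word analysis comes down to comparing each word's observed frequency in post-LLM abstracts against a counterfactual frequency extrapolated from pre-LLM years. A minimal sketch of that comparison in Python (the linear two-year extrapolation window and the function names are illustrative assumptions, not the authors' exact pipeline):

```python
def yearly_frequency(abstracts_by_year, word):
    """Fraction of abstracts in each year whose text contains `word`."""
    freq = {}
    for year, abstracts in abstracts_by_year.items():
        hits = sum(word in a.lower().split() for a in abstracts)
        freq[year] = hits / len(abstracts)
    return freq

def excess_usage(freq, target_year=2024, base_years=(2021, 2022)):
    """Excess gap and ratio of the observed frequency over a counterfactual
    extrapolated linearly from pre-LLM base years (an illustrative choice)."""
    y1, y2 = base_years
    slope = (freq[y2] - freq[y1]) / (y2 - y1)
    counterfactual = max(freq[y2] + slope * (target_year - y2), 1e-9)
    gap = freq[target_year] - counterfactual    # absolute excess
    ratio = freq[target_year] / counterfactual  # relative excess
    return gap, ratio
```

Style markers such as "delves" show 2024 ratios far above 1 under this kind of comparison, which is what makes a lower bound on LLM-processed abstracts computable from word counts alone.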
Related papers
- Human-LLM Coevolution: Evidence from Academic Writing [0.0]
We report a marked drop in the frequency of several words previously identified as overused by ChatGPT, such as "delve".
The frequency of certain other words favored by ChatGPT, such as "significant", has instead kept increasing.
arXiv Detail & Related papers (2025-02-13T18:55:56Z)
- Why Does ChatGPT "Delve" So Much? Exploring the Sources of Lexical Overrepresentation in Large Language Models [0.0]
It is widely assumed that scientists' use of large language models (LLMs) is responsible for language change.
We develop a formal, transferable method to characterize these linguistic changes.
We find 21 focal words whose increased occurrence in scientific abstracts is likely the result of LLM usage.
We assess whether reinforcement learning from human feedback contributes to the overuse of focal words.
arXiv Detail & Related papers (2024-12-16T02:27:59Z)
- The Impact of Large Language Models in Academia: from Writing to Speaking [42.1505375956748]
We examined and compared the words used in writing and speaking based on more than 30,000 papers and 1,000 presentations from machine learning conferences.
Our results show that LLM-style words such as "significant" have been used more frequently in abstracts and oral presentations.
The impact on speaking is beginning to emerge and is likely to grow in the future, calling attention to the implicit influence and ripple effect of LLMs on human society.
arXiv Detail & Related papers (2024-09-20T17:54:16Z)
- LFED: A Literary Fiction Evaluation Dataset for Large Language Models [58.85989777743013]
We collect 95 works of literary fiction that were either originally written in Chinese or translated into Chinese, covering a wide range of topics across several centuries.
We define a question taxonomy with 8 question categories to guide the creation of 1,304 questions.
We conduct an in-depth analysis to ascertain how specific attributes of literary fictions (e.g., novel type, number of characters, year of publication) impact LLM performance in evaluations.
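Such an attribute analysis reduces to grouping per-question accuracy by each attribute of the fiction; a minimal sketch, assuming one record per question with hypothetical field names:

```python
import pandas as pd

# Hypothetical evaluation records: one row per (fiction, question) pair.
records = pd.DataFrame([
    {"novel_type": "wuxia",     "era": "20th century", "correct": True},
    {"novel_type": "wuxia",     "era": "20th century", "correct": False},
    {"novel_type": "classical", "era": "18th century", "correct": False},
])

# Accuracy broken down by each fiction attribute of interest.
for attribute in ["novel_type", "era"]:
    print(records.groupby(attribute)["correct"].mean())
```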
arXiv Detail & Related papers (2024-05-16T15:02:24Z)
- Is ChatGPT Transforming Academics' Writing Style? [0.0]
Based on one million arXiv papers submitted from May 2018 to January 2024, we assess the textual density of ChatGPT's writing style in the papers' abstracts.
We find that large language models (LLMs), represented by ChatGPT, are having an increasing impact on arXiv abstracts.
arXiv Detail & Related papers (2024-04-12T17:41:05Z)
- Mapping the Increasing Use of LLMs in Scientific Papers [99.67983375899719]
We conduct the first systematic, large-scale analysis across 950,965 papers published between January 2020 and February 2024 on the arXiv, bioRxiv, and Nature portfolio journals.
Our findings reveal a steady increase in LLM usage, with the largest and fastest growth observed in Computer Science papers.
arXiv Detail & Related papers (2024-04-01T17:45:15Z)
- Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks.
LLMs acquire their general-purpose language understanding and generation abilities by training billions of model parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z)
- A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models [7.705767540805267]
Large Language Models (LLMs) continue to advance in their ability to write human-like text.
A key challenge remains their tendency to hallucinate: generating content that appears factual but is ungrounded.
This paper presents a survey of over 32 techniques developed to mitigate hallucination in LLMs.
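One widely used family that such surveys cover is retrieval-augmented generation (RAG), which grounds the prompt in retrieved passages so the model has less room to invent facts. A minimal sketch with a toy keyword-overlap retriever (the corpus, scoring, and prompt wording are illustrative assumptions, not the survey's own code):

```python
def retrieve(query, corpus, k=2):
    """Rank passages by naive keyword overlap with the query (toy retriever)."""
    q_words = set(query.lower().split())
    return sorted(corpus,
                  key=lambda p: len(q_words & set(p.lower().split())),
                  reverse=True)[:k]

def grounded_prompt(question, corpus):
    """Prepend retrieved evidence so the model must answer from context."""
    context = "\n".join(f"- {p}" for p in retrieve(question, corpus))
    return (f"Answer using ONLY the context below; otherwise say 'unknown'.\n"
            f"Context:\n{context}\n"
            f"Question: {question}\nAnswer:")

corpus = ["PubMed indexes biomedical abstracts.",
          "arXiv hosts preprints in physics and computer science."]
print(grounded_prompt("What does PubMed index?", corpus))
```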
arXiv Detail & Related papers (2024-01-02T17:56:30Z)
- How should the advent of large language models affect the practice of science? [51.62881233954798]
How should the advent of large language models affect the practice of science?
We have invited four diverse groups of scientists to reflect on this query, sharing their perspectives and engaging in debate.
arXiv Detail & Related papers (2023-12-05T10:45:12Z)
- The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics [74.99898531299148]
This research examines vocabulary trimming (VT), which restricts embedding entries to the language of interest to bolster time and memory efficiency.
We apply two language heuristics to trim the full vocabulary - Unicode-based script filtering and corpus-based selection - across different language families and model sizes.
VT reduces the memory usage of small models by nearly 50% and improves generation speed by up to 25%.
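The trimming itself amounts to keeping only vocabulary entries that match the target language and slicing the embedding matrix to the surviving rows. A minimal sketch of the Unicode-script variant (the {token: index} vocabulary format and the script test are illustrative assumptions, not the paper's exact procedure):

```python
import unicodedata

def in_script(token, script_prefix="LATIN"):
    """True if every alphabetic character in `token` belongs to one script."""
    letters = [c for c in token if c.isalpha()]
    return all(unicodedata.name(c, "").startswith(script_prefix) for c in letters)

def trim_vocabulary(vocab, embeddings, script_prefix="LATIN"):
    """Filter a {token: row_index} vocab and slice the embedding rows to match."""
    kept = [(tok, idx) for tok, idx in vocab.items() if in_script(tok, script_prefix)]
    new_vocab = {tok: new_idx for new_idx, (tok, _) in enumerate(kept)}
    new_embeddings = [embeddings[idx] for _, idx in kept]
    return new_vocab, new_embeddings
```

The corpus-based variant would instead keep tokens that actually occur in a target-language corpus.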
arXiv Detail & Related papers (2023-11-16T09:35:50Z)
- Chain-of-Dictionary Prompting Elicits Translation in Large Language Models [100.47154959254937]
Large language models (LLMs) have shown surprisingly good performance in multilingual neural machine translation (MNMT).
We present a novel method, CoD, which augments LLMs with prior knowledge from chains of multilingual dictionaries for a subset of input words to elicit translation abilities.
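CoD works at the prompt level: for selected source words, it prepends chained dictionary entries across several languages before the translation request. A minimal sketch of how such a prompt might be assembled (the dictionary structure and wording are illustrative assumptions, not the paper's exact template):

```python
def chain_of_dictionary_prompt(sentence, src, tgt, chains):
    """Assemble a CoD-style prompt: multilingual dictionary chains for a
    subset of input words, then the translation instruction."""
    hints = []
    for word, translations in chains.items():
        chain = " means ".join(f"'{t}' in {lang}" for lang, t in translations.items())
        hints.append(f"'{word}' means {chain}.")
    return "\n".join(hints) + f"\nTranslate the following {src} sentence into {tgt}: {sentence}"

chains = {"village": {"French": "village", "German": "Dorf"}}
print(chain_of_dictionary_prompt("The village is quiet.", "English", "German", chains))
```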
arXiv Detail & Related papers (2023-05-11T05:19:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.