Are LLMs Good Literature Review Writers? Evaluating the Literature Review Writing Ability of Large Language Models
- URL: http://arxiv.org/abs/2412.13612v1
- Date: Wed, 18 Dec 2024 08:42:25 GMT
- Title: Are LLMs Good Literature Review Writers? Evaluating the Literature Review Writing Ability of Large Language Models
- Authors: Xuemei Tang, Xufeng Duan, Zhenguang G. Cai
- Abstract summary: We propose a framework to assess the literature review writing ability of large language models automatically. We evaluate the performance of LLMs across three tasks: generating references, writing abstracts, and writing literature reviews.
- Score: 2.048226951354646
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The literature review is a crucial form of academic writing that involves complex processes of literature collection, organization, and summarization. The emergence of large language models (LLMs) has introduced promising tools to automate these processes. However, their actual capabilities in writing comprehensive literature reviews remain underexplored; for example, it is unclear whether they can generate accurate and reliable references. To address this gap, we propose a framework to assess the literature review writing ability of LLMs automatically. We evaluate the performance of LLMs across three tasks: generating references, writing abstracts, and writing literature reviews. We employ external tools for a multidimensional evaluation, which includes assessing hallucination rates in references, semantic coverage, and factual consistency with human-written context. By analyzing the experimental results, we find that, despite advancements, even the most sophisticated models still cannot avoid generating hallucinated references. Additionally, different models exhibit varying performance in literature review writing across different disciplines.
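- Illustration: The abstract mentions assessing hallucination rates in references. As a rough sketch only, and not the authors' actual pipeline, one way to compute such a rate is to check each generated reference title against a trusted index. The function names, the title-normalization rule, and the sample data below are all hypothetical.

```python
import re


def normalize_title(title: str) -> str:
    """Lowercase a title and collapse punctuation/whitespace runs into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def hallucination_rate(generated_titles, known_titles):
    """Return the fraction of generated titles that are absent from the trusted index.

    Assumption: a reference counts as hallucinated when its normalized title
    has no exact match in the index. A real evaluation would likely also
    compare authors, venue, and year, and tolerate near matches.
    """
    if not generated_titles:
        return 0.0
    known = {normalize_title(t) for t in known_titles}
    missing = [t for t in generated_titles if normalize_title(t) not in known]
    return len(missing) / len(generated_titles)


# Hypothetical trusted index and hypothetical model output.
known_index = [
    "Attention Is All You Need",
    "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding",
]
generated = [
    "Attention is all you need.",
    "A Survey of Imaginary Transformers",
]

print(hallucination_rate(generated, known_index))
```

Exact matching on normalized titles is deliberately strict; it keeps the sketch short but would overcount hallucinations when a model paraphrases a real title.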
Related papers
- Modelling and Classifying the Components of a Literature Review [0.0]
We present a novel benchmark comprising 700 sentences manually annotated by domain experts and 2,240 sentences automatically labelled using large language models (LLMs). The experiments yield several novel insights that advance the state of the art in this challenging domain.
arXiv Detail & Related papers (2025-08-06T11:30:07Z) - Evaluating book summaries from internal knowledge in Large Language Models: a cross-model and semantic consistency approach [0.0]
We study the ability of large language models (LLMs) to generate comprehensive and accurate book summaries.
We examine whether these models can synthesize meaningful narratives that align with established human interpretations.
arXiv Detail & Related papers (2025-03-27T15:36:24Z) - Leveraging Large Language Models for Comparative Literature Summarization with Reflective Incremental Mechanisms [44.99833362998488]
ChatCite is a novel method leveraging large language models (LLMs) for generating comparative literature summaries.
We evaluate ChatCite on a custom dataset, CompLit-LongContext, consisting of 1000 research papers with annotated comparative summaries.
arXiv Detail & Related papers (2024-12-03T04:09:36Z) - Mixture of Knowledge Minigraph Agents for Literature Review Generation [22.80918934436901]
This paper proposes a novel framework, collaborative knowledge minigraph agents (CKMAs), to automate scholarly literature reviews.
A novel prompt-based algorithm, the knowledge minigraph construction agent (KMCA), is designed to identify relations between concepts from academic literature and automatically construct knowledge minigraphs.
By leveraging the capabilities of large language models on constructed knowledge minigraphs, the multiple path summarization agent (MPSA) efficiently organizes concepts and relations from different viewpoints to generate literature review paragraphs.
arXiv Detail & Related papers (2024-11-09T12:06:40Z) - A Comparative Study of Quality Evaluation Methods for Text Summarization [0.5512295869673147]
This paper proposes a novel method based on large language models (LLMs) for evaluating text summarization.
Our results show that LLM evaluation aligns closely with human evaluation, while widely used automatic metrics such as ROUGE-2, BERTScore, and SummaC do not and also lack consistency.
arXiv Detail & Related papers (2024-06-30T16:12:37Z) - Inclusivity in Large Language Models: Personality Traits and Gender Bias in Scientific Abstracts [49.97673761305336]
We evaluate three large language models (LLMs) for their alignment with human narrative styles and potential gender biases.
Our findings indicate that, while these models generally produce text closely resembling human authored content, variations in stylistic features suggest significant gender biases.
arXiv Detail & Related papers (2024-06-27T19:26:11Z) - LFED: A Literary Fiction Evaluation Dataset for Large Language Models [58.85989777743013]
We collect 95 literary fictions that are either originally written in Chinese or translated into Chinese, covering a wide range of topics across several centuries.
We define a question taxonomy with 8 question categories to guide the creation of 1,304 questions.
We conduct an in-depth analysis to ascertain how specific attributes of literary fictions (e.g., novel types, character numbers, the year of publication) impact LLM performance in evaluations.
arXiv Detail & Related papers (2024-05-16T15:02:24Z) - ChatCite: LLM Agent with Human Workflow Guidance for Comparative
Literature Summary [30.409552944905915]
ChatCite is an LLM agent with human workflow guidance for comparative literature summary.
The ChatCite agent outperformed other models in various dimensions in the experiments.
The literature summaries generated by ChatCite can also be directly used for drafting literature reviews.
arXiv Detail & Related papers (2024-03-05T01:13:56Z) - Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers [25.268709339109893]
We evaluate recent Large Language Models (LLMs) on the challenging task of summarizing short stories.
We work directly with authors to ensure that the stories have not been shared online (and therefore are unseen by the models).
We compare GPT-4, Claude-2.1, and Llama-2-70B and find that all three models make faithfulness mistakes in over 50% of summaries.
arXiv Detail & Related papers (2024-03-02T01:52:14Z) - A Literature Review of Literature Reviews in Pattern Analysis and Machine Intelligence [55.33653554387953]
Progress in Pattern Analysis and Machine Intelligence (PAMI) has led to numerous literature reviews aimed at collecting and synthesizing fragmented information. This paper presents a thorough analysis of these literature reviews within the PAMI field. We try to address three core research questions: (1) What are the prevalent structural and statistical characteristics of PAMI literature reviews; (2) What strategies can researchers employ to efficiently navigate the growing corpus of reviews; and (3) What are the advantages and limitations of AI-generated reviews compared to human-authored ones?
arXiv Detail & Related papers (2024-02-20T11:28:50Z) - Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z) - Evaluating Large Language Model Creativity from a Literary Perspective [13.672268920902187]
This paper assesses the potential for large language models to serve as assistive tools in the creative writing process.
We develop interactive and multi-voice prompting strategies that interleave background descriptions, instructions that guide composition, samples of text in the target style, and critical discussion of the given samples.
arXiv Detail & Related papers (2023-11-30T16:46:25Z) - BLESS: Benchmarking Large Language Models on Sentence Simplification [55.461555829492866]
We present BLESS, a performance benchmark of the most recent state-of-the-art large language models (LLMs) on the task of text simplification (TS).
We assess a total of 44 models, differing in size, architecture, pre-training methods, and accessibility, on three test sets from different domains (Wikipedia, news, and medical) under a few-shot setting.
Our evaluation indicates that the best LLMs, despite not being trained on TS, perform comparably with state-of-the-art TS baselines.
arXiv Detail & Related papers (2023-10-24T12:18:17Z) - Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z) - Summarization is (Almost) Dead [49.360752383801305]
We develop new datasets and conduct human evaluation experiments to evaluate the zero-shot generation capability of large language models (LLMs).
Our findings indicate a clear preference among human evaluators for LLM-generated summaries over human-written summaries and summaries generated by fine-tuned models.
arXiv Detail & Related papers (2023-09-18T08:13:01Z) - Decoding the End-to-end Writing Trajectory in Scholarly Manuscripts [7.294418916091011]
We introduce a novel taxonomy that categorizes scholarly writing behaviors according to intention, writer actions, and the information types of the written data.
Motivated by cognitive writing theory, our taxonomy for scientific papers includes three levels of categorization in order to trace the general writing flow.
ManuScript intends to provide a complete picture of the scholarly writing process by capturing the linearity and non-linearity of writing trajectory.
arXiv Detail & Related papers (2023-03-31T20:33:03Z) - Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood.
We find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.