Related papers: The emergence of Large Language Models (LLM) as a tool in literature reviews: an LLM automated systematic review

The emergence of Large Language Models (LLM) as a tool in literature reviews: an LLM automated systematic review

URL: http://arxiv.org/abs/2409.04600v1
Date: Fri, 6 Sep 2024 20:12:57 GMT
Title: The emergence of Large Language Models (LLM) as a tool in literature reviews: an LLM automated systematic review
Authors: Dmitry Scherbakov, Nina Hubig, Vinita Jansari, Alexander Bakumenko, Leslie A. Lenert,
Abstract summary: This study aims to summarize the usage of Large Language Models (LLMs) in the process of creating a scientific review. We look at the range of stages in a review that can be automated and assess the current state-of-the-art research projects in the field.
Score: 42.112100361891905
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Objective: This study aims to summarize the usage of Large Language Models (LLMs) in the process of creating a scientific review. We look at the range of stages in a review that can be automated and assess the current state-of-the-art research projects in the field. Materials and Methods: The search was conducted in June 2024 in PubMed, Scopus, Dimensions, and Google Scholar databases by human reviewers. Screening and extraction process took place in Covidence with the help of LLM add-on which uses OpenAI gpt-4o model. ChatGPT was used to clean extracted data and generate code for figures in this manuscript, ChatGPT and Scite.ai were used in drafting all components of the manuscript, except the methods and discussion sections. Results: 3,788 articles were retrieved, and 172 studies were deemed eligible for the final review. ChatGPT and GPT-based LLM emerged as the most dominant architecture for review automation (n=126, 73.2%). A significant number of review automation projects were found, but only a limited number of papers (n=26, 15.1%) were actual reviews that used LLM during their creation. Most citations focused on automation of a particular stage of review, such as Searching for publications (n=60, 34.9%), and Data extraction (n=54, 31.4%). When comparing pooled performance of GPT-based and BERT-based models, the former were better in data extraction with mean precision 83.0% (SD=10.4), and recall 86.0% (SD=9.8), while being slightly less accurate in title and abstract screening stage (Maccuracy=77.3%, SD=13.0). Discussion/Conclusion: Our LLM-assisted systematic review revealed a significant number of research projects related to review automation using LLMs. The results looked promising, and we anticipate that LLMs will change in the near future the way the scientific reviews are conducted.

Related papers

SESR-Eval: Dataset for Evaluating LLMs in the Title-Abstract Screening of Systematic Reviews [0.9421843976231371]
We create a benchmark dataset to evaluate the performance of large language models (LLMs) in the title-abstract screening process of systematic reviews (SRs)<n>We present the SESR-Eval dataset containing 34,528 labeled primary studies, from 24 secondary studies published in software engineering (SE) journals.<n>Our benchmark enables monitoring AI performance in the screening task of SRs in software engineering.
arXiv Detail & Related papers (2025-07-25T07:27:03Z)
Can Large Language Models Be Trusted Paper Reviewers? A Feasibility Study [24.387202495452886]
This work explores the feasibility of using Large Language Models (LLMs) for academic paper review.<n>The system integrates Retrieval Augmented Generation (RAG), the AutoGen multi-agent system, and Chain-of-Thought prompting to support tasks such as format checking, standardized evaluation, comment generation, and scoring.<n>Experiments conducted on 290 submissions from the WASA 2024 conference using GPT-4o show that LLM-based review significantly reduces review time (average 2.48 hours) and cost (average $104.28 USD)
arXiv Detail & Related papers (2025-06-18T10:19:18Z)
Improving Automatic Evaluation of Large Language Models (LLMs) in Biomedical Relation Extraction via LLMs-as-the-Judge [7.064104563689608]
Large Language Models (LLMs) have demonstrated impressive performance in biomedical relation extraction.<n>This paper investigates the use of LLMs-as-the-Judge as an alternative evaluation method for biomedical relation extraction.
arXiv Detail & Related papers (2025-06-01T02:01:52Z)
MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs [54.5729817345543]
MOLE is a framework that automatically extracts metadata attributes from scientific papers covering datasets of languages other than Arabic.<n>Our methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output.
arXiv Detail & Related papers (2025-05-26T10:31:26Z)
An Empirical Study of Many-to-Many Summarization with Large Language Models [82.10000188179168]
Large language models (LLMs) have shown strong multi-lingual abilities, giving them the potential to perform Many-to-many summarization (M2MS) in real applications.<n>This work presents a systematic empirical study on LLMs' M2MS ability.
arXiv Detail & Related papers (2025-05-19T11:18:54Z)
Highlighting Case Studies in LLM Literature Review of Interdisciplinary System Science [0.18416014644193066]
Large Language Models (LLMs) were used to assist four Commonwealth Scientific and Industrial Research Organisation (CSIRO) researchers. We evaluate the performance of LLMs for systematic literature reviews.
arXiv Detail & Related papers (2025-03-16T05:52:18Z)
Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering [18.766132076075365]
Large language models (LLMs) have been deployed to tackle various software engineering (SE) tasks like code generation. Pass@k metric requires extensive unit tests and configured environments, and is not suitable for evaluating LLM-generated text. Conventional metrics like BLEU, which measure only lexical rather than semantic similarity, have also come under scrutiny.
arXiv Detail & Related papers (2025-02-10T06:49:29Z)
LitLLMs, LLMs for Literature Review: Are we there yet? [15.785989492351684]
This paper explores the zero-shot abilities of recent Large Language Models in assisting with the writing of literature reviews based on an abstract. For retrieval, we introduce a novel two-step search strategy that first uses an LLM to extract meaningful keywords from the abstract of a paper. In the generation phase, we propose a two-step approach that first outlines a plan for the review and then executes steps in the plan to generate the actual review.
arXiv Detail & Related papers (2024-12-15T01:12:26Z)
Are We There Yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review [66.73247554182376]
Large language models (LLMs) have led to their integration into peer review. The unchecked adoption of LLMs poses significant risks to the integrity of the peer review system. We show that manipulating 5% of the reviews could potentially cause 12% of the papers to lose their position in the top 30% rankings.
arXiv Detail & Related papers (2024-12-02T16:55:03Z)
A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look [52.114284476700874]
This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed. We find that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness. Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring obvious tangible benefits.
arXiv Detail & Related papers (2024-11-13T01:12:35Z)
SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks. SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
Exploring the use of a Large Language Model for data extraction in systematic reviews: a rapid feasibility study [0.28318468414401093]
This paper describes a rapid feasibility study of using GPT-4, a large language model (LLM), to (semi)automate data extraction in systematic reviews. Overall, results indicated an accuracy of around 80%, with some variability between domains.
arXiv Detail & Related papers (2024-05-23T11:24:23Z)
REASONS: A benchmark for REtrieval and Automated citationS Of scieNtific Sentences using Public and Proprietary LLMs [41.64918533152914]
We investigate whether large language models (LLMs) are capable of generating references based on two forms of sentence queries. From around 20K research articles, we make the following deductions on public and proprietary LLMs. Our study contributes valuable insights into the reliability of RAG for automated citation generation tasks.
arXiv Detail & Related papers (2024-05-03T16:38:51Z)
MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization [86.61052121715689]
MatPlotAgent is a model-agnostic framework designed to automate scientific data visualization tasks. MatPlotBench is a high-quality benchmark consisting of 100 human-verified test cases.
arXiv Detail & Related papers (2024-02-18T04:28:28Z)
Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks. LLMs' ability of general-purpose language understanding and generation is acquired by training billions of model's parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z)
PRE: A Peer Review Based Large Language Model Evaluator [14.585292530642603]
Existing paradigms rely on either human annotators or model-based evaluators to evaluate the performance of LLMs. We propose a novel framework that can automatically evaluate LLMs through a peer-review process.
arXiv Detail & Related papers (2024-01-28T12:33:14Z)
LLMEval: A Preliminary Study on How to Evaluate Large Language Models [47.12588320134504]
We analyze evaluation methods by comparing various criteria with both manual and automatic evaluation, utilizing onsite, crowd-sourcing, public annotators and GPT-4. A total of 2,186 individuals participated, leading to the generation of 243,337 manual annotations and 57,511 automatic evaluation results.
arXiv Detail & Related papers (2023-12-12T16:14:43Z)
BooookScore: A systematic exploration of book-length summarization in the era of LLMs [53.42917858142565]
We develop an automatic metric, BooookScore, that measures the proportion of sentences in a summary that do not contain any of the identified error types. We find that closed-source LLMs such as GPT-4 and 2 produce summaries with higher BooookScore than those generated by open-source models.
arXiv Detail & Related papers (2023-10-01T20:46:44Z)
Towards an Understanding of Large Language Models in Software Engineering Tasks [29.30433406449331]
Large Language Models (LLMs) have drawn widespread attention and research due to their astounding performance in text generation and reasoning tasks. The evaluation and optimization of LLMs in software engineering tasks, such as code generation, have become a research focus. This paper comprehensively investigate and collate the research and products combining LLMs with software engineering.
arXiv Detail & Related papers (2023-08-22T12:37:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.