Detecting Reference Errors in Scientific Literature with Large Language Models
- URL: http://arxiv.org/abs/2411.06101v1
- Date: Sat, 09 Nov 2024 07:30:38 GMT
- Title: Detecting Reference Errors in Scientific Literature with Large Language Models
- Authors: Tianmai M. Zhang, Neil F. Abernethy,
- Abstract summary: This work evaluated the ability of large language models in OpenAI's GPT family to detect quotation errors.
Results showed that large language models are able to detect erroneous citations with limited context and without fine-tuning.
- Score: 0.552480439325792
- License:
- Abstract: Reference errors, such as citation and quotation errors, are common in scientific papers. Such errors can result in the propagation of inaccurate information, but are difficult and time-consuming to detect, posing a significant challenge to scientific publishing. To support automatic detection of reference errors, this work evaluated the ability of large language models in OpenAI's GPT family to detect quotation errors. Specifically, we prepared an expert-annotated, general-domain dataset of statement-reference pairs from journal articles. Large language models were evaluated in different settings with varying amounts of reference information provided by retrieval augmentation. Our results showed that large language models are able to detect erroneous citations with limited context and without fine-tuning. This study contributes to the growing literature that seeks to utilize artificial intelligence to assist in the writing, reviewing, and publishing of scientific papers. Potential avenues for further improvements in this task are also discussed.
Related papers
- Modeling citation worthiness by using attention-based bidirectional long short-term memory networks and interpretable models [0.0]
We propose a Bidirectional Long Short-Term Memory (BiLSTM) network with attention mechanism and contextual information to detect sentences that need citations.
We produce a new, large dataset (PMOA-CITE) based on PubMed Open Access Subset, which is orders of magnitude larger than previous datasets.
arXiv Detail & Related papers (2024-05-20T17:45:36Z) - xCOMET: Transparent Machine Translation Evaluation through Fine-grained
Error Detection [21.116517555282314]
xCOMET is an open-source learned metric designed to bridge the gap between machine translation evaluation approaches.
It integrates both sentence-level evaluation and error span detection capabilities, exhibiting state-of-the-art performance across all types of evaluation.
We also provide a robustness analysis with stress tests, and show that xCOMET is largely capable of identifying localized critical errors and hallucinations.
arXiv Detail & Related papers (2023-10-16T15:03:14Z) - On the application of Large Language Models for language teaching and
assessment technology [18.735612275207853]
We look at the potential for incorporating large language models in AI-driven language teaching and assessment systems.
We find that larger language models offer improvements over previous models in text generation.
For automated grading and grammatical error correction, tasks whose progress is checked on well-known benchmarks, early investigations indicate that large language models on their own do not improve on state-of-the-art results.
arXiv Detail & Related papers (2023-07-17T11:12:56Z) - Towards Fine-Grained Information: Identifying the Type and Location of
Translation Errors [80.22825549235556]
Existing approaches can not synchronously consider error position and type.
We build an FG-TED model to predict the textbf addition and textbfomission errors.
Experiments show that our model can identify both error type and position concurrently, and gives state-of-the-art results.
arXiv Detail & Related papers (2023-02-17T16:20:33Z) - Annotation Error Detection: Analyzing the Past and Present for a More
Coherent Future [63.99570204416711]
We reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English datasets.
We define a uniform evaluation setup including a new formalization of the annotation error detection task.
We release our datasets and implementations in an easy-to-use and open source software package.
arXiv Detail & Related papers (2022-06-05T22:31:45Z) - Deep Graph Learning for Anomalous Citation Detection [55.81334139806342]
We propose a novel deep graph learning model, namely GLAD (Graph Learning for Anomaly Detection), to identify anomalies in citation networks.
Within the GLAD framework, we propose an algorithm called CPU (Citation PUrpose) to discover the purpose of citation based on citation texts.
arXiv Detail & Related papers (2022-02-23T09:05:28Z) - When Does Translation Require Context? A Data-driven, Multilingual
Exploration [71.43817945875433]
proper handling of discourse significantly contributes to the quality of machine translation (MT)
Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation.
We develop the Multilingual Discourse-Aware benchmark, a series of taggers that identify and evaluate model performance on discourse phenomena.
arXiv Detail & Related papers (2021-09-15T17:29:30Z) - CiteWorth: Cite-Worthiness Detection for Improved Scientific Document
Understanding [23.930041685595775]
We present an in-depth study of cite-worthiness detection in English, where a sentence is labelled for whether or not it cites an external source.
CiteWorth is high-quality, challenging, and suitable for studying problems such as domain adaptation.
arXiv Detail & Related papers (2021-05-23T11:08:45Z) - Curious Case of Language Generation Evaluation Metrics: A Cautionary
Tale [52.663117551150954]
A few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community for more careful consideration of how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z) - Limits of Detecting Text Generated by Large-Scale Language Models [65.46403462928319]
Some consider large-scale language models that can generate long and coherent pieces of text as dangerous, since they may be used in misinformation campaigns.
Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated.
arXiv Detail & Related papers (2020-02-09T19:53:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.