Evaluating the Predictive Capacity of ChatGPT for Academic Peer Review Outcomes Across Multiple Platforms
- URL: http://arxiv.org/abs/2411.09763v1
- Date: Thu, 14 Nov 2024 19:20:33 GMT
- Title: Evaluating the Predictive Capacity of ChatGPT for Academic Peer Review Outcomes Across Multiple Platforms
- Authors: Mike Thelwall, Abdullah Yaghi
- Abstract summary: This paper introduces two new contexts and employs a more robust method: averaging multiple ChatGPT scores.
The findings show that averaging 30 ChatGPT predictions, based on reviewer guidelines and using only submitted titles and abstracts, failed to predict peer review outcomes for F1000Research.
- Score: 3.3543455244780223
- License:
- Abstract: While previous studies have demonstrated that Large Language Models (LLMs) can predict peer review outcomes to some extent, this paper builds on that work by introducing two new contexts and employing a more robust method: averaging multiple ChatGPT scores. The findings show that averaging 30 ChatGPT predictions, based on reviewer guidelines and using only the submitted titles and abstracts, failed to predict peer review outcomes for F1000Research (Spearman's rho=0.00). However, it produced mostly weak positive correlations with the quality dimensions of SciPost Physics (rho=0.25 for validity, rho=0.25 for originality, rho=0.20 for significance, and rho=0.08 for clarity) and a moderate positive correlation for papers from the International Conference on Learning Representations (ICLR) (rho=0.38). Including the full text of articles significantly increased the correlation for ICLR (rho=0.46) and slightly improved it for F1000Research (rho=0.09), while it had variable effects on the four quality dimension correlations for the SciPost LaTeX files. The use of chain-of-thought system prompts slightly increased the correlation for F1000Research (rho=0.10), marginally reduced it for ICLR (rho=0.37), and further decreased it for SciPost Physics (rho=0.16 for validity, rho=0.18 for originality, rho=0.18 for significance, and rho=0.05 for clarity). Overall, the results suggest that in some contexts ChatGPT can produce weak pre-publication quality assessments. However, the effectiveness of these assessments, and the optimal strategies for employing them, vary considerably across platforms, journals, and conferences. Additionally, the most suitable inputs for ChatGPT appear to differ depending on the platform.
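The score-averaging procedure described in the abstract is straightforward to reproduce in outline. The sketch below is a minimal illustration, not the authors' pipeline: the OpenAI Python client, the model name, the prompt wording, and the 1-10 scale are assumptions for illustration only; the paper's actual prompts follow each venue's reviewer guidelines.

```python
# Minimal sketch (not the authors' code) of the score-averaging approach:
# query ChatGPT repeatedly for a quality score based only on a paper's title
# and abstract, average the replies, and correlate the averages with peer
# review outcomes via Spearman's rho.
# Assumptions: the client, model name, prompt wording, and 1-10 scale are
# illustrative placeholders, not taken from the paper.

import re
from statistics import mean

from openai import OpenAI          # pip install openai
from scipy.stats import spearmanr  # pip install scipy

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def average_quality_score(title: str, abstract: str, n_repeats: int = 30) -> float:
    """Return the mean of n_repeats ChatGPT quality scores for one paper."""
    prompt = (
        "Acting as a peer reviewer, rate the overall quality of this submission "
        "from 1 (reject) to 10 (outstanding). Reply with a single number only.\n\n"
        f"Title: {title}\n\nAbstract: {abstract}"
    )
    scores = []
    for _ in range(n_repeats):
        reply = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        content = reply.choices[0].message.content or ""
        match = re.search(r"\d+(\.\d+)?", content)
        if match:
            scores.append(float(match.group()))
    return mean(scores)

# Example: correlate averaged predictions with a venue's review outcomes.
# predictions = [average_quality_score(t, a) for t, a in submissions]
# rho, p_value = spearmanr(predictions, review_outcomes)
```

Averaging repeated queries is what makes the method more robust: individual ChatGPT scores vary from run to run, so the mean of 30 predictions gives a much less noisy estimate to correlate against review outcomes.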
Related papers
- Evaluating Research Quality with Large Language Models: An Analysis of ChatGPT's Effectiveness with Different Settings and Inputs [3.9627148816681284]
This article assesses which ChatGPT inputs produce better quality score estimates.
The optimal input is the article title and abstract, with average ChatGPT scores based on these correlating at 0.67 with human scores.
arXiv Detail & Related papers (2024-08-13T09:19:21Z) - Is ChatGPT Transforming Academics' Writing Style? [0.0]
Based on one million arXiv papers submitted from May 2018 to January 2024, we assess the textual density of ChatGPT's writing style in their abstracts.
We find that large language models (LLMs), represented by ChatGPT, are having an increasing impact on arXiv abstracts.
arXiv Detail & Related papers (2024-04-12T17:41:05Z) - Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews [51.453135368388686]
We present an approach for estimating the fraction of text in a large corpus that is likely to be substantially modified or produced by a large language model (LLM).
Our maximum likelihood model leverages expert-written and AI-generated reference texts to accurately and efficiently examine real-world LLM-use at the corpus level.
arXiv Detail & Related papers (2024-03-11T21:51:39Z) - Can ChatGPT evaluate research quality? [3.9627148816681284]
ChatGPT-4 can produce plausible document summaries and quality evaluation rationales that match REF criteria.
Overall, ChatGPT does not yet seem to be accurate enough to be trusted for any formal or informal research quality evaluation tasks.
arXiv Detail & Related papers (2024-02-08T10:00:40Z) - Eval-GCSC: A New Metric for Evaluating ChatGPT's Performance in Chinese Spelling Correction [60.32771192285546]
ChatGPT has demonstrated impressive performance in various downstream tasks.
In the Chinese Spelling Correction (CSC) task, we observe a discrepancy: while ChatGPT performs well under human evaluation, it scores poorly according to traditional metrics.
This paper proposes a new evaluation metric: Eval-GCSC. By incorporating word-level and semantic similarity judgments, it relaxes the stringent length and phonics constraints.
arXiv Detail & Related papers (2023-11-14T14:56:33Z) - Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings [63.35165397320137]
This study investigates the consistency of feedback ratings generated by OpenAI's GPT-4.
The model rated responses to tasks within the Higher Education subject domain of macroeconomics in terms of their content and style.
arXiv Detail & Related papers (2023-08-03T12:47:17Z) - Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study [115.96080028033904]
We study a scalable pre-trained retrieval-augmented LM (i.e., RETRO) compared with standard GPT and retrieval-augmented GPT.
Our findings highlight the promising direction of pretraining autoregressive LMs with retrieval as future foundation models.
arXiv Detail & Related papers (2023-04-13T18:04:19Z) - Reducing Variance in Temporal-Difference Value Estimation via Ensemble of Deep Networks [109.59988683444986]
MeanQ is a simple ensemble method that estimates target values as ensemble means.
We show that MeanQ achieves remarkable sample efficiency in experiments on the Atari Learning Environment benchmark.
arXiv Detail & Related papers (2022-09-16T01:47:36Z) - A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric, "dR@n,IoU@m", which discounts the basic recall scores to counteract the inflated evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z) - Out-of-Vocabulary Entities in Link Prediction [1.9036571490366496]
Link prediction is often used as a proxy to evaluate the quality of embeddings.
As benchmarks are crucial for the fair comparison of algorithms, ensuring their quality is tantamount to providing a solid ground for developing better solutions.
We provide an implementation of an approach for spotting and removing such entities and provide corrected versions of the datasets WN18RR, FB15K-237, and YAGO3-10.
arXiv Detail & Related papers (2021-05-26T12:58:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.