Predicting article quality scores with machine learning: The UK Research
Excellence Framework
- URL: http://arxiv.org/abs/2212.05415v1
- Date: Sun, 11 Dec 2022 05:45:12 GMT
- Title: Predicting article quality scores with machine learning: The UK Research
Excellence Framework
- Authors: Mike Thelwall, Kayvan Kousha, Mahshid Abdoli, Emma Stuart, Meiko
Makita, Paul Wilson, Jonathan Levitt, Petr Knoth, Matteo Cancellieri
- Abstract summary: Accuracy is highest in the medical and physical sciences Units of Assessment (UoAs) and economics.
Prediction accuracies above the baseline for the social science, mathematics, engineering, arts, and humanities UoAs were much lower or close to zero.
We increased accuracy with an active learning strategy and by selecting articles with higher prediction probabilities, as estimated by the algorithms, but this substantially reduced the number of scores predicted.
- Score: 6.582887504429817
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: National research evaluation initiatives and incentive schemes have
previously chosen between simplistic quantitative indicators and time-consuming
peer review, sometimes supported by bibliometrics. Here we assess whether
artificial intelligence (AI) could provide a third alternative, estimating
article quality using multiple bibliometric and metadata inputs. We
investigated this using provisional three-level REF2021 peer review scores for
84,966 articles submitted to the UK Research Excellence Framework 2021,
matching a Scopus record 2014-18 and with a substantial abstract. We found that
accuracy is highest in the medical and physical sciences Units of Assessment
(UoAs) and economics, reaching 42% above the baseline (72% overall) in the best
case. This is based on 1000 bibliometric inputs and half of the articles used
for training in each UoA. Prediction accuracies above the baseline for the
social science, mathematics, engineering, arts, and humanities UoAs were much
lower or close to zero. The Random Forest Classifier (standard or ordinal) and
Extreme Gradient Boosting Classifier algorithms performed best from the 32
tested. Accuracy was lower if UoAs were merged or replaced by Scopus broad
categories. We increased accuracy with an active learning strategy and by
selecting articles with higher prediction probabilities, as estimated by the
algorithms, but this substantially reduced the number of scores predicted.
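As a rough sketch of the kind of pipeline described above (not the authors' code), the Python snippet below trains a Random Forest classifier on placeholder data standing in for the roughly 1000 bibliometric inputs, compares it with a majority-class baseline, and keeps only the articles whose top predicted probability clears a threshold, mirroring the selective step. The data, the threshold, and the model settings are illustrative assumptions.

```python
# A rough sketch, not the authors' pipeline: three-level quality prediction from
# bibliometric features with a Random Forest, a majority-class baseline, and a
# selective step that keeps only high-probability articles.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Placeholder data standing in for articles x ~1000 bibliometric/metadata inputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 1000))
signal = X[:, 0] + 0.5 * X[:, 1]                 # pretend two inputs carry signal
y = np.digitize(signal, bins=[-0.5, 0.5])        # provisional scores 0, 1, 2

# Half of the articles in each UoA were used for training in the paper.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=0)

# XGBClassifier (xgboost package) could be swapped in as the other top performer.
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Baseline: always predict the most common score level in the training data.
majority = np.bincount(y_train).argmax()
baseline_acc = accuracy_score(y_test, np.full_like(y_test, majority))

# Keep only articles whose top class probability clears an illustrative threshold.
proba = clf.predict_proba(X_test)
confident = proba.max(axis=1) >= 0.5
acc = accuracy_score(y_test[confident], proba.argmax(axis=1)[confident])
print(f"baseline {baseline_acc:.2f}, accuracy on {confident.sum()} confident articles {acc:.2f}")
```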
Related papers
- Analysis of the ICML 2023 Ranking Data: Can Authors' Opinions of Their Own Papers Assist Peer Review in Machine Learning? [52.00419656272129]
We conducted an experiment during the 2023 International Conference on Machine Learning (ICML).
We received 1,342 rankings, each from a distinct author, pertaining to 2,592 submissions.
We focus on the Isotonic Mechanism, which calibrates raw review scores using author-provided rankings.
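A minimal sketch of the calibration idea, assuming the mechanism amounts to an order-constrained (isotonic) projection of raw scores onto the author-provided ranking; the paper's exact procedure may differ.

```python
# Sketch of the isotonic-projection idea (assumed details, not the authors' code).
import numpy as np
from sklearn.isotonic import IsotonicRegression

raw_scores = np.array([6.0, 7.5, 5.0, 8.0])   # mean review scores for 4 submissions
author_rank = np.array([2, 1, 4, 3])          # author-provided ranking (1 = best)

# Calibrated scores must be non-increasing in the author's rank order; scores that
# violate the stated ranking are pooled to a common value.
calibrated = IsotonicRegression(increasing=False).fit_transform(author_rank, raw_scores)
print(calibrated)                              # [7.0, 7.5, 5.0, 7.0]
```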
arXiv Detail & Related papers (2024-08-24T01:51:23Z) - Evaluating Research Quality with Large Language Models: An Analysis of ChatGPT's Effectiveness with Different Settings and Inputs [3.9627148816681284]
This article assesses which ChatGPT inputs produce better quality score estimates.
The optimal input is the article title and abstract, with average ChatGPT scores based on these correlating at 0.67 with human scores.
arXiv Detail & Related papers (2024-08-13T09:19:21Z) - New Directions in Text Classification Research: Maximizing The Performance of Sentiment Classification from Limited Data [0.0]
A benchmark dataset is provided for training and testing data on the issue of Kaesang Pangarep's appointment as Chairman of PSI.
The official metric is the F1-score, which balances precision and recall across the three classes: positive, negative, and neutral.
Both scores (baseline and optimized) use the SVM method, which is widely reported as state of the art among conventional machine learning methods.
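For illustration only, a generic scikit-learn pipeline of this kind might look like the sketch below; the texts, labels, and settings are toy placeholders rather than the benchmark data or the paper's tuned configuration.

```python
# Illustrative sketch: 3-class sentiment classification with a linear SVM,
# scored with macro-averaged F1 (toy data, not the benchmark dataset).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score

train_texts = [
    "a very good decision", "strong support for the appointment",
    "a bad and rushed appointment", "this is a poor decision",
    "no strong opinion either way", "waiting to see what happens",
]
train_labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
test_texts = ["good decision overall", "a poor appointment", "no opinion on this"]
test_labels = ["positive", "negative", "neutral"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)
pred = model.predict(test_texts)

# Macro F1 averages per-class F1 over positive, negative, and neutral equally.
print(f1_score(test_labels, pred, average="macro", labels=["positive", "negative", "neutral"]))
```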
arXiv Detail & Related papers (2024-07-08T05:42:29Z) - Regularization-Based Methods for Ordinal Quantification [49.606912965922504]
We study the ordinal case, i.e., the case in which a total order is defined on the set of n>2 classes.
We propose a novel class of regularized OQ algorithms, which outperforms existing algorithms in our experiments.
arXiv Detail & Related papers (2023-10-13T16:04:06Z) - ASPEST: Bridging the Gap Between Active Learning and Selective
Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain.
Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples.
In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
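A very simplified sketch of how the two ideas combine, using a logistic-regression stand-in and synthetic data; ASPEST's actual method is more involved than this.

```python
# Toy combination of selective prediction (abstain when uncertain) and active
# learning (query the least confident target samples); illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labeled, y_labeled = rng.normal(size=(100, 5)), rng.integers(0, 2, 100)
X_target = rng.normal(size=(1000, 5))           # unlabeled (possibly shifted) target data

clf = LogisticRegression().fit(X_labeled, y_labeled)
confidence = clf.predict_proba(X_target).max(axis=1)

# Selective prediction: only predict where the model is confident.
accept = confidence >= 0.8
print(f"predict on {accept.sum()} samples, abstain on {(~accept).sum()}")

# Active learning: query labels for the most uncertain target samples.
query_idx = np.argsort(confidence)[:10]
print("query for labeling:", query_idx)
```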
arXiv Detail & Related papers (2023-04-07T23:51:07Z) - A Gold Standard Dataset for the Reviewer Assignment Problem [117.59690218507565]
"Similarity score" is a numerical estimate of the expertise of a reviewer in reviewing a paper.
Our dataset consists of 477 self-reported expertise scores provided by 58 researchers.
For the task of ordering two papers in terms of their relevance for a reviewer, the error rates range from 12%-30% in easy cases to 36%-43% in hard cases.
arXiv Detail & Related papers (2023-03-23T16:15:03Z) - Parametric Classification for Generalized Category Discovery: A Baseline
Study [70.73212959385387]
Generalized Category Discovery (GCD) aims to discover novel categories in unlabelled datasets using knowledge learned from labelled samples.
We investigate the failure of parametric classifiers, verify the effectiveness of previous design choices when high-quality supervision is available, and identify unreliable pseudo-labels as a key problem.
We propose a simple yet effective parametric classification method that benefits from entropy regularisation, achieves state-of-the-art performance on multiple GCD benchmarks and shows strong robustness to unknown class numbers.
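As a toy illustration of what an entropy regulariser can look like (an assumed form, not necessarily the paper's exact loss term), the snippet below computes the negative entropy of the batch-mean prediction; adding this term to the training loss pushes the average prediction towards uniform and discourages collapse onto a few categories.

```python
# Assumed form of an entropy regulariser on mean predictions (illustrative only).
import numpy as np

def entropy_regulariser(probs):
    """Negative entropy of the batch-mean class distribution; lower is more uniform."""
    mean = probs.mean(axis=0)                       # average prediction over the batch
    return np.sum(mean * np.log(mean + 1e-12))      # -H(mean), minimised by uniform predictions

probs = np.array([[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]])  # batch of softmax outputs
print(entropy_regulariser(probs))
```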
arXiv Detail & Related papers (2022-11-21T18:47:11Z) - Out-of-Vocabulary Entities in Link Prediction [1.9036571490366496]
Link prediction is often used as a proxy to evaluate the quality of embeddings.
As benchmarks are crucial for the fair comparison of algorithms, ensuring their quality is tantamount to providing a solid ground for developing better solutions.
We provide an implementation of an approach for spotting and removing such entities and provide corrected versions of the datasets WN18RR, FB15K-237, and YAGO3-10.
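The core idea can be sketched in a few lines: an entity is out of vocabulary if it never appears in the training triples, and test triples containing such entities are removed. This is an illustrative simplification, not the paper's exact implementation.

```python
# Simple sketch of spotting and removing out-of-vocabulary entities in a test split.
def filter_oov_triples(train_triples, test_triples):
    """Drop test triples whose head or tail entity never appears in training."""
    seen = {e for h, _, t in train_triples for e in (h, t)}
    return [(h, r, t) for h, r, t in test_triples if h in seen and t in seen]

train = [("A", "likes", "B"), ("B", "knows", "C")]
test = [("A", "knows", "C"), ("D", "likes", "B")]      # "D" is out of vocabulary
print(filter_oov_triples(train, test))                  # [('A', 'knows', 'C')]
```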
arXiv Detail & Related papers (2021-05-26T12:58:18Z) - Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring
Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable. Even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.