Can Small and Reasoning Large Language Models Score Journal Articles for Research Quality and Do Averaging and Few-shot Help?
- URL: http://arxiv.org/abs/2510.22389v1
- Date: Sat, 25 Oct 2025 18:12:41 GMT
- Title: Can Small and Reasoning Large Language Models Score Journal Articles for Research Quality and Do Averaging and Few-shot Help?
- Authors: Mike Thelwall, Ehsan Mohammadi
- Abstract summary: It is unclear whether smaller LLMs and reasoning models have similar abilities. This is important because larger models may be slow and impractical in some situations, and reasoning models may perform differently. Four relevant questions are addressed with Gemma3 variants, Llama4 Scout, Qwen3, Magistral Small and DeepSeek R1. Results suggest that smaller (open weights) and reasoning LLMs have similar performance to ChatGPT 4o-mini and Gemini 2.0 Flash.
- Score: 3.920564895363768
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Assessing published academic journal articles is a common task in evaluations of departments and individuals. Whilst it is sometimes supported by citation data, Large Language Models (LLMs) may give more useful indications of article quality. Evidence of this capability exists for two of the largest LLM families, ChatGPT and Gemini, and the medium-sized LLM Gemma3 27b, but it is unclear whether smaller LLMs and reasoning models have similar abilities. This is important because larger models may be slow and impractical in some situations, and reasoning models may perform differently. Four relevant questions are addressed with Gemma3 variants, Llama4 Scout, Qwen3, Magistral Small and DeepSeek R1, on a dataset of 2,780 medical, health and life science papers in 6 fields, with two different gold standards, one of them novel. The results suggest that smaller (open weights) and reasoning LLMs have similar performance to ChatGPT 4o-mini and Gemini 2.0 Flash, but that 1b parameters may often, and 4b sometimes, be too few. Moreover, averaging scores from multiple identical queries seems to be a universally successful strategy, and few-shot prompts (four examples) tended to help, although the evidence was equivocal. Reasoning models did not have a clear advantage. Overall, the results show, for the first time, that smaller LLMs above 4b parameters, including reasoning models, have a substantial capability to score journal articles for research quality, especially if score averaging is used.
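The score-averaging strategy described in the abstract is simple to implement: the same scoring prompt is sent to the model several times and the extracted scores are averaged, smoothing out the run-to-run variability of stochastic LLM output. The sketch below illustrates the idea under stated assumptions and is not the authors' code: `query_model` is a hypothetical stand-in for whichever LLM is being tested, the prompt wording, the 1-4 quality scale, and the default repeat count are placeholders, and a real few-shot preamble would contain the four scored example articles mentioned in the abstract.

```python
import random
import re
import statistics

# Hypothetical few-shot preamble: the paper prepends four scored example
# articles, but their exact wording is not given, so this is a placeholder.
FEW_SHOT_PREAMBLE = (
    "Here are four example articles with research quality scores (1-4):\n"
    "[four worked examples would go here]\n\n"
)

def query_model(prompt: str) -> str:
    """Stand-in for a real LLM call (e.g. a locally hosted Gemma3 or Qwen3
    endpoint). Returns a canned reply so the sketch runs end to end."""
    return f"Overall quality score: {random.choice([2, 3, 3, 4])}"

def score_article(title: str, abstract: str,
                  n_repeats: int = 15, few_shot: bool = False) -> float:
    """Send the same scoring prompt n_repeats times and average the
    extracted scores (the score-averaging strategy; n_repeats is an
    arbitrary choice here, not the paper's setting)."""
    prompt = (FEW_SHOT_PREAMBLE if few_shot else "") + (
        "Rate the research quality of the following article on a 1-4 "
        "scale (4 = world-leading).\n"
        f"Title: {title}\nAbstract: {abstract}\n"
        "Reply in the form 'Overall quality score: N'."
    )
    scores = []
    for _ in range(n_repeats):
        reply = query_model(prompt)
        match = re.search(r"score:\s*([1-4])", reply, re.IGNORECASE)
        if match:
            scores.append(int(match.group(1)))
    # Raises statistics.StatisticsError if no score could be parsed.
    return statistics.mean(scores)

if __name__ == "__main__":
    print(score_article("Example title", "Example abstract ..."))
```

Replacing `query_model` with a call to a locally hosted small model is the only change needed to try the general protocol with any of the LLMs named above.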
Related papers
- The Riddle of Reflection: Evaluating Reasoning and Self-Awareness in Multilingual LLMs using Indian Riddles [1.0732935873226022]
This paper examines the reasoning and self-assessment abilities of LLMs across seven major Indian languages. We introduce a multilingual riddle dataset combining traditional riddles with context-reconstructed variants. We evaluate five LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, Mistral-Saba, LLaMA 4 Scout, and LLaMA 4 Maverick) under seven prompting strategies.
arXiv Detail & Related papers (2025-11-02T14:40:36Z) - Let's Use ChatGPT To Write Our Paper! Benchmarking LLMs To Write the Introduction of a Research Paper [64.50822834679101]
SciIG is a task that evaluates LLMs' ability to produce coherent introductions from titles, abstracts, and related works. We assess five state-of-the-art models, including open-source (DeepSeek-v3, Gemma-3-12B, LLaMA 4-Maverick, MistralAI Small 3.1) and closed-source GPT-4o systems. Results demonstrate LLaMA-4 Maverick's superior performance on most metrics, particularly in semantic similarity and faithfulness.
arXiv Detail & Related papers (2025-08-19T21:11:11Z) - Can Smaller Large Language Models Evaluate Research Quality? [3.9627148816681284]
This article assesses Google's Gemma-3-27b-it, a downloadable LLM (60 GB). The results for 104,187 articles show that Gemma-3-27b-it scores correlate positively with an expert research quality score proxy for all 34 Units of Assessment (broad fields) from the UK Research Excellence Framework 2021.
arXiv Detail & Related papers (2025-08-10T06:18:40Z) - An Empirical Study of Many-to-Many Summarization with Large Language Models [82.10000188179168]
Large language models (LLMs) have shown strong multi-lingual abilities, giving them the potential to perform many-to-many summarization (M2MS) in real applications. This work presents a systematic empirical study on LLMs' M2MS ability.
arXiv Detail & Related papers (2025-05-19T11:18:54Z) - Investigating Retrieval-Augmented Generation in Quranic Studies: A Study of 13 Open-Source Large Language Models [0.18846515534317265]
General-purpose large language models (LLMs) often struggle with hallucinations, where generated responses deviate from authoritative sources. This challenge highlights the need for systems that can integrate domain-specific knowledge while maintaining response accuracy, relevance, and faithfulness. This research utilizes a descriptive dataset of Quranic surahs including the meanings, historical context, and qualities of the 114 surahs. The models are evaluated using three key metrics set by human evaluators: context relevance, answer faithfulness, and answer relevance.
arXiv Detail & Related papers (2025-03-20T13:26:30Z) - Large Language Models as Misleading Assistants in Conversation [8.557086720583802]
We investigate the ability of Large Language Models (LLMs) to be deceptive in the context of providing assistance on a reading comprehension task.
We compare outcomes when the model is (1) prompted to provide truthful assistance, (2) prompted to be subtly misleading, and (3) prompted to argue for an incorrect answer.
arXiv Detail & Related papers (2024-07-16T14:45:22Z) - Attribute or Abstain: Large Language Models as Long Document Assistants [58.32043134560244]
LLMs can help humans working with long documents, but are known to hallucinate.
Existing approaches to attribution have only been evaluated in RAG settings, where the initial retrieval confounds LLM performance.
This is crucially different from the long document setting, where retrieval is not needed, but could help.
We present LAB, a benchmark of 6 diverse long document tasks with attribution, and experiments with different approaches to attribution on 5 LLMs of different sizes.
arXiv Detail & Related papers (2024-07-10T16:16:02Z) - Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models [56.02275285521847]
We propose to evaluate models using a Panel of LLM evaluators (PoLL).
We find that using a PoLL composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias due to its composition of disjoint model families, and does so while being over seven times less expensive.
arXiv Detail & Related papers (2024-04-29T15:33:23Z) - The Larger the Better? Improved LLM Code-Generation via Budget Reallocation [32.0844209512788]
It is a common belief that large language models (LLMs) are better than smaller ones.
This raises the question: what happens when both models operate under the same budget?
We analyze code generation LLMs of various sizes and make comparisons such as running a 70B model once vs. generating five outputs from a 13B model.
arXiv Detail & Related papers (2024-03-31T15:55:49Z) - Can Large Language Models Automatically Score Proficiency of Written Essays? [3.993602109661159]
Large Language Models (LLMs) are transformer-based models that demonstrate extraordinary capabilities on various tasks.
We test the ability of LLMs, given their powerful linguistic knowledge, to analyze and effectively score written essays.
arXiv Detail & Related papers (2024-03-10T09:39:00Z) - How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts [54.07541591018305]
We present MAD-Bench, a benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, object counts, and spatial relationships.
We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4v, Reka, and Gemini-Pro to open-source models such as LLaVA-NeXT and MiniCPM-Llama3.
While GPT-4o achieves 82.82% accuracy on MAD-Bench, the accuracy of every other model in our experiments ranges from 9% to 50%.
arXiv Detail & Related papers (2024-02-20T18:31:27Z) - Specializing Smaller Language Models towards Multi-Step Reasoning [56.78474185485288]
We show that multi-step reasoning abilities can be distilled down from GPT-3.5 (≥ 175B) to T5 variants (≤ 11B).
We propose model specialization, which focuses the model's ability on a target task.
arXiv Detail & Related papers (2023-01-30T08:51:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.