Bridging the LLM Accessibility Divide? Performance, Fairness, and Cost of Closed versus Open LLMs for Automated Essay Scoring
- URL: http://arxiv.org/abs/2503.11827v1
- Date: Fri, 14 Mar 2025 19:34:40 GMT
- Title: Bridging the LLM Accessibility Divide? Performance, Fairness, and Cost of Closed versus Open LLMs for Automated Essay Scoring
- Authors: Kezia Oketch, John P. Lalor, Yi Yang, Ahmed Abbasi
- Abstract summary: We compare nine leading large language models (LLMs) across text assessment and generation tasks related to automated essay scoring. Our findings reveal that for few-shot learning-based assessment of human-generated essays, open LLMs such as Llama 3 and Qwen2.5 perform comparably to GPT-4 in terms of predictive performance. For generative tasks, we find that essays generated by top open LLMs are comparable to closed LLMs in terms of their semantic composition/embeddings and ML-assessed scores.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Closed large language models (LLMs) such as GPT-4 have set state-of-the-art results across a number of NLP tasks and have become central to NLP and machine learning (ML)-driven solutions. Closed LLMs' performance and wide adoption have sparked considerable debate about their accessibility in terms of availability, cost, and transparency. In this study, we perform a rigorous comparative analysis of nine leading LLMs, spanning closed, open, and open-source LLM ecosystems, across text assessment and generation tasks related to automated essay scoring. Our findings reveal that for few-shot learning-based assessment of human-generated essays, open LLMs such as Llama 3 and Qwen2.5 perform comparably to GPT-4 in terms of predictive performance, with no significant differences in disparate impact scores when considering age- or race-related fairness. Moreover, Llama 3 offers a substantial cost advantage, being up to 37 times more cost-efficient than GPT-4. For generative tasks, we find that essays generated by top open LLMs are comparable to closed LLMs in terms of their semantic composition/embeddings and ML-assessed scores. Our findings challenge the dominance of closed LLMs and highlight the democratizing potential of open LLMs, suggesting they can effectively bridge accessibility divides while maintaining competitive performance and fairness.
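To make the few-shot assessment setup concrete, below is a minimal sketch of scoring an essay with an open LLM through the Hugging Face transformers pipeline. The checkpoint name, prompt template, and 1-6 score scale are illustrative assumptions, not the authors' exact configuration.

```python
# A minimal sketch of few-shot essay scoring with an open LLM.
# The model checkpoint, prompt template, and 1-6 scale are illustrative
# assumptions; the paper's exact prompts and settings may differ.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed open checkpoint
)

PROMPT = """You are an essay rater. Score each essay from 1 (poor) to 6 (excellent).

Essay: {shot_essay}
Score: {shot_score}

Essay: {essay}
Score:"""

def score_essay(essay: str, shot_essay: str, shot_score: int) -> str:
    prompt = PROMPT.format(shot_essay=shot_essay, shot_score=shot_score, essay=essay)
    out = generator(prompt, max_new_tokens=2, do_sample=False)
    # The pipeline returns the prompt plus the completion; keep only the score.
    return out[0]["generated_text"][len(prompt):].strip()
```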
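The fairness claim centers on disparate impact: the ratio of favorable-outcome rates between two demographic groups, where a ratio near 1.0 indicates parity. A minimal sketch, assuming a binary high-score threshold and made-up group labels (the paper's exact thresholds and group definitions may differ):

```python
# A minimal sketch of a disparate impact calculation over predicted essay
# scores. The binary threshold and group labels are illustrative assumptions.
import numpy as np

def disparate_impact(scores, groups, protected, reference, threshold):
    scores, groups = np.asarray(scores), np.asarray(groups)
    favorable = scores >= threshold  # "high score" as the favorable outcome
    rate_protected = favorable[groups == protected].mean()
    rate_reference = favorable[groups == reference].mean()
    return rate_protected / rate_reference  # 1.0 = parity

# Toy example: a ratio near 1.0 suggests no disparate impact between groups.
print(disparate_impact(
    scores=[4, 5, 3, 6, 2, 5],
    groups=["young", "old", "young", "old", "young", "old"],
    protected="old", reference="young", threshold=4,
))
```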
Related papers
- LLM2: Let Large Language Models Harness System 2 Reasoning
Large language models (LLMs) have exhibited impressive capabilities across a myriad of tasks, yet they occasionally yield undesirable outputs.
We introduce LLM2, a novel framework that combines an LLM with a process-based verifier.
The LLM is responsible for generating plausible candidates, while the verifier provides timely process-based feedback to distinguish desirable from undesirable outputs.
arXiv Detail & Related papers (2024-12-29T06:32:36Z)
- GIVE: Structured Reasoning of Large Language Models with Knowledge Graph Inspired Veracity Extrapolation
Graph Inspired Veracity Extrapolation (GIVE) is a novel reasoning method that merges parametric and non-parametric memories to improve accurate reasoning with minimal external input.
GIVE guides the LLM agent to select the most pertinent expert data (observe), engage in query-specific divergent thinking (reflect), and then synthesize this information to produce the final output (speak).
arXiv Detail & Related papers (2024-10-11T03:05:06Z)
- Beyond ChatGPT: Enhancing Software Quality Assurance Tasks with Diverse LLMs and Validation Techniques
This paper investigates the capabilities of several Large Language Models (LLMs) across two SQA tasks: fault localization and vulnerability detection.
By implementing a voting mechanism to combine the LLMs' results, we achieved more than a 10% improvement over GPT-3.5 in both tasks.
This approach led to performance improvements of 16% in fault localization and 12% in vulnerability detection compared to GPT-3.5, and a 4% improvement over the best-performing individual LLM.
arXiv Detail & Related papers (2024-09-02T07:26:19Z)
- Beyond Numeric Awards: In-Context Dueling Bandits with LLM Agents
This paper is the first to investigate Large Language Models (LLMs) as in-context decision-makers under the problem of Dueling Bandits (DB). We compare GPT-3.5 Turbo, GPT-4, GPT-4 Turbo, Llama 3.1, and o1-Preview against nine well-established DB algorithms. We show that our top-performing LLM, GPT-4 Turbo, has the zero-shot relative decision-making ability to achieve surprisingly low weak regret.
arXiv Detail & Related papers (2024-07-02T02:18:14Z)
- DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation
Large Language Models (LLMs) are scalable and economical evaluators.
How reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- Fairness in Large Language Models: A Taxonomic Survey
Large Language Models (LLMs) have demonstrated remarkable success across various domains.
Despite their promising performance in numerous real-world applications, most of these algorithms lack fairness considerations.
arXiv Detail & Related papers (2024-03-31T22:22:53Z)
- FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z)
- Identifying Factual Inconsistencies in Summaries: Grounding LLM Inference via Task Taxonomy
Factual inconsistencies pose a significant hurdle to faithful summarization by generative models.
We consolidate key error types of inconsistent facts in summaries, and incorporate them to facilitate both the zero-shot and supervised paradigms of LLMs.
arXiv Detail & Related papers (2024-02-20T08:41:23Z)
- Enabling Weak LLMs to Judge Response Reliability via Meta Ranking
We propose a novel cross-query-comparison-based method called Meta Ranking (MR).
MR assesses reliability by pairwise ranking of the target query-response pair against multiple reference query-response pairs.
We show that MR can enhance strong LLMs' performance in two practical applications: model cascading and instruction tuning.
arXiv Detail & Related papers (2024-02-19T13:57:55Z)
- Benchmarking LLMs via Uncertainty Quantification
The proliferation of open-source Large Language Models (LLMs) has highlighted the urgent need for comprehensive evaluation methods.
We introduce a new benchmarking approach for LLMs that integrates uncertainty quantification.
Our findings reveal that: I) LLMs with higher accuracy may exhibit lower certainty; II) Larger-scale LLMs may display greater uncertainty compared to their smaller counterparts; and III) Instruction-finetuning tends to increase the uncertainty of LLMs.
arXiv Detail & Related papers (2024-01-23T14:29:17Z)
- A Comparative Analysis of Fine-Tuned LLMs and Few-Shot Learning of LLMs for Financial Sentiment Analysis
We employ two approaches: in-context learning and fine-tuning LLMs on a finance-domain dataset.
Our results demonstrate that fine-tuned smaller LLMs can achieve comparable performance to state-of-the-art fine-tuned LLMs.
We observe no performance gain for finance-domain sentiment analysis when the number of in-context learning shots is increased.
arXiv Detail & Related papers (2023-12-14T08:13:28Z)
- Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity
This survey addresses the crucial issue of factuality in Large Language Models (LLMs).
As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital.
arXiv Detail & Related papers (2023-10-11T14:18:03Z)