Can Large Language Models Outperform Non-Experts in Poetry Evaluation? A Comparative Study Using the Consensual Assessment Technique
- URL: http://arxiv.org/abs/2502.19064v2
- Date: Sat, 04 Oct 2025 09:24:24 GMT
- Title: Can Large Language Models Outperform Non-Experts in Poetry Evaluation? A Comparative Study Using the Consensual Assessment Technique
- Authors: Piotr Sawicki, Marek Grześ, Dan Brown, Fabrício Góes
- Abstract summary: This study adapts the Consensual Assessment Technique (CAT) for Large Language Models (LLMs). We demonstrate that this approach allows LLMs to significantly surpass the performance of non-expert human judges.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study adapts the Consensual Assessment Technique (CAT) for Large Language Models (LLMs), introducing a novel methodology for poetry evaluation. Using a 90-poem dataset with a ground truth based on publication venue, we demonstrate that this approach allows LLMs to significantly surpass the performance of non-expert human judges. Our method, which leverages forced-choice ranking within small, randomized batches, enabled Claude-3-Opus to achieve a Spearman's Rank Correlation of 0.87 with the ground truth, dramatically outperforming the best human non-expert evaluation (SRC = 0.38). The LLM assessments also exhibited high inter-rater reliability, underscoring the methodology's robustness. These findings establish that LLMs, when guided by a comparative framework, can be effective and reliable tools for assessing poetry, paving the way for their broader application in other creative domains.
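For intuition, here is a minimal sketch of the forced-choice, batched ranking protocol described in the abstract, scored against a ground-truth ordering with Spearman's rank correlation. The `rank_batch` function is a hypothetical placeholder for the LLM call; the paper's actual prompts, batch size, and aggregation scheme may differ.

```python
# Minimal sketch of forced-choice ranking in small randomized batches.
# `rank_batch` is a hypothetical stand-in for the LLM call.
import random
from scipy.stats import spearmanr

def rank_batch(batch):
    """Placeholder: ask an LLM to rank a batch of poems, best first."""
    raise NotImplementedError("call your LLM of choice here")

def cat_scores(poems, batch_size=5, rounds=10, seed=0):
    rng = random.Random(seed)
    points = {p: 0 for p in poems}
    for _ in range(rounds):
        shuffled = list(poems)
        rng.shuffle(shuffled)
        for i in range(0, len(shuffled), batch_size):
            batch = shuffled[i:i + batch_size]
            for rank, poem in enumerate(rank_batch(batch)):
                # Forced choice: a poem ranked above k batch-mates earns k points.
                points[poem] += len(batch) - 1 - rank
    return points

def agreement(scores, ground_truth):
    """Spearman's rank correlation between model scores and the ground truth."""
    poems = sorted(scores)
    rho, _ = spearmanr([scores[p] for p in poems],
                       [ground_truth[p] for p in poems])
    return rho
```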
Related papers
- AllSummedUp: un framework open-source pour comparer les métriques d'évaluation de résumé (an open-source framework for comparing summary-evaluation metrics) [2.2153783542347805]
This paper investigates challenges in automatic text summarization evaluation. Based on experiments conducted across six representative metrics, we highlight significant discrepancies between reported performances in the literature and those observed in our experimental setting. We introduce a unified, open-source framework, applied to the SummEval dataset and designed to support fair and transparent comparison of evaluation metrics.
arXiv Detail & Related papers (2025-08-29T08:05:00Z)
- Evaluating the Evaluators: Are readability metrics good measures of readability? [36.138020084479784]
Plain Language Summarization (PLS) aims to distill complex documents into accessible summaries for non-expert audiences. Traditional readability metrics, such as Flesch-Kincaid Grade Level (FKGL), have not been compared to human readability judgments in PLS. We show that Language Models (LMs) are better judges of readability, with the best-performing model achieving a Pearson correlation of 0.56 with human judgments.
arXiv Detail & Related papers (2025-08-26T17:38:42Z)
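As context for the FKGL baseline mentioned above, here is a rough sketch of the formula with a naive syllable counter; the `fkgl` helper and its heuristics are illustrative, not the paper's implementation.

```python
# Rough sketch of the Flesch-Kincaid Grade Level (FKGL) formula:
# 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59.
# The syllable counter is a naive vowel-group heuristic; real
# implementations (e.g. the textstat package) are more careful.
import re

def count_syllables(word):
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

# Agreement with human judgments would then be e.g.
# scipy.stats.pearsonr(fkgl_scores, human_scores).
print(round(fkgl("The cat sat on the mat. It was happy."), 2))
```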
- Beyond "Not Novel Enough": Enriching Scholarly Critique with LLM-Assisted Feedback [81.0031690510116]
We present a structured approach for automated novelty evaluation that models expert reviewer behavior through three stages. Our method is informed by a large-scale analysis of human-written novelty reviews. Evaluated on 182 ICLR 2025 submissions, the approach achieves 86.5% alignment with human reasoning and 75.3% agreement on novelty conclusions.
arXiv Detail & Related papers (2025-08-14T16:18:37Z)
- Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons [13.187011661009459]
Large Language Models (LLMs) have been shown to be effective evaluators across various domains. We present Knockout Assessment, an LLM-as-a-Judge method using a knockout tournament system with iterative pairwise comparisons.
arXiv Detail & Related papers (2025-06-04T09:46:43Z)
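A minimal sketch of the knockout-tournament idea, assuming a hypothetical `judge` placeholder for the LLM pairwise comparison; seeding and bye handling here are simplifications, not the paper's specification.

```python
# Sketch of a knockout tournament over candidate outputs.
# `judge` is a hypothetical stand-in for an LLM pairwise comparison.
import random

def judge(a, b):
    """Placeholder: return whichever of the pair an LLM prefers."""
    raise NotImplementedError("ask an LLM which of a, b is better")

def knockout(candidates, seed=0):
    rng = random.Random(seed)
    pool = list(candidates)
    rng.shuffle(pool)
    while len(pool) > 1:
        nxt = []
        # Pair off; an odd candidate out gets a bye to the next round.
        for i in range(0, len(pool) - 1, 2):
            nxt.append(judge(pool[i], pool[i + 1]))
        if len(pool) % 2:
            nxt.append(pool[-1])
        pool = nxt
    return pool[0]  # tournament winner
```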
- LLMs Are Not Scorers: Rethinking MT Evaluation with Generation-Based Methods [0.0]
We propose a generation-based evaluation paradigm that leverages decoder-only language models to produce high-quality references. Empirical results show that our method outperforms both intra-LLM direct scoring baselines and external non-LLM reference-free metrics from MTME.
arXiv Detail & Related papers (2025-05-22T02:14:38Z)
- LLMs can Perform Multi-Dimensional Analytic Writing Assessments: A Case Study of L2 Graduate-Level Academic English Writing [10.239220270988136]
We use a corpus of literature reviews written by L2 graduate students and assessed by human experts against 9 analytic criteria.
To evaluate the quality of feedback comments, we apply a novel feedback comment quality evaluation framework.
We find that LLMs can generate reasonably good and generally reliable multi-dimensional analytic assessments.
arXiv Detail & Related papers (2025-02-17T02:31:56Z)
- Are LLMs Good Literature Review Writers? Evaluating the Literature Review Writing Ability of Large Language Models [2.048226951354646]
We propose a framework to assess the literature review writing ability of large language models automatically. We evaluate the performance of LLMs across three tasks: generating references, writing abstracts, and writing literature reviews.
arXiv Detail & Related papers (2024-12-18T08:42:25Z)
- Language Model Preference Evaluation with Multiple Weak Evaluators [89.90733463933431]
We introduce PGED, a novel approach that leverages multiple model-based evaluators to construct preference graphs, and then ensembles and denoises these graphs to produce acyclic, non-contradictory evaluation results. We demonstrate PGED's superiority in three applications: 1) model ranking for evaluation, 2) response selection for test-time scaling, and 3) data selection for model fine-tuning.
arXiv Detail & Related papers (2024-10-14T01:57:25Z)
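A toy illustration of the preference-graph idea behind PGED: pool pairwise votes from several evaluators into weighted edges, keep the majority direction per pair, break residual cycles, and read off a ranking. The greedy cycle-breaking below is a stand-in for the paper's denoising step, not its actual algorithm.

```python
# Toy preference-graph aggregation: pool pairwise votes from several
# evaluators, keep the net majority direction per pair, greedily drop the
# weakest contradictory edges until acyclic, then topologically sort.
import networkx as nx

def aggregate(votes):
    """votes: iterable of (winner, loser) pairs from all evaluators."""
    weight = {}
    for w, l in votes:
        weight[(w, l)] = weight.get((w, l), 0) + 1
    g = nx.DiGraph()
    for (w, l), c in weight.items():
        # Ensemble step: keep only the net majority direction per pair.
        if c > weight.get((l, w), 0):
            g.add_edge(w, l, weight=c - weight.get((l, w), 0))
    while not nx.is_directed_acyclic_graph(g):
        cycle = nx.find_cycle(g)
        u, v = min(cycle, key=lambda e: g[e[0]][e[1]]["weight"])[:2]
        g.remove_edge(u, v)  # denoise: drop the weakest edge in the cycle
    return list(nx.topological_sort(g))  # best-to-worst ordering

print(aggregate([("a", "b"), ("a", "b"), ("b", "c"), ("b", "a"), ("a", "c")]))
```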
- Large Language Models for Classical Chinese Poetry Translation: Benchmarking, Evaluating, and Improving [43.148203559785095]
Large language models (LLMs) with impressive multilingual capabilities may bring a ray of hope to meet this extreme translation demand. This paper first introduces a suitable benchmark (PoetMT), in which each Chinese poem has a recognized elegant translation. We propose a new metric based on GPT-4 to evaluate the extent to which current LLMs can meet these demands.
arXiv Detail & Related papers (2024-08-19T12:34:31Z)
- A Comparative Study of Quality Evaluation Methods for Text Summarization [0.5512295869673147]
This paper proposes a novel method based on large language models (LLMs) for evaluating text summarization.
Our results show that LLM evaluation aligns closely with human evaluation, while widely used automatic metrics such as ROUGE-2, BERTScore, and SummaC do not, and also lack consistency.
arXiv Detail & Related papers (2024-06-30T16:12:37Z)
- Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets [3.0040661953201475]
Large language models (LLMs) can now generate and recognize poetry.
We develop a task to evaluate how well LLMs recognize one aspect of English-language poetry.
We show that state-of-the-art LLMs can successfully identify both common and uncommon fixed poetic forms.
arXiv Detail & Related papers (2024-06-27T05:36:53Z)
- Evaluating LLMs for Quotation Attribution in Literary Texts: A Case Study of LLaMa3 [11.259583037191772]
We evaluate the ability of Llama-3 to attribute utterances of direct speech to their speakers in novels.
The LLM shows impressive results on a corpus of 28 novels, surpassing published results with ChatGPT and encoder-based baselines by a large margin.
arXiv Detail & Related papers (2024-06-17T09:56:46Z)
- DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
How reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators [48.54465599914978]
Large Language Models (LLMs) have demonstrated promising capabilities as automatic evaluators in assessing the quality of generated natural language. However, LLMs still exhibit biases in evaluation and often struggle to generate coherent evaluations that align with human assessments. We introduce Pairwise-preference Search (PAIRS), an uncertainty-guided search-based rank aggregation method that employs LLMs to conduct pairwise comparisons locally and efficiently ranks candidate texts globally.
arXiv Detail & Related papers (2024-03-25T17:11:28Z)
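PAIRS itself guides the search by comparison uncertainty; as a simpler cousin of the same idea, a merge sort driven by a hypothetical LLM comparator `prefer` already turns local pairwise judgments into a global ranking in O(n log n) comparisons.

```python
# Merge sort driven by pairwise LLM judgments: local comparisons yield a
# global ranking. PAIRS additionally guides the search by comparison
# uncertainty; `prefer` is a hypothetical placeholder judge.
def prefer(a, b):
    """Placeholder: True if an LLM judges `a` at least as good as `b`."""
    raise NotImplementedError

def rank(items):
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left, right = rank(items[:mid]), rank(items[mid:])
    merged = []
    while left and right:
        # Take the preferred head of the two sorted halves.
        merged.append(left.pop(0) if prefer(left[0], right[0]) else right.pop(0))
    return merged + left + right
```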
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvements in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z)
- Art or Artifice? Large Language Models and the False Promise of Creativity [53.04834589006685]
We propose the Torrance Test of Creative Writing (TTCW) to evaluate creativity as a product.
TTCW consists of 14 binary tests organized into the original dimensions of Fluency, Flexibility, Originality, and Elaboration.
Our analysis shows that LLM-generated stories pass 3-10X fewer TTCW tests than stories written by professionals.
arXiv Detail & Related papers (2023-09-25T22:02:46Z)
- L-Eval: Instituting Standardized Evaluation for Long Context Language Models [91.05820785008527]
We propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs).
We build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs.
Results show that popular n-gram matching metrics generally cannot correlate well with human judgment.
arXiv Detail & Related papers (2023-07-20T17:59:41Z)
- Document-Level Machine Translation with Large Language Models [91.03359121149595]
Large language models (LLMs) can produce coherent, cohesive, relevant, and fluent answers for various natural language processing (NLP) tasks.
This paper provides an in-depth evaluation of LLMs' ability on discourse modeling.
arXiv Detail & Related papers (2023-04-05T03:49:06Z)
- Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization, but the reasons behind their successes are poorly understood.
We find that instruction tuning, not model size, is the key to LLMs' zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.