PerQ: Efficient Evaluation of Multilingual Text Personalization Quality
- URL: http://arxiv.org/abs/2509.25903v1
- Date: Tue, 30 Sep 2025 07:48:14 GMT
- Title: PerQ: Efficient Evaluation of Multilingual Text Personalization Quality
- Authors: Dominik Macko, Andrew Pulver
- Abstract summary: Since no metrics are available to evaluate specific aspects of a text, such as its personalization quality, researchers often rely solely on large language models to meta-evaluate such texts. In this paper, a computationally efficient method for evaluating the personalization quality of a given text (generated by a language model), called PerQ, is introduced.
- Score: 3.0156689030741
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Since no metrics are available to evaluate specific aspects of a text, such as its personalization quality, researchers often rely solely on large language models to meta-evaluate such texts. Due to the internal biases of individual language models, it is recommended to use several of them for combined evaluation, which directly increases the cost of such meta-evaluation. In this paper, a computationally efficient method for evaluating the personalization quality of a given text (generated by a language model), called PerQ, is introduced. A case study comparing the generation capabilities of large and small language models demonstrates the usability of the proposed metric in research, effectively reducing the waste of resources.
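To make the cost argument concrete, below is a minimal sketch of the multi-judge meta-evaluation setup the abstract argues against, where each additional judge LLM adds another round of inference. The judge callables are hypothetical placeholders; this is the costly baseline, not the PerQ method itself.

```python
# Minimal sketch of combined multi-LLM meta-evaluation (the costly baseline).
# The judge callables are hypothetical placeholders, not part of PerQ.
from statistics import mean
from typing import Callable, List

def combined_llm_score(text: str, judges: List[Callable[[str], float]]) -> float:
    """Average the personalization-quality scores of several LLM judges.

    Inference cost grows linearly with len(judges); PerQ's goal is to
    replace this ensemble with one computationally efficient metric.
    """
    return mean(judge(text) for judge in judges)

# Hypothetical judges; in practice each would be a call to a different LLM.
judges = [lambda t: 0.7, lambda t: 0.8, lambda t: 0.6]
print(combined_llm_score("a personalized reply ...", judges))  # 0.7
```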
Related papers
- MTQ-Eval: Multilingual Text Quality Evaluation for Language Models [4.239775815863115]
MTQ-Eval is a novel framework for multilingual text quality evaluation. It learns from examples of both high- and low-quality texts, adjusting its internal representations. Our comprehensive evaluation across 115 languages demonstrates the improved performance of the proposed model.
arXiv Detail & Related papers (2025-11-12T14:42:23Z) - Objective Metrics for Evaluating Large Language Models Using External Data Sources [4.574672973076743]
This paper proposes a framework for leveraging objective metrics derived from class textual materials across different semesters. The framework emphasizes automation and transparency in scoring, reducing reliance on human interpretation. This method addresses the limitations of subjective evaluation methods, providing a scalable solution for performance assessment in educational, scientific, and other high-stakes domains.
arXiv Detail & Related papers (2025-08-01T02:24:19Z) - Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
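Distribution-level precision and recall for generative models are commonly instantiated via k-nearest-neighbor manifold estimates over embeddings; the paper's exact formulation may differ, but a self-contained sketch with random stand-in embeddings looks like this:

```python
# One common embedding-based instantiation of distribution-level precision and
# recall (k-NN manifold estimation); the paper's exact formulation may differ.
import numpy as np

def knn_radius(points: np.ndarray, k: int = 3) -> np.ndarray:
    """Distance from each point to its k-th nearest neighbor within the set."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, k]  # column 0 is the point itself (distance 0)

def precision_recall(real: np.ndarray, gen: np.ndarray, k: int = 3):
    """Precision: share of generated samples falling inside some real point's
    k-NN ball; recall: the symmetric quantity over the real samples."""
    r_real, r_gen = knn_radius(real, k), knn_radius(gen, k)
    d = np.linalg.norm(gen[:, None, :] - real[None, :, :], axis=-1)
    precision = np.mean((d <= r_real[None, :]).any(axis=1))
    recall = np.mean((d.T <= r_gen[None, :]).any(axis=1))
    return precision, recall

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 8))  # stand-ins for sentence embeddings
gen = rng.normal(size=(200, 8))
print(precision_recall(real, gen))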
arXiv Detail & Related papers (2024-02-16T13:53:26Z) - Automated Evaluation of Personalized Text Generation using Large Language Models [38.2211640679274]
We present AuPEL, a novel evaluation method that distills three major semantic aspects of the generated text (personalization, quality, and relevance) and automatically measures these aspects.
We find that, compared to existing evaluation metrics, AuPEL not only distinguishes and ranks models based on their personalization abilities more accurately, but also presents commendable consistency and efficiency for this task.
arXiv Detail & Related papers (2023-10-17T21:35:06Z) - Calibrating LLM-Based Evaluator [92.17397504834825]
We propose AutoCalibrate, a multi-stage, gradient-free approach to calibrate and align an LLM-based evaluator toward human preference.
Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels.
Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration.
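A gradient-free calibration loop in this spirit can be sketched as selecting, from a set of candidate scoring criteria, the one whose scores correlate best with a small pool of human labels. Here `score_with` is a hypothetical stand-in for prompting the LLM evaluator with a given criterion, not AutoCalibrate's actual interface:

```python
# Gradient-free calibration in the spirit of AutoCalibrate: among candidate
# scoring criteria, keep the one whose evaluator scores correlate best with a
# small pool of human labels. `score_with` is a hypothetical stand-in for
# prompting an LLM evaluator with a given criterion.
from scipy.stats import spearmanr

def calibrate(candidates, texts, human_labels, score_with):
    """Return the criterion whose scores best rank-correlate with humans."""
    best, best_rho = None, float("-inf")
    for criterion in candidates:
        scores = [score_with(criterion, text) for text in texts]
        rho, _ = spearmanr(scores, human_labels)
        if rho > best_rho:
            best, best_rho = criterion, rho
    return best, best_rho
```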
arXiv Detail & Related papers (2023-09-23T08:46:11Z) - Learning Evaluation Models from Large Language Models for Sequence Generation [61.8421748792555]
We propose a three-stage evaluation model training method that utilizes large language models to generate labeled data for model-based metric development. Experimental results on the SummEval benchmark demonstrate that CSEM can effectively train an evaluation model without human-labeled data.
arXiv Detail & Related papers (2023-08-08T16:41:16Z) - Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study [63.27346930921658]
ChatGPT is capable of evaluating text quality effectively from various perspectives without reference.
The Explicit Score, which utilizes ChatGPT to generate a numeric score measuring text quality, is the most effective and reliable method among the three exploited approaches.
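The Explicit Score approach boils down to asking a chat model for a number and parsing it from the reply. A minimal sketch follows; `llm` is a placeholder for any chat-completion call, and the prompt wording is illustrative rather than the paper's:

```python
# Sketch of the "Explicit Score" idea: ask a chat model for a numeric quality
# score and parse it from the reply. `llm` is a placeholder for any chat
# completion call; the prompt wording is illustrative, not the paper's.
import re
from typing import Callable

PROMPT = ("Score the quality of the following text on a scale of 1-100. "
          "Reply with the number only.\n\nText: {text}")

def explicit_score(text: str, llm: Callable[[str], str]) -> float:
    reply = llm(PROMPT.format(text=text))
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"no numeric score in reply: {reply!r}")
    return float(match.group())

print(explicit_score("An example paragraph.", lambda p: "Score: 87"))  # 87.0
```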
arXiv Detail & Related papers (2023-04-03T05:29:58Z) - Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, like BLEU/ROUGE, may not be able to adequately capture the above dimensions.
We propose a new LLM-based framework that provides a comprehensive evaluation by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z) - On the Usefulness of Embeddings, Clusters and Strings for Text Generator Evaluation [86.19634542434711]
Mauve measures an information-theoretic divergence between two probability distributions over strings.
We show that Mauve was right for the wrong reasons, and that its newly proposed divergence is not necessary for its high performance.
We conclude that -- by encoding syntactic- and coherence-level features of text, while ignoring surface-level features -- such cluster-based substitutes to string distributions may simply be better for evaluating state-of-the-art language generators.
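The cluster-based substitute the paper studies can be sketched as quantizing text embeddings with k-means and comparing the two corpora as histograms over clusters. The embeddings below are random stand-ins for a real sentence encoder, and Jensen-Shannon divergence is used as one reasonable choice of divergence:

```python
# Sketch of a cluster-based substitute for string distributions: quantize text
# embeddings with k-means, then compare the two corpora as cluster histograms.
# Embeddings here are random stand-ins for a real sentence encoder.
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.cluster import KMeans

def cluster_divergence(emb_p: np.ndarray, emb_q: np.ndarray, k: int = 16) -> float:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(
        np.vstack([emb_p, emb_q]))
    hist = lambda e: np.bincount(km.predict(e), minlength=k) + 1  # +1 smoothing
    p, q = (h / h.sum() for h in (hist(emb_p), hist(emb_q)))
    return jensenshannon(p, q) ** 2  # squared distance = JS divergence

rng = np.random.default_rng(0)
human = rng.normal(size=(300, 32))           # human-text embeddings (stand-in)
model = rng.normal(loc=0.3, size=(300, 32))  # model-text embeddings (stand-in)
print(cluster_divergence(human, model))
```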
arXiv Detail & Related papers (2022-05-31T17:58:49Z) - Language Model Evaluation in Open-ended Text Generation [0.76146285961466]
We study different evaluation metrics that have been proposed to evaluate the quality, diversity, and consistency of machine-generated text.
From there, we propose a practical pipeline to evaluate language models in open-ended generation tasks.
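The diversity side of such a pipeline is often measured with distinct-n, the ratio of unique n-grams to total n-grams across the generations; this is a minimal sketch of that one metric under whitespace tokenization, not the paper's full pipeline:

```python
# Minimal distinct-n sketch: unique n-grams divided by total n-grams across
# all generated samples (whitespace tokenization, assumed for illustration).
def distinct_n(texts: list[str], n: int = 2) -> float:
    ngrams = [
        tuple(tokens[i:i + n])
        for tokens in (t.split() for t in texts)
        for i in range(len(tokens) - n + 1)
    ]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

samples = ["the cat sat on the mat", "the cat sat on the hat"]
print(distinct_n(samples, n=2))  # 6 unique bigrams / 10 total = 0.6
```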
arXiv Detail & Related papers (2021-08-08T06:16:02Z) - Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale [52.663117551150954]
A few popular metrics remain the de facto standard for evaluating tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community for more careful consideration of how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z)