MTQ-Eval: Multilingual Text Quality Evaluation for Language Models
- URL: http://arxiv.org/abs/2511.09374v1
- Date: Thu, 13 Nov 2025 01:50:28 GMT
- Title: MTQ-Eval: Multilingual Text Quality Evaluation for Language Models
- Authors: Rhitabrat Pokharel, Ameeta Agrawal
- Abstract summary: MTQ-Eval is a novel framework for multilingual text quality evaluation. It learns from examples of both high- and low-quality texts, adjusting its internal representations. Our comprehensive evaluation across 115 languages demonstrates the improved performance of the proposed model.
- Score: 4.239775815863115
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The use of large language models (LLMs) for evaluating outputs is becoming an increasingly effective and scalable approach. However, it remains uncertain whether this capability extends beyond task-specific evaluations to more general assessments of text quality, particularly in multilingual contexts. In this study, we introduce MTQ-Eval, a novel framework for multilingual text quality evaluation that learns from examples of both high- and low-quality texts, adjusting its internal representations. To develop MTQ-Eval, we first automatically generate text quality preference data and then use it to train open-source base LLMs to align with ratings of high- and low-quality text. Our comprehensive evaluation across 115 languages demonstrates the improved performance of the proposed model. Upon further analysis, we find that this enhanced evaluation capability also leads to notable improvements in downstream tasks.
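The abstract does not detail how the preference data is constructed, so the sketch below is only one plausible reading: pair each high-quality text with a synthetically degraded variant and store the two as a chosen/rejected record for preference tuning. Every function and field name here is an illustrative assumption, not the authors' released code.

```python
# Hypothetical sketch of building text-quality preference pairs, NOT the
# authors' pipeline: degrade a high-quality sentence to create its
# low-quality counterpart and emit a DPO-style training record.
import random

def degrade(text: str, drop_rate: float = 0.15, seed: int = 0) -> str:
    """Create a low-quality variant by randomly dropping and duplicating words."""
    rng = random.Random(seed)
    noisy = []
    for word in text.split():
        r = rng.random()
        if r < drop_rate:
            continue            # simulate omissions
        noisy.append(word)
        if r > 1.0 - drop_rate:
            noisy.append(word)  # simulate disfluent repetition
    return " ".join(noisy)

def make_preference_record(high_quality: str) -> dict:
    # "chosen"/"rejected" follow the usual preference-tuning convention.
    return {
        "prompt": "Rate the quality of the following text.",
        "chosen": high_quality,
        "rejected": degrade(high_quality),
    }

print(make_preference_record("The committee approved the proposal after a brief discussion."))
```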
Related papers
- Evaluating Extremely Low-Resource Machine Translation: A Comparative Study of ChrF++ and BLEU Metrics [69.2321983942375]
This work presents a comparative analysis of BLEU, an n-gram-based metric, and ChrF++, a character-based metric, for MT evaluation in ELRL settings. We examine how each metric responds to translation artifacts, including hallucinations, repetition, source-text copying, and diacritic (matra) variations across three ELRLs: Magahi, Bhojpuri, and Chhattisgarhi. While recent work often relies solely on ChrF++, our findings show that BLEU, despite its lower absolute scores, provides complementary lexical-precision insights that improve interpretability.
arXiv Detail & Related papers (2026-02-19T14:56:42Z)
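As a concrete companion to this comparison, here is a minimal sketch that scores the same toy hypotheses with both metrics via the sacrebleu library; the sentences are invented, and the choice of sacrebleu is our assumption, not something the paper specifies.

```python
# Score identical hypotheses with word-level BLEU and character-level chrF++.
from sacrebleu.metrics import BLEU, CHRF

hypotheses = ["the cat sat on the mat", "he read the book"]
references = [["the cat is sitting on the mat", "he has read the book"]]

bleu = BLEU()
chrf_pp = CHRF(word_order=2)  # word_order=2 yields chrF++ rather than plain chrF

print(bleu.corpus_score(hypotheses, references))     # n-gram precision view
print(chrf_pp.corpus_score(hypotheses, references))  # character F-score view
```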
- PerQ: Efficient Evaluation of Multilingual Text Personalization Quality [3.0156689030741]
Since no metrics are available to evaluate specific aspects of a text, such as its personalization quality, researchers often rely solely on large language models to meta-evaluate such texts. In this paper, PerQ, a computationally efficient method for evaluating the personalization quality of a given text (generated by a language model), is introduced.
arXiv Detail & Related papers (2025-09-30T07:48:14Z)
- Multi-Dimensional Machine Translation Evaluation: Model Evaluation and Resource for Korean [7.843029855730508]
We develop a 1200-sentence MQM evaluation benchmark for the language pair English-Korean.
We find that the reference-free setup outperforms its counterpart in the style dimension.
Overall, RemBERT emerges as the most promising model.
arXiv Detail & Related papers (2024-03-19T12:02:38Z)
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
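One common way to make distribution-level precision and recall concrete for text is k-NN support estimation over sentence embeddings; the sketch below follows that recipe with toy random vectors standing in for real embeddings, and the paper's exact estimator may well differ.

```python
# k-NN based precision/recall between two embedding distributions.
import numpy as np

def knn_radii(points: np.ndarray, k: int) -> np.ndarray:
    """Distance from each point to its k-th nearest neighbour within `points`."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, k]  # column 0 is the point itself (distance 0)

def support_coverage(queries: np.ndarray, support: np.ndarray, k: int = 3) -> float:
    """Fraction of `queries` falling inside the k-NN balls of `support`."""
    radii = knn_radii(support, k)
    d = np.linalg.norm(queries[:, None, :] - support[None, :, :], axis=-1)
    return float(np.mean((d <= radii[None, :]).any(axis=1)))

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 8))        # embeddings of human text (toy)
gen = rng.normal(size=(200, 8)) * 0.8   # embeddings of generated text (toy)

precision = support_coverage(gen, real)  # quality: generations inside real support
recall = support_coverage(real, gen)     # diversity: real data inside generated support
print(f"precision={precision:.2f} recall={recall:.2f}")
```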
- Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study [63.27346930921658]
ChatGPT is capable of evaluating text quality effectively from various perspectives without a reference. The Explicit Score, which utilizes ChatGPT to generate a numeric score measuring text quality, is the most effective and reliable method among the three explored approaches.
arXiv Detail & Related papers (2023-04-03T05:29:58Z)
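A minimal sketch of an Explicit Score style query, assuming the openai Python client; the prompt wording and model name are illustrative stand-ins, not the prompts used in the study.

```python
# Ask an LLM for a bare numeric quality rating of a piece of text.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def explicit_score(text: str, model: str = "gpt-4o-mini") -> str:
    prompt = (
        "Evaluate the overall quality of the following text on a scale "
        "from 1 (very poor) to 10 (excellent). Reply with the number only.\n\n"
        f"Text: {text}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

print(explicit_score("Their is many reason why this sentence read badly."))
```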
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture the above dimensions.
We propose a new LLM-based evaluation framework that provides comprehensive evaluation by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
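A hedged sketch of the role-player idea: ask a model to score the same summary from several annotator personas and aggregate; the personas, prompt, and model name are assumptions rather than the paper's protocol.

```python
# Score one summary from multiple evaluator personas, then aggregate.
from statistics import mean
from openai import OpenAI

client = OpenAI()
ROLES = ["a strict copy editor", "a domain expert", "a casual reader"]  # invented personas

def role_scores(document: str, summary: str, model: str = "gpt-4o-mini") -> list[float]:
    scores = []
    for role in ROLES:
        prompt = (
            f"You are {role}. Rate how well the summary captures the document "
            "on a 1-10 scale. Reply with the number only.\n\n"
            f"Document: {document}\n\nSummary: {summary}"
        )
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        scores.append(float(resp.choices[0].message.content.strip()))
    return scores

s = role_scores("The city council voted 7-2 to fund new bike lanes downtown.",
                "Council approves bike-lane funding.")
print(s, "mean:", mean(s))
```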
- Measuring Uncertainty in Translation Quality Evaluation (TQE) [62.997667081978825]
This work investigates how to correctly estimate confidence intervals [Brown et al., 2001] depending on the sample size of the translated text.
The methodology we applied for this work draws on Bernoulli Statistical Distribution Modelling (BSDM) and Monte Carlo Sampling Analysis (MCSA).
arXiv Detail & Related papers (2021-11-15T12:09:08Z)
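Under the simplest reading of the abstract, each translated segment is a Bernoulli pass/fail trial; the sketch below computes a Wilson score interval (the estimator recommended by Brown et al., 2001) and cross-checks it with Monte Carlo resampling. The paper's exact formulation may differ.

```python
# Analytic vs simulated confidence intervals for a pass-rate proportion.
import math
import random

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a Bernoulli proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

def monte_carlo_interval(successes: int, n: int, draws: int = 10_000, seed: int = 0):
    """Parametric bootstrap: resample n Bernoulli trials, take the 2.5/97.5 percentiles."""
    rng = random.Random(seed)
    p = successes / n
    samples = sorted(sum(rng.random() < p for _ in range(n)) / n for _ in range(draws))
    return samples[int(0.025 * draws)], samples[int(0.975 * draws)]

print(wilson_interval(80, 100))       # analytic interval
print(monte_carlo_interval(80, 100))  # simulated interval, should roughly agree
```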
- Translation Quality Assessment: A Brief Survey on Manual and Automatic Methods [9.210509295803243]
We present a high-level and concise survey of translation quality assessment (TQA) methods, including both manual judgement criteria and automated evaluation metrics.
We hope that this work will be an asset for both translation model researchers and quality assessment researchers.
arXiv Detail & Related papers (2021-05-05T18:28:10Z)
- TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing [73.16475763422446]
We propose TextFlint, a multilingual robustness evaluation platform for NLP tasks.
It incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis.
TextFlint generates complete analytical reports as well as targeted augmented data to address shortcomings in a model's robustness.
arXiv Detail & Related papers (2021-03-21T17:20:38Z)
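To make the idea of a universal text transformation concrete, here is a small illustrative perturbation of the kind such a toolkit applies; note this does not use TextFlint's actual API, it only shows the perturb-then-re-evaluate pattern.

```python
# Perturb input text with keyboard-adjacent typos for robustness probing.
import random

ADJACENT = {"a": "qs", "e": "wr", "i": "uo", "o": "ip", "t": "ry"}  # toy adjacency map

def keyboard_typos(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap some characters for keyboard-adjacent ones to simulate typos."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in ADJACENT and rng.random() < rate:
            out.append(rng.choice(ADJACENT[ch]))
        else:
            out.append(ch)
    return "".join(out)

original = "the quick brown fox jumps over the lazy dog"
perturbed = keyboard_typos(original)
print(perturbed)
# A robustness report would compare model predictions on `original` vs
# `perturbed` and flag large accuracy drops.
```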