Related papers: Adapting AlignScore Mertic for Factual Consistency Evaluation of Text in Russian: A Student Abstract

URL: http://arxiv.org/abs/2512.06586v1
Date: Sat, 06 Dec 2025 22:44:51 GMT
Title: Adapting AlignScore Mertic for Factual Consistency Evaluation of Text in Russian: A Student Abstract
Authors: Mikhail Zimin, Milyausha Shamsutdinova, Georgii Andriushchenko,
Abstract summary: We introduce AlignRuScore, a comprehensive adaptation of the AlignScore metric for Russian. We fine-tuned a RuBERT-based alignment model with task-specific classification and regression heads on Russian and translated English datasets.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Ensuring factual consistency in generated text is crucial for reliable natural language processing applications. However, there is a lack of evaluation tools for factual consistency in Russian texts, as existing tools primarily focus on English corpora. To bridge this gap, we introduce AlignRuScore, a comprehensive adaptation of the AlignScore metric for Russian. To adapt the metric, we fine-tuned a RuBERT-based alignment model with task-specific classification and regression heads on Russian and translated English datasets. Our results demonstrate that a unified alignment metric can be successfully ported to Russian, laying the groundwork for robust multilingual factual consistency evaluation. We release the translated corpora, model checkpoints, and code to support further research.

Related papers

Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation [57.11989521509119]
We propose a novel agentic translation evaluation framework, centered on a reflective Core Agent that invokes specialized sub-agents. Experimental results indicate the efficacy of RATE, achieving an improvement of at least 3.2 meta score compared with current metrics.
arXiv Detail & Related papers (2026-01-12T09:03:42Z)
Hacking Neural Evaluation Metrics with Single Hub Text [6.572810068286891]
We propose a method for finding a single adversarial text in the discrete space that is consistently evaluated as high-quality. The method achieves 79.1 COMET% and 67.8 COMET% in the WMT'24 English-to-Japanese (En--Ja) and English-to-German (En--De) translation tasks, respectively. We also confirmed that the hub text found with our method generalizes across multiple language pairs such as Ja--En and De--En.
arXiv Detail & Related papers (2025-12-18T09:06:24Z)
Is Context Helpful for Chat Translation Evaluation? [23.440392979857247]
We conduct a meta-evaluation of existing sentence-level automatic metrics to assess the quality of machine-translated chats. We find that reference-free metrics lag behind reference-based ones, especially when evaluating translation quality in out-of-English settings. We propose a new evaluation metric, Context-MQM, that utilizes bilingual context with a large language model.
arXiv Detail & Related papers (2024-03-13T07:49:50Z)
Cross-lingual Cross-temporal Summarization: Dataset, Models, Evaluation [20.675242617417677]
Cross-lingual cross-temporal summarization (CLCTS) is a largely unexplored area that has the potential to improve cross-cultural accessibility and understanding. This paper comprehensively addresses the CLCTS task, including dataset creation, modeling, and evaluation.
arXiv Detail & Related papers (2023-06-22T14:31:18Z)
ICE-Score: Instructing Large Language Models to Evaluate Code [7.556444391696562]
We propose ICE-Score, a new evaluation metric via instructing large language models for code assessments. Our metric addresses the limitations of existing approaches by achieving superior correlations with functional correctness and human preferences. Our results demonstrate that our metric surpasses state-of-the-art metrics for code generation.
arXiv Detail & Related papers (2023-04-27T16:38:17Z)
Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models [57.80514758695275]
Using large language models (LLMs) for assessing the quality of machine translation (MT) achieves state-of-the-art performance at the system level. We propose a new prompting method called Error Analysis Prompting (EAPrompt). This technique emulates the commonly accepted human evaluation framework - Multidimensional Quality Metrics (MQM) - and produces explainable and reliable MT evaluations at both the system and segment level.
arXiv Detail & Related papers (2023-03-24T05:05:03Z)
SESCORE2: Learning Text Generation Evaluation via Synthesizing Realistic Mistakes [93.19166902594168]
We propose SESCORE2, a self-supervised approach for training a model-based metric for text generation evaluation. The key concept is to synthesize realistic model mistakes by perturbing sentences retrieved from a corpus. We evaluate SESCORE2 and previous methods on four text generation tasks across three languages.
arXiv Detail & Related papers (2022-12-19T09:02:16Z)
Russian SuperGLUE 1.1: Revising the Lessons not Learned by Russian NLP models [53.95094814056337]
This paper presents Russian SuperGLUE 1.1, an updated benchmark styled after GLUE for Russian NLP models. The new version includes a number of technical, user experience and methodological improvements. We provide the integration of Russian SuperGLUE with a framework for industrial evaluation of the open-source models, MOROCCO.
arXiv Detail & Related papers (2022-02-15T23:45:30Z)
Methods for Detoxification of Texts for the Russian Language [55.337471467610094]
We introduce the first study of automatic detoxification of Russian texts to combat offensive language. We test two types of models: an unsupervised approach that performs local corrections and a supervised approach based on a pretrained GPT-2 language model. The results show that the tested approaches can be successfully used for detoxification, although there is room for improvement.
arXiv Detail & Related papers (2021-05-19T10:37:44Z)
TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing [73.16475763422446]
We propose a multilingual robustness evaluation platform for NLP tasks (TextFlint). It incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis. TextFlint generates complete analytical reports as well as targeted augmented data to address the shortcomings of the model's robustness.
arXiv Detail & Related papers (2021-03-21T17:20:38Z)
RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark [5.258267224004844]
We introduce an advanced Russian general language understanding evaluation benchmark -- RussianGLUE. For the first time, a benchmark of nine tasks, collected and organized analogously to the SuperGLUE methodology, was developed from scratch for the Russian language.
arXiv Detail & Related papers (2020-10-29T20:31:39Z)
GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics. Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation. It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
