DETECT: Determining Ease and Textual Clarity of German Text Simplifications
- URL: http://arxiv.org/abs/2510.22212v1
- Date: Sat, 25 Oct 2025 08:20:18 GMT
- Title: DETECT: Determining Ease and Textual Clarity of German Text Simplifications
- Authors: Maria Korobeynikova, Alessia Battisti, Lukas Fischer, Yingqiang Gao,
- Abstract summary: DETECT is the first German-specific metric that holistically evaluates ATS quality across all three dimensions of simplicity, meaning preservation, and fluency. We construct the largest German human evaluation dataset for text simplification to validate our metric directly.
- Score: 4.005744004522348
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current evaluation of German automatic text simplification (ATS) relies on general-purpose metrics such as SARI, BLEU, and BERTScore, which insufficiently capture simplification quality in terms of simplicity, meaning preservation, and fluency. While specialized metrics like LENS have been developed for English, corresponding efforts for German have lagged behind due to the absence of human-annotated corpora. To close this gap, we introduce DETECT, the first German-specific metric that holistically evaluates ATS quality across all three dimensions of simplicity, meaning preservation, and fluency, and is trained entirely on synthetic large language model (LLM) responses. Our approach adapts the LENS framework to German and extends it with (i) a pipeline for generating synthetic quality scores via LLMs, enabling dataset creation without human annotation, and (ii) an LLM-based refinement step for aligning grading criteria with simplification requirements. To the best of our knowledge, we also construct the largest German human evaluation dataset for text simplification to validate our metric directly. Experimental results show that DETECT achieves substantially higher correlations with human judgments than widely used ATS metrics, with particularly strong gains in meaning preservation and fluency. Beyond ATS, our findings highlight both the potential and the limitations of LLMs for automatic evaluation and provide transferable guidelines for general language accessibility tasks.
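The validation described here, comparing DETECT's scores against human judgments, is typically reported as a rank correlation. A minimal, stdlib-only sketch of that computation follows; the score values are invented for illustration, not figures from the paper:

```python
# Sketch: validating an automatic ATS metric against human judgments via
# Spearman rank correlation (Pearson correlation of the ranks).
# Example scores below are illustrative, not taken from the paper.

def rank(values):
    """Assign 1-based average ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average position of the tied block, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation between two score lists of equal length."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

human = [4.5, 3.0, 1.5, 5.0, 2.0]        # human quality ratings
metric = [0.82, 0.55, 0.30, 0.90, 0.35]  # automatic metric scores
print(round(spearman(human, metric), 3))  # prints 1.0 (same ranking)
```

A higher correlation means the metric orders system outputs the way human raters do, which is the criterion on which DETECT is reported to outperform SARI, BLEU, and BERTScore.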
Related papers
- Domain-Adaptation through Synthetic Data: Fine-Tuning Large Language Models for German Law [4.0979083977801105]
Large language models (LLMs) often struggle in specialized domains such as legal reasoning due to limited expert knowledge. This paper presents an effective method for adapting advanced LLMs to German legal question answering through a novel synthetic data generation approach.
arXiv Detail & Related papers (2026-01-20T17:11:51Z) - Introducing TrGLUE and SentiTurca: A Comprehensive Benchmark for Turkish General Language Understanding and Sentiment Analysis [4.061135251278187]
TrGLUE is a benchmark for evaluating natural language understanding in Turkish. We also present SentiTurca, a specialized benchmark for sentiment analysis. TrGLUE comprises Turkish-native corpora curated to mirror the domains and task formulations of GLUE-style evaluations.
arXiv Detail & Related papers (2025-12-26T18:02:09Z) - A Critical Study of Automatic Evaluation in Sign Language Translation [17.083206782232185]
It remains unclear to what extent text-based metrics can reliably capture the quality of sign language translation (SLT) outputs. We analyze six metrics, including BLEU, chrF, and ROUGE, as well as BLEURT on the one hand, and large language model (LLM)-based evaluators such as G-Eval and GEMBA zero-shot direct assessment on the other.
arXiv Detail & Related papers (2025-10-29T11:57:03Z) - Inclusive Easy-to-Read Generation for Individuals with Cognitive Impairments [2.1481398044731574]
We introduce ETR-fr, the first dataset for ETR text generation fully compliant with European ETR guidelines. We implement parameter-efficient fine-tuning on PLMs and LLMs to establish generative baselines. Overall results show that PLMs perform comparably to LLMs and adapt effectively to out-of-domain texts.
arXiv Detail & Related papers (2025-10-01T09:13:18Z) - Language Bottleneck Models: A Framework for Interpretable Knowledge Tracing and Beyond [55.984684518346924]
We recast Knowledge Tracing as an inverse problem: learning the minimum natural-language summary that makes past answers explainable and future answers predictable. Our Language Bottleneck Model (LBM) consists of an encoder LLM that writes an interpretable knowledge summary and a frozen decoder LLM that must reconstruct and predict student responses using only that summary text. Experiments on synthetic arithmetic benchmarks and the large-scale Eedi dataset show that LBMs rival the accuracy of state-of-the-art KT and direct LLM methods while requiring orders-of-magnitude fewer student trajectories.
arXiv Detail & Related papers (2025-06-20T13:21:14Z) - CATER: Leveraging LLM to Pioneer a Multidimensional, Reference-Independent Paradigm in Translation Quality Evaluation [0.0]
Comprehensive AI-assisted Translation Edit Ratio (CATER) is a novel framework for evaluating machine translation (MT) quality. It uses large language models (LLMs) via a carefully designed prompt-based protocol.
arXiv Detail & Related papers (2024-12-15T17:45:34Z) - Towards Understanding the Robustness of LLM-based Evaluations under Perturbations [9.944512689015998]
Large Language Models (LLMs) can serve as automatic evaluators for non-standardized metrics in summarization and dialog-based tasks. We conduct experiments across multiple prompting strategies to examine how LLMs fare as quality evaluators when compared with human judgments.
arXiv Detail & Related papers (2024-12-12T13:31:58Z) - RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Models (LLMs) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z) - Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [56.31117605097345]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation. Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process. AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
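The idea behind attribute-structured evaluation can be sketched as decomposing one overall judgment into per-attribute checks that are then aggregated. The attributes and the keyword-based checker below are hypothetical stand-ins for the LLM-based judge used in the paper:

```python
# Hedged sketch of Attribute Structuring (AS): score each clinically relevant
# attribute separately, then aggregate. The attribute list and the simple
# phrase-matching checker are invented for illustration; the paper uses an
# LLM to perform each per-attribute judgment.

ATTRIBUTES = {
    "diagnosis": ["pneumonia"],
    "medication": ["antibiotics"],
    "follow_up": ["follow-up", "two weeks"],
}

def attribute_scores(summary: str) -> dict:
    """Score each attribute 1.0 if any of its key phrases appears verbatim."""
    text = summary.lower()
    return {
        attr: float(any(kw in text for kw in keywords))
        for attr, keywords in ATTRIBUTES.items()
    }

def overall(summary: str) -> float:
    """Aggregate per-attribute scores into a single score by averaging."""
    scores = attribute_scores(summary)
    return sum(scores.values()) / len(scores)

summary = "Patient diagnosed with pneumonia; started antibiotics, follow-up in two weeks."
print(attribute_scores(summary))
print(overall(summary))  # prints 1.0 (all attributes covered)
```

The per-attribute breakdown is what makes this style of evaluation easier to align with human annotations than a single holistic score.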
arXiv Detail & Related papers (2024-03-01T21:59:03Z) - Improving Translation Faithfulness of Large Language Models via Augmenting Instructions [89.76691340615848]
We propose SWIE (Segment-Weighted Instruction Embedding) and an instruction-following dataset OVERMISS.
SWIE improves the model instruction understanding by adding a global instruction representation on the following input and response representations.
OVERMISS improves model faithfulness by comparing over-translation and miss-translation results with the correct translation.
arXiv Detail & Related papers (2023-08-24T09:32:29Z) - Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
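The provenance idea behind self-verification is that every extracted value should be traceable to a span of the source note. In the paper this check is performed by the LLM itself; in the sketch below a simple substring check stands in for it, and the clinical note and extractions are invented examples:

```python
# Hedged sketch of provenance checking for clinical information extraction:
# an extracted value counts as grounded only if it occurs verbatim in the
# source note. A real self-verification step would ask the LLM to confirm
# (or revise) each field instead of this substring test.

def verify_provenance(source: str, extractions: dict) -> dict:
    """Mark each extracted value as grounded if it appears in the source."""
    src = source.lower()
    return {field: value.lower() in src for field, value in extractions.items()}

note = "72-year-old male with hypertension, prescribed lisinopril 10 mg daily."
extracted = {
    "age": "72",
    "condition": "hypertension",
    "medication": "lisinopril",
    "dosage": "20 mg",  # deliberately wrong: not supported by the note
}
print(verify_provenance(note, extracted))
# {'age': True, 'condition': True, 'medication': True, 'dosage': False}
```

Flagging the unsupported `dosage` field is exactly the kind of error this style of verification is meant to surface before a human reviews the output.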
arXiv Detail & Related papers (2023-05-30T22:05:11Z) - INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback [80.57617091714448]
We present InstructScore, an explainable evaluation metric for text generation.
We fine-tune a text evaluation metric based on LLaMA, producing a score for generated text and a human readable diagnostic report.
arXiv Detail & Related papers (2023-05-23T17:27:22Z) - Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not yet ready to replace human evaluators.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.