Related papers: Beyond a Single Reference: Training and Evaluation with Paraphrases in Sign Language Translation

Beyond a Single Reference: Training and Evaluation with Paraphrases in Sign Language Translation

URL: http://arxiv.org/abs/2601.21128v1
Date: Thu, 29 Jan 2026 00:02:19 GMT
Title: Beyond a Single Reference: Training and Evaluation with Paraphrases in Sign Language Translation
Authors: Václav Javorek, Tomáš Železný, Alessa Carbo, Marek Hrúz, Ivan Gruber,
Abstract summary: Most Sign Language Translation (SLT) corpora pair each signed utterance with a single written-language reference.<n>This limitation constrains both model training and evaluation.<n>We introduce BLEUpara, an extension of BLEU that evaluates translations against multiple paraphrased references.
Score: 1.9102169745315323
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Most Sign Language Translation (SLT) corpora pair each signed utterance with a single written-language reference, despite the highly non-isomorphic relationship between sign and spoken languages, where multiple translations can be equally valid. This limitation constrains both model training and evaluation, particularly for n-gram-based metrics such as BLEU. In this work, we investigate the use of Large Language Models to automatically generate paraphrased variants of written-language translations as synthetic alternative references for SLT. First, we compare multiple paraphrasing strategies and models using an adapted ParaScore metric. Second, we study the impact of paraphrases on both training and evaluation of the pose-based T5 model on the YouTubeASL and How2Sign datasets. Our results show that naively incorporating paraphrases during training does not improve translation performance and can even be detrimental. In contrast, using paraphrases during evaluation leads to higher automatic scores and better alignment with human judgments. To formalize this observation, we introduce BLEUpara, an extension of BLEU that evaluates translations against multiple paraphrased references. Human evaluation confirms that BLEUpara correlates more strongly with perceived translation quality. We release all generated paraphrases, generation and evaluation code to support reproducible and more reliable evaluation of SLT systems.

Related papers

Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages [0.22009842278462158]
Cross-lingual evaluation of large language models (LLMs) typically conflates two sources of variance: genuine model performance differences and measurement instability.<n>We investigate evaluation reliability by holding generation conditions constant while varying target language.<n>Our findings suggest that zero-shot judge transfer is unreliable for discourse-level assessment in morphologically rich languages.
arXiv Detail & Related papers (2026-02-02T16:27:32Z)
SiLVERScore: Semantically-Aware Embeddings for Sign Language Generation Evaluation [29.960223851833785]
We propose SiLVERScore, a semantically-aware embedding-based evaluation metric for sign language generation.<n>On PHOENIX-14T and CSL-Daily datasets, SiLVERScore achieves near-perfect discrimination between correct and random pairs.
arXiv Detail & Related papers (2025-09-04T00:58:43Z)
Child-Directed Language Does Not Consistently Boost Syntax Learning in Language Models [5.636296752147828]
We show that language models trained on English Child-Directed Language (CDL) reach similar syntactic abilities as LMs trained on larger amounts of adult-directed written text.<n>We test this by comparing models trained on CDL vs. Wikipedia across two LM objectives (masked and causal), three languages (English, French, German), and three syntactic minimal-pair benchmarks.<n>Our results on these benchmarks show inconsistent benefits of CDL, which in most cases is outperformed by Wikipedia models.
arXiv Detail & Related papers (2025-05-29T17:25:36Z)
LLM-Based Evaluation of Low-Resource Machine Translation: A Reference-less Dialect Guided Approach with a Refined Sylheti-English Benchmark [1.3927943269211591]
We propose a comprehensive framework that enhances Large Language Models (LLMs)-based machine translation evaluation.<n>We extend the ONUBAD dataset by incorporating Sylheti-English sentence pairs, corresponding machine translations, and Direct Assessment (DA) scores annotated by native speakers.<n>Our evaluation shows that the proposed pipeline consistently outperforms existing methods, achieving the highest gain of +0.1083 in Spearman correlation.
arXiv Detail & Related papers (2025-05-18T07:24:13Z)
Unsupervised Approach to Evaluate Sentence-Level Fluency: Do We Really Need Reference? [3.2528685897001455]
This paper adapts an existing unsupervised technique for measuring text fluency without the need for any reference. Our approach leverages various word embeddings and trains language models using Recurrent Neural Network (RNN) architectures. To assess the performance of the models, we conduct a comparative analysis across 10 Indic languages.
arXiv Detail & Related papers (2023-12-03T20:09:23Z)
BLEU Meets COMET: Combining Lexical and Neural Metrics Towards Robust Machine Translation Evaluation [12.407789866525079]
We show that by using additional information during training, such as sentence-level features and word-level tags, the trained metrics improve their capability to penalize translations with specific troublesome phenomena. We show that by using additional information during training, such as sentence-level features and word-level tags, the trained metrics improve their capability to penalize translations with specific troublesome phenomena.
arXiv Detail & Related papers (2023-05-30T15:50:46Z)
Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References [123.39034752499076]
Div-Ref is a method to enhance evaluation benchmarks by enriching the number of references. We conduct experiments to empirically demonstrate that diversifying the expression of reference can significantly enhance the correlation between automatic evaluation and human evaluation.
arXiv Detail & Related papers (2023-05-24T11:53:29Z)
Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
We propose a method to rescale ordinal annotations and explanations using large language models (LLMs)<n>We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric.<n>Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
arXiv Detail & Related papers (2023-05-24T06:19:14Z)
Translate to Disambiguate: Zero-shot Multilingual Word Sense Disambiguation with Pretrained Language Models [67.19567060894563]
Pretrained Language Models (PLMs) learn rich cross-lingual knowledge and can be finetuned to perform well on diverse tasks. We present a new study investigating how well PLMs capture cross-lingual word sense with Contextual Word-Level Translation (C-WLT) We find that as the model size increases, PLMs encode more cross-lingual word sense knowledge and better use context to improve WLT performance.
arXiv Detail & Related papers (2023-04-26T19:55:52Z)
On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks. We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments. We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z)
ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and an endangered language Cherokee. It supports both statistical and neural translation models as well as provides quality estimation to inform users of reliability.
arXiv Detail & Related papers (2021-07-30T17:58:54Z)
Exemplar-Controllable Paraphrasing and Translation using Bitext [57.92051459102902]
We adapt models from prior work to be able to learn solely from bilingual text (bitext) Our single proposed model can perform four tasks: controlled paraphrase generation in both languages and controlled machine translation in both language directions.
arXiv Detail & Related papers (2020-10-12T17:02:50Z)
On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity. This paper concerns ourselves with reference-free machine translation (MT) evaluation where we directly compare source texts to (sometimes low-quality) system translations. We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER. We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
arXiv Detail & Related papers (2020-05-03T22:10:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.