SiLVERScore: Semantically-Aware Embeddings for Sign Language Generation Evaluation
- URL: http://arxiv.org/abs/2509.03791v1
- Date: Thu, 04 Sep 2025 00:58:43 GMT
- Title: SiLVERScore: Semantically-Aware Embeddings for Sign Language Generation Evaluation
- Authors: Saki Imai, Mert İnan, Anthony Sicilia, Malihe Alikhani
- Abstract summary: We propose SiLVERScore, a semantically-aware embedding-based evaluation metric for sign language generation. On PHOENIX-14T and CSL-Daily datasets, SiLVERScore achieves near-perfect discrimination between correct and random pairs.
- Score: 29.960223851833785
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating sign language generation is often done through back-translation, where generated signs are first recognized back into text and then compared to a reference using text-based metrics. However, this two-step evaluation pipeline introduces ambiguity: it not only fails to capture the multimodal nature of sign language, such as facial expressions, spatial grammar, and prosody, but also makes it hard to pinpoint whether evaluation errors come from the sign generation model or the translation system used to assess it. In this work, we propose SiLVERScore, a novel semantically-aware embedding-based evaluation metric that assesses sign language generation in a joint embedding space. Our contributions include: (1) identifying limitations of existing metrics, (2) introducing SiLVERScore for semantically-aware evaluation, (3) demonstrating its robustness to semantic and prosodic variations, and (4) exploring generalization challenges across datasets. On the PHOENIX-14T and CSL-Daily datasets, SiLVERScore achieves near-perfect discrimination between correct and random pairs (ROC AUC = 0.99, overlap < 7%), substantially outperforming traditional metrics.
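The abstract describes scoring a generated sign sequence against a reference sentence by embedding both into a joint space and then checking how well those scores separate correct (sign, text) pairs from random pairings (ROC AUC). The sketch below is a minimal illustration of that evaluation setup only: the cosine-similarity scoring and the synthetic vectors standing in for encoder outputs are assumptions for illustration, not the paper's actual model or implementation.

```python
# Minimal sketch of embedding-based scoring and correct-vs-random discrimination,
# in the spirit of SiLVERScore. The synthetic vectors below stand in for the
# outputs of a joint sign-text encoder, which is NOT reproduced here.
import numpy as np
from sklearn.metrics import roc_auc_score


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def embedding_score(sign_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Score a generated sign sequence against a reference text by comparing
    their embeddings in a shared space (higher = more semantically similar)."""
    return cosine_similarity(sign_emb, text_emb)


def discrimination_auc(correct_pairs, random_pairs) -> float:
    """ROC AUC for separating correct (sign, text) pairs from random pairings,
    mirroring the correct-vs-random discrimination reported in the abstract."""
    scores = [embedding_score(s, t) for s, t in correct_pairs] + \
             [embedding_score(s, t) for s, t in random_pairs]
    labels = [1] * len(correct_pairs) + [0] * len(random_pairs)
    return roc_auc_score(labels, scores)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 512
    # Synthetic stand-ins for encoder outputs: correct pairs share a latent
    # semantic vector plus noise; random pairs are independent draws.
    correct_pairs = []
    for _ in range(100):
        latent = rng.normal(size=dim)
        correct_pairs.append((latent + 0.3 * rng.normal(size=dim),
                              latent + 0.3 * rng.normal(size=dim)))
    random_pairs = [(rng.normal(size=dim), rng.normal(size=dim)) for _ in range(100)]
    print(f"ROC AUC (correct vs. random): {discrimination_auc(correct_pairs, random_pairs):.3f}")
```

In the paper, the embeddings would come from a trained joint sign-text encoder applied to generated pose sequences and reference sentences; only the scoring and the correct-versus-random discrimination protocol are mirrored here.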
Related papers
- Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages [0.22009842278462158]
Cross-lingual evaluation of large language models (LLMs) typically conflates two sources of variance: genuine model performance differences and measurement instability.
We investigate evaluation reliability by holding generation conditions constant while varying the target language.
Our findings suggest that zero-shot judge transfer is unreliable for discourse-level assessment in morphologically rich languages.
arXiv Detail & Related papers (2026-02-02T16:27:32Z) - Beyond a Single Reference: Training and Evaluation with Paraphrases in Sign Language Translation [1.9102169745315323]
Most Sign Language Translation (SLT) corpora pair each signed utterance with a single written-language reference.
This limitation constrains both model training and evaluation.
We introduce BLEUpara, an extension of BLEU that evaluates translations against multiple paraphrased references.
arXiv Detail & Related papers (2026-01-29T00:02:19Z) - AQAScore: Evaluating Semantic Alignment in Text-to-Audio Generation via Audio Question Answering [97.52852990265136]
We introduce AQAScore, a backbone-agnostic evaluation framework that leverages the reasoning capabilities of audio-aware large language models.
We evaluate AQAScore across multiple benchmarks, including human-rated relevance, pairwise comparison, and compositional reasoning tasks.
arXiv Detail & Related papers (2026-01-21T07:35:36Z) - SCORE: A Semantic Evaluation Framework for Generative Document Parsing [2.5101597298392098]
Multi-modal generative document parsing systems produce semantically correct yet structurally divergent outputs.
Conventional metrics (CER, WER, IoU, or TEDS) misclassify such diversity as error, penalizing valid interpretations and obscuring system behavior.
We introduce SCORE, an interpretation-agnostic framework that integrates (i) adjusted edit distance for robust content fidelity, (ii) token-level diagnostics to distinguish hallucinations from omissions, (iii) table evaluation with spatial tolerance and semantic alignment, and (iv) hierarchy-aware consistency checks.
arXiv Detail & Related papers (2025-09-16T16:06:19Z) - CAAD: Context-Aware Adaptive Decoding for Truthful Text Generation [31.469511576774252]
We propose a context-aware adaptive decoding method for large language models.
Our approach achieves a 2.8 percent average improvement on TruthfulQA.
Our model-agnostic, scalable, and efficient method requires only a single generation pass.
arXiv Detail & Related papers (2025-08-04T08:28:25Z) - A Benchmark of French ASR Systems Based on Error Severity [6.657432034629865]
A new evaluation is proposed that categorizes errors into four levels of severity.
This metric is applied to a benchmark of 10 state-of-the-art ASR systems for the French language.
arXiv Detail & Related papers (2025-01-18T21:07:18Z) - Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator [55.94334001112357]
We introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs.
We propose a retrieval-enhanced SLG approach, which incorporates external sign dictionaries to provide accurate word-level signs.
arXiv Detail & Related papers (2024-11-26T18:28:09Z) - signwriting-evaluation: Effective Sign Language Evaluation via SignWriting [3.484261625026626]
This paper introduces a comprehensive suite of evaluation metrics specifically designed for SignWriting.
We address the challenges of evaluating single signs versus continuous signing.
Our findings reveal the strengths and limitations of each metric, offering valuable insights for future advancements.
arXiv Detail & Related papers (2024-10-17T15:28:45Z) - MS2SL: Multimodal Spoken Data-Driven Continuous Sign Language Production [93.32354378820648]
We propose a unified framework for continuous sign language production, easing communication between sign and non-sign language users.
A sequence diffusion model, utilizing embeddings extracted from text or speech, is crafted to generate sign predictions step by step.
Experiments on How2Sign and PHOENIX14T datasets demonstrate that our model achieves competitive performance in sign language production.
arXiv Detail & Related papers (2024-07-04T13:53:50Z) - Learnable Item Tokenization for Generative Recommendation [78.30417863309061]
We propose LETTER (a LEarnable Tokenizer for generaTivE Recommendation), which integrates hierarchical semantics, collaborative signals, and code assignment diversity.
LETTER incorporates Residual Quantized VAE for semantic regularization, a contrastive alignment loss for collaborative regularization, and a diversity loss to mitigate code assignment bias.
arXiv Detail & Related papers (2024-05-12T15:49:38Z) - APPLS: Evaluating Evaluation Metrics for Plain Language Summarization [18.379461020500525]
This study introduces a granular meta-evaluation testbed, APPLS, designed to evaluate metrics for Plain Language Summarization (PLS).
We identify four PLS criteria from previous work and define a set of perturbations corresponding to these criteria that sensitive metrics should be able to detect.
Using APPLS, we assess the performance of 14 metrics, including automated scores, lexical features, and LLM prompt-based evaluations.
arXiv Detail & Related papers (2023-05-23T17:59:19Z) - INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback [80.57617091714448]
We present InstructScore, an explainable evaluation metric for text generation.
We fine-tune a text evaluation metric based on LLaMA, producing a score for generated text and a human-readable diagnostic report.
arXiv Detail & Related papers (2023-05-23T17:27:22Z) - TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets, we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
arXiv Detail & Related papers (2022-04-11T10:14:35Z) - On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity.
This paper concerns itself with reference-free machine translation (MT) evaluation, where we directly compare source texts to (sometimes low-quality) system translations.
We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER.
We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
arXiv Detail & Related papers (2020-05-03T22:10:23Z)