Meaningful Pose-Based Sign Language Evaluation
- URL: http://arxiv.org/abs/2510.07453v1
- Date: Wed, 08 Oct 2025 19:00:24 GMT
- Title: Meaningful Pose-Based Sign Language Evaluation
- Authors: Zifan Jiang, Colin Leong, Amit Moryossef, Anne Göhring, Annette Rios, Oliver Cory, Maksym Ivashechkin, Neha Tarigopula, Biao Zhang, Rico Sennrich, Sarah Ebling,
- Abstract summary: The study covers keypoint distance-based, embedding-based, and back-translation-based metrics.<n>We show tradeoffs between different metrics in different scenarios through automatic meta-evaluation of sign-level retrieval and a human correlation study of text-to-pose translation across different sign languages.
- Score: 29.030154300749086
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a comprehensive study on meaningfully evaluating sign language utterances in the form of human skeletal poses. The study covers keypoint distance-based, embedding-based, and back-translation-based metrics. We show tradeoffs between different metrics in different scenarios through automatic meta-evaluation of sign-level retrieval and a human correlation study of text-to-pose translation across different sign languages. Our findings and the open-source pose-evaluation toolkit provide a practical and reproducible way of developing and evaluating sign language translation or generation systems.
Related papers
- On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation [88.77441715819366]
Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content.<n>We propose a variety of likelihood- and generative-based evaluation methods that serve in place of naive global token perplexity.
arXiv Detail & Related papers (2026-01-09T22:01:56Z) - signwriting-evaluation: Effective Sign Language Evaluation via SignWriting [3.484261625026626]
This paper introduces a comprehensive suite of evaluation metrics specifically designed for SignWriting.
We address the challenges of evaluating single signs versus continuous signing.
Our findings reveal the strengths and limitations of each metric, offering valuable insights for future advancements.
arXiv Detail & Related papers (2024-10-17T15:28:45Z) - A Comparative Study of Continuous Sign Language Recognition Techniques [1.534667887016089]
Continuous Sign Language Recognition (CSLR) focuses on the interpretation of a sequence of sign language gestures performed continually without pauses.
In this study, we conduct an empirical evaluation of recent deep learning C SLR techniques and assess their performance across various datasets and sign languages.
arXiv Detail & Related papers (2024-06-18T07:51:44Z) - Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language
Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z) - On the Importance of Signer Overlap for Sign Language Detection [65.26091369630547]
We argue that the current benchmark data sets for sign language detection estimate overly positive results that do not generalize well.
We quantify this with a detailed analysis of the effect of signer overlap on current sign detection benchmark data sets.
We propose new data set partitions that are free of overlap and allow for more realistic performance assessment.
arXiv Detail & Related papers (2023-03-19T22:15:05Z) - Multilingual Extraction and Categorization of Lexical Collocations with
Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context.
Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z) - IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and
Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z) - Does Summary Evaluation Survive Translation to Other Languages? [0.0]
We translate an existing English summarization dataset, SummEval dataset, to four different languages.
We analyze the scores from the automatic evaluation metrics in translated languages, as well as their correlation with human annotations in the source language.
arXiv Detail & Related papers (2021-09-16T17:35:01Z) - A Massively Multilingual Analysis of Cross-linguality in Shared
Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z) - Evaluating the Immediate Applicability of Pose Estimation for Sign
Language Recognition [33.26064598621083]
We evaluate representations based on skeleton poses, as these are explainable, person-independent, privacy-preserving, low-dimensional representations.
We perform two independent studies using two state-of-the-art pose estimation systems.
We analyze the applicability of the pose estimation systems to sign language recognition by evaluating the failure cases of the recognition models.
arXiv Detail & Related papers (2021-04-20T14:41:45Z) - LCEval: Learned Composite Metric for Caption Evaluation [37.2313913156926]
We propose a neural network-based learned metric to improve the caption-level caption evaluation.
This paper investigates the relationship between different linguistic features and the caption-level correlation of the learned metrics.
Our proposed metric not only outperforms the existing metrics in terms of caption-level correlation but it also shows a strong system-level correlation against human assessments.
arXiv Detail & Related papers (2020-12-24T06:38:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.