Pose-Based Sign Language Spotting via an End-to-End Encoder Architecture
- URL: http://arxiv.org/abs/2512.08738v1
- Date: Tue, 09 Dec 2025 15:49:23 GMT
- Title: Pose-Based Sign Language Spotting via an End-to-End Encoder Architecture
- Authors: Samuel Ebimobowei Johnny, Blessed Guda, Emmanuel Enejo Aaron, Assane Gueye,
- Abstract summary: We present a first step toward sign language retrieval by addressing the challenge of detecting the presence or absence of a query sign video.<n>Unlike conventional approaches that rely on intermediate gloss recognition or text-based matching, we propose an end-to-end model that directly operates on pose keypoints extracted from sign videos.<n>Our architecture employs an encoder-only backbone with a binary classification head to determine whether the query sign appears within the target sequence.
- Score: 0.4083182125683813
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic Sign Language Recognition (ASLR) has emerged as a vital field for bridging the gap between deaf and hearing communities. However, the problem of sign-to-sign retrieval or detecting a specific sign within a sequence of continuous signs remains largely unexplored. We define this novel task as Sign Language Spotting. In this paper, we present a first step toward sign language retrieval by addressing the challenge of detecting the presence or absence of a query sign video within a sentence-level gloss or sign video. Unlike conventional approaches that rely on intermediate gloss recognition or text-based matching, we propose an end-to-end model that directly operates on pose keypoints extracted from sign videos. Our architecture employs an encoder-only backbone with a binary classification head to determine whether the query sign appears within the target sequence. By focusing on pose representations instead of raw RGB frames, our method significantly reduces computational cost and mitigates visual noise. We evaluate our approach on the Word Presence Prediction dataset from the WSLP 2025 shared task, achieving 61.88\% accuracy and 60.00\% F1-score. These results demonstrate the effectiveness of our pose-based framework for Sign Language Spotting, establishing a strong foundation for future research in automatic sign language retrieval and verification. Code is available at https://github.com/EbimoJohnny/Pose-Based-Sign-Language-Spotting
Related papers
- SignX: The Foundation Model for Sign Recognition [28.651340554377906]
This paper proposes SignX, a foundation model framework for sign recognition.<n>It is a concise yet powerful framework applicable to multiple human activity recognition scenarios.<n> Experimental results show that SignX can recognize signs from sign language video, producing predicted gloss representations with greater accuracy than has been reported in prior work.
arXiv Detail & Related papers (2025-04-22T23:23:39Z) - Self-Supervised Representation Learning with Spatial-Temporal Consistency for Sign Language Recognition [96.62264528407863]
We propose a self-supervised contrastive learning framework to excavate rich context via spatial-temporal consistency.
Inspired by the complementary property of motion and joint modalities, we first introduce first-order motion information into sign language modeling.
Our method is evaluated with extensive experiments on four public benchmarks, and achieves new state-of-the-art performance with a notable margin.
arXiv Detail & Related papers (2024-06-15T04:50:19Z) - Improving Continuous Sign Language Recognition with Cross-Lingual Signs [29.077175863743484]
We study the feasibility of utilizing multilingual sign language corpora to facilitate continuous sign language recognition.
We first build two sign language dictionaries containing isolated signs that appear in two datasets.
Then we identify the sign-to-sign mappings between two sign languages via a well-optimized isolated sign language recognition model.
arXiv Detail & Related papers (2023-08-21T15:58:47Z) - Scaling up sign spotting through sign language dictionaries [99.50956498009094]
The focus of this work is $textitsign spotting$ - given a video of an isolated sign, our task is to identify $textitwhether$ and $textitwhere$ it has been signed in a continuous, co-articulated sign language video.
We train a model using multiple types of available supervision by: (1) $textitwatching$ existing footage which is sparsely labelled using mouthing cues; (2) $textitreading$ associated subtitles which provide additional translations of the signed content.
We validate the effectiveness of our approach on low
arXiv Detail & Related papers (2022-05-09T10:00:03Z) - Sign Language Video Retrieval with Free-Form Textual Queries [19.29003565494735]
We introduce the task of sign language retrieval with free-form textual queries.
The objective is to find the signing video in the collection that best matches the written query.
We propose SPOT-ALIGN, a framework for interleaving iterative rounds of sign spotting and feature alignment to expand the scope and scale of available training data.
arXiv Detail & Related papers (2022-01-07T15:22:18Z) - Skeleton Based Sign Language Recognition Using Whole-body Keypoints [71.97020373520922]
Sign language is used by deaf or speech impaired people to communicate.
Skeleton-based recognition is becoming popular that it can be further ensembled with RGB-D based method to achieve state-of-the-art performance.
Inspired by the recent development of whole-body pose estimation citejin 2020whole, we propose recognizing sign language based on the whole-body key points and features.
arXiv Detail & Related papers (2021-03-16T03:38:17Z) - Pose-based Sign Language Recognition using GCN and BERT [0.0]
Word-level sign language recognition (WSLR) is the first important step towards understanding and interpreting sign language.
recognizing signs from videos is a challenging task as the meaning of a word depends on a combination of subtle body motions, hand configurations, and other movements.
Recent pose-based architectures for W SLR either model both the spatial and temporal dependencies among the poses in different frames simultaneously or only model the temporal information without fully utilizing the spatial information.
We tackle the problem of W SLR using a novel pose-based approach, which captures spatial and temporal information separately and performs late fusion.
arXiv Detail & Related papers (2020-12-01T19:10:50Z) - Watch, read and lookup: learning to spot signs from multiple supervisors [99.50956498009094]
Given a video of an isolated sign, our task is to identify whether and where it has been signed in a continuous, co-articulated sign language video.
We train a model using multiple types of available supervision by: (1) watching existing sparsely labelled footage; (2) reading associated subtitles which provide additional weak-supervision; and (3) looking up words in visual sign language dictionaries.
These three tasks are integrated into a unified learning framework using the principles of Noise Contrastive Estimation and Multiple Instance Learning.
arXiv Detail & Related papers (2020-10-08T14:12:56Z) - BSL-1K: Scaling up co-articulated sign language recognition using
mouthing cues [106.21067543021887]
We show how to use mouthing cues from signers to obtain high-quality annotations from video data.
The BSL-1K dataset is a collection of British Sign Language (BSL) signs of unprecedented scale.
arXiv Detail & Related papers (2020-07-23T16:59:01Z) - Transferring Cross-domain Knowledge for Video Sign Language Recognition [103.9216648495958]
Word-level sign language recognition (WSLR) is a fundamental task in sign language interpretation.
We propose a novel method that learns domain-invariant visual concepts and fertilizes WSLR models by transferring knowledge of subtitled news sign to them.
arXiv Detail & Related papers (2020-03-08T03:05:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.