Hierarchical I3D for Sign Spotting
- URL: http://arxiv.org/abs/2210.00951v1
- Date: Mon, 3 Oct 2022 14:07:23 GMT
- Title: Hierarchical I3D for Sign Spotting
- Authors: Ryan Wong, Necati Cihan Camg\"oz, Richard Bowden
- Abstract summary: We focus on the challenging task of Sign Spotting instead of Isolated Sign Language Recognition.
We propose a hierarchical sign spotting approach which learns coarse-to-fine spatio-temporal sign features.
We achieve a state-of-the-art 0.607 F1 score, which was the top-1 winning solution of the ChaLearn 2022 Sign Spotting Challenge.
- Score: 39.69485385546803
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Most of the vision-based sign language research to date has focused on
Isolated Sign Language Recognition (ISLR), where the objective is to predict a
single sign class given a short video clip. Although there has been significant
progress in ISLR, its real-life applications are limited. In this paper, we
focus on the challenging task of Sign Spotting instead, where the goal is to
simultaneously identify and localise signs in continuous co-articulated sign
videos. To address the limitations of current ISLR-based models, we propose a
hierarchical sign spotting approach which learns coarse-to-fine spatio-temporal
sign features to take advantage of representations at various temporal levels
and provide more precise sign localisation. Specifically, we develop
Hierarchical Sign I3D model (HS-I3D) which consists of a hierarchical network
head that is attached to the existing spatio-temporal I3D model to exploit
features at different layers of the network. We evaluate HS-I3D on the ChaLearn
2022 Sign Spotting Challenge - MSSL track and achieve a state-of-the-art 0.607
F1 score, which was the top-1 winning solution of the competition.
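For context, the challenge's F1 score is computed over spotted sign instances rather than whole-clip labels. The sketch below illustrates one standard way such an instance-level F1 can be evaluated, by greedily matching predicted sign instances to ground truth of the same class under a temporal IoU threshold. The exact matching policy and threshold of the ChaLearn MSSL track are assumptions here, not taken from the paper.

```python
def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def spotting_f1(predictions, ground_truth, iou_threshold=0.5):
    """F1 over sign instances via greedy one-to-one matching.

    Each instance is a (sign_class, start, end) tuple. A prediction
    counts as a true positive when it matches a not-yet-matched
    ground-truth instance of the same class with temporal IoU at or
    above the threshold.
    """
    matched = set()
    tp = 0
    for cls, start, end in predictions:
        for i, (gcls, gstart, gend) in enumerate(ground_truth):
            if i in matched or gcls != cls:
                continue
            if temporal_iou((start, end), (gstart, gend)) >= iou_threshold:
                matched.add(i)
                tp += 1
                break
    fp = len(predictions) - tp
    fn = len(ground_truth) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

Under this metric, precise temporal localisation matters as much as correct classification, which is why coarse-to-fine temporal features are useful.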
Related papers
- Real-Time American Sign Language Recognition Using 3D Convolutional Neural Networks and LSTM: Architecture, Training, and Deployment [0.0]
This paper presents a real-time American Sign Language (ASL) recognition system utilizing a hybrid deep learning architecture.
The system processes webcam video streams to recognize word-level ASL signs, addressing communication barriers for over 70 million deaf and hard-of-hearing individuals worldwide.
arXiv Detail & Related papers (2025-12-19T00:17:43Z) - Pose-Based Sign Language Spotting via an End-to-End Encoder Architecture [0.4083182125683813]
We present a first step toward sign language retrieval by addressing the challenge of detecting the presence or absence of a query sign video.
Unlike conventional approaches that rely on intermediate gloss recognition or text-based matching, we propose an end-to-end model that directly operates on pose keypoints extracted from sign videos.
Our architecture employs an encoder-only backbone with a binary classification head to determine whether the query sign appears within the target sequence.
arXiv Detail & Related papers (2025-12-09T15:49:23Z) - Exploiting Ensemble Learning for Cross-View Isolated Sign Language Recognition [14.547488459868442]
We present our solution to the Cross-View Isolated Sign Language Recognition (CV-ISLR) challenge held at WWW 2025.
CV-ISLR addresses a critical issue in traditional Isolated Sign Language Recognition (ISLR), where existing datasets predominantly capture sign language videos from a frontal perspective.
Our solution ranked 3rd in both the RGB-based ISLR and RGB-D-based ISLR tracks, demonstrating its effectiveness in handling the challenges of cross-view recognition.
arXiv Detail & Related papers (2025-02-04T10:21:28Z) - MASA: Motion-aware Masked Autoencoder with Semantic Alignment for Sign Language Recognition [94.56755080185732]
We propose a Motion-Aware masked autoencoder with Semantic Alignment (MASA) that integrates rich motion cues and global semantic information.
Our framework can simultaneously learn local motion cues and global semantic features for comprehensive sign language representation.
arXiv Detail & Related papers (2024-05-31T08:06:05Z) - Enhancing Brazilian Sign Language Recognition through Skeleton Image Representation [2.6311088262657907]
This work proposes an Isolated Sign Language Recognition (ISLR) approach where body, hands, and facial landmarks are extracted throughout time and encoded as 2-D images.
We show that our method surpassed the state-of-the-art in terms of performance metrics on two widely recognized datasets in Brazilian Sign Language (LIBRAS)
In addition to being more accurate, our method is more time-efficient and easier to train due to its reliance on a simpler network architecture and solely RGB data as input.
arXiv Detail & Related papers (2024-04-29T23:21:17Z) - SignBERT+: Hand-model-aware Self-supervised Pre-training for Sign Language Understanding [132.78015553111234]
Hand gestures play a crucial role in the expression of sign language.
Current deep learning based methods for sign language understanding (SLU) are prone to over-fitting due to insufficient sign data resources.
We propose the first self-supervised pre-trainable SignBERT+ framework with model-aware hand prior incorporated.
arXiv Detail & Related papers (2023-05-08T17:16:38Z) - Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble [71.97020373520922]
Sign language is commonly used by deaf or mute people to communicate.
We propose a novel Multi-modal Framework with a Global Ensemble Model (GEM) for isolated Sign Language Recognition (SLR).
Our proposed SAM-SLR-v2 framework is exceedingly effective and achieves state-of-the-art performance with significant margins.
arXiv Detail & Related papers (2021-10-12T16:57:18Z) - Multi-Modal Zero-Shot Sign Language Recognition [51.07720650677784]
We propose a multi-modal Zero-Shot Sign Language Recognition model.
A Transformer-based model along with a C3D model is used for hand detection and deep features extraction.
A semantic space is used to map the visual features to the lingual embedding of the class labels.
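The semantic-space idea described above can be sketched as nearest-neighbour classification in a shared embedding space: a visual feature (assumed already projected into that space) is assigned the label whose lingual embedding it is most similar to. The cosine-similarity choice and all names below are illustrative assumptions, not details from the paper.

```python
import numpy as np

def zero_shot_classify(visual_feature, label_embeddings, label_names):
    """Return the label whose embedding is most cosine-similar to the
    visual feature in the shared semantic space."""
    v = visual_feature / np.linalg.norm(visual_feature)
    e = label_embeddings / np.linalg.norm(label_embeddings, axis=1, keepdims=True)
    scores = e @ v  # cosine similarity to each class embedding
    return label_names[int(np.argmax(scores))]
```

Because classification reduces to similarity against label embeddings, classes unseen during training can be recognised as long as their lingual embeddings are available.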
arXiv Detail & Related papers (2021-09-02T09:10:39Z) - Looking for the Signs: Identifying Isolated Sign Instances in Continuous Video Footage [45.29710323525548]
We propose a transformer-based network, called SignLookup, to extract spatio-temporal representations from video clips.
Our model achieves state-of-the-art performance on the sign spotting task with accuracy as high as 96% on challenging benchmark datasets.
arXiv Detail & Related papers (2021-07-21T12:49:44Z) - Pose-based Sign Language Recognition using GCN and BERT [0.0]
Word-level sign language recognition (WSLR) is the first important step towards understanding and interpreting sign language.
Recognizing signs from videos is a challenging task as the meaning of a word depends on a combination of subtle body motions, hand configurations, and other movements.
Recent pose-based architectures for WSLR either model both the spatial and temporal dependencies among the poses in different frames simultaneously or only model the temporal information without fully utilizing the spatial information.
We tackle the problem of WSLR using a novel pose-based approach, which captures spatial and temporal information separately and performs late fusion.
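Late fusion, as described above, combines the outputs of separately trained streams at decision time. A minimal sketch, assuming the common weighted average of per-stream class probabilities (the specific fusion rule and weight are assumptions, not taken from the paper):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def late_fusion(spatial_logits, temporal_logits, weight=0.5):
    """Fuse spatial and temporal streams by weighted averaging of
    their per-class probability distributions."""
    return weight * softmax(spatial_logits) + (1 - weight) * softmax(temporal_logits)
```

Keeping the streams separate until this final step lets each specialise (pose configuration vs. motion dynamics) without one dominating the shared representation during training.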
arXiv Detail & Related papers (2020-12-01T19:10:50Z) - Transferring Cross-domain Knowledge for Video Sign Language Recognition [103.9216648495958]
Word-level sign language recognition (WSLR) is a fundamental task in sign language interpretation.
We propose a novel method that learns domain-invariant visual concepts and fertilizes WSLR models by transferring knowledge of subtitled news sign to them.
arXiv Detail & Related papers (2020-03-08T03:05:21Z) - Video-based Sign Language Recognition without Temporal Segmentation [88.03159640595187]
We propose a novel continuous sign recognition framework, which eliminates the preprocessing of temporal segmentation.
The proposed LS-HAN consists of three components: a two-stream Convolutional Neural Network (CNN) for video feature representation generation, a Latent Space for semantic gap bridging, and a Hierarchical Attention Network (HAN) for latent space based recognition.
arXiv Detail & Related papers (2018-01-30T17:37:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.