Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble
- URL: http://arxiv.org/abs/2110.06161v1
- Date: Tue, 12 Oct 2021 16:57:18 GMT
- Title: Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble
- Authors: Songyao Jiang, Bin Sun, Lichen Wang, Yue Bai, Kunpeng Li, Yun Fu
- Abstract summary: Sign language is commonly used by deaf or mute people to communicate.
We propose a novel Skeleton Aware Multi-modal Framework with a Global Ensemble Model (GEM) for isolated Sign Language Recognition (SLR).
Our proposed SAM-SLR-v2 framework is exceedingly effective and achieves state-of-the-art performance with significant margins.
- Score: 71.97020373520922
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sign language is commonly used by deaf or mute people to communicate but
requires extensive effort to master. It is usually performed with the fast yet
delicate movement of hand gestures, body posture, and even facial expressions.
Current Sign Language Recognition (SLR) methods usually extract features via
deep neural networks and suffer overfitting due to limited and noisy data.
Recently, skeleton-based action recognition has attracted increasing attention
due to its subject-invariant and background-invariant nature, whereas
skeleton-based SLR is still under exploration due to the lack of hand
annotations. Some researchers have tried to use off-line hand pose trackers to
obtain hand keypoints and aid in recognizing sign language via recurrent neural
networks. Nevertheless, none of them outperforms RGB-based approaches yet. To
this end, we propose a novel Skeleton Aware Multi-modal Framework with a Global
Ensemble Model (GEM) for isolated SLR (SAM-SLR-v2) to learn and fuse
multi-modal feature representations towards a higher recognition rate.
Specifically, we propose a Sign Language Graph Convolution Network (SL-GCN) to
model the embedded dynamics of skeleton keypoints and a Separable
Spatial-Temporal Convolution Network (SSTCN) to exploit skeleton features. The
skeleton-based predictions are fused with other RGB and depth based modalities
by the proposed late-fusion GEM to provide global information and make a
faithful SLR prediction. Experiments on three isolated SLR datasets demonstrate
that our proposed SAM-SLR-v2 framework is exceedingly effective and achieves
state-of-the-art performance with significant margins. Our code will be
available at https://github.com/jackyjsy/SAM-SLR-v2
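To make the skeleton branch and the late-fusion step more concrete, here is a minimal sketch of an ST-GCN-style block over keypoint features together with a weighted score-level fusion in the spirit of the GEM. This is an illustrative assumption, not the released SAM-SLR-v2 implementation; module names, tensor shapes, and the fixed fusion weights are all hypothetical.

```python
# Minimal sketch (not the authors' code): an ST-GCN-style block over skeleton
# keypoints and a weighted late-fusion step in the spirit of the GEM ensemble.
# All names and hyper-parameters here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SkeletonGCNBlock(nn.Module):
    """One spatial-temporal graph convolution over keypoint features.

    Input:  x of shape (N, C, T, V)  -- batch, channels, frames, joints
            A of shape (V, V)        -- normalized joint adjacency matrix
    """

    def __init__(self, in_channels: int, out_channels: int, kernel_t: int = 9):
        super().__init__()
        self.spatial = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.temporal = nn.Conv2d(
            out_channels, out_channels,
            kernel_size=(kernel_t, 1), padding=(kernel_t // 2, 0),
        )
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        # Aggregate each joint's features from its graph neighbours: x @ A.
        x = torch.einsum("nctv,vw->nctw", x, A)
        x = self.spatial(x)          # mix channels per joint
        x = self.temporal(x)         # model short-range temporal dynamics
        return F.relu(self.bn(x))


def late_fusion(scores_per_modality, weights=None):
    """Fuse per-modality class scores (each of shape (N, num_classes)).

    A fixed weighted sum of softmax scores; the paper's GEM learns how to
    combine modalities, which this simple weighted average only approximates.
    """
    if weights is None:
        weights = [1.0 / len(scores_per_modality)] * len(scores_per_modality)
    fused = sum(w * F.softmax(s, dim=-1)
                for w, s in zip(weights, scores_per_modality))
    return fused.argmax(dim=-1), fused


if __name__ == "__main__":
    N, C, T, V, num_classes = 2, 3, 32, 27, 226        # hypothetical sizes
    A = torch.eye(V)                                    # placeholder adjacency
    block = SkeletonGCNBlock(C, 64)
    feats = block(torch.randn(N, C, T, V), A)           # (N, 64, T, V)
    # Pretend three modality heads (skeleton, RGB, depth) already produced scores.
    scores = [torch.randn(N, num_classes) for _ in range(3)]
    pred, fused = late_fusion(scores, weights=[0.5, 0.3, 0.2])
    print(pred.shape, fused.shape)
```

The key design point this sketch mirrors is that fusion happens at the prediction level, so each modality can be trained independently before the ensemble combines their scores.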
Related papers
- Bengali Sign Language Recognition through Hand Pose Estimation using Multi-Branch Spatial-Temporal Attention Model [0.5825410941577593]
We propose a spatial-temporal attention-based BSL recognition model considering hand joint skeletons extracted from the sequence of images.
Our model captures discriminative structural displacements and short-range dependency based on unified joint features projected onto high-dimensional feature space.
arXiv Detail & Related papers (2024-08-26T08:55:16Z)
- Multi-Stream Keypoint Attention Network for Sign Language Recognition and Translation [3.976851945232775]
Current approaches for sign language recognition rely on RGB video inputs, which are vulnerable to fluctuations in the background.
We propose a multi-stream keypoint attention network to depict a sequence of keypoints produced by a readily available keypoint estimator.
We carry out comprehensive experiments on well-known benchmarks like Phoenix-2014, Phoenix-2014T, and CSL-Daily to showcase the efficacy of our methodology.
arXiv Detail & Related papers (2024-05-09T10:58:37Z)
- Self-Sufficient Framework for Continuous Sign Language Recognition [75.60327502570242]
The goal of this work is to develop a self-sufficient framework for Continuous Sign Language Recognition.
Key challenges include the need for complex multi-scale features, such as hands, face, and mouth, for understanding, and the absence of frame-level annotations.
We propose Divide and Focus Convolution (DFConv) which extracts both manual and non-manual features without the need for additional networks or annotations.
Dense Pseudo-Label Refinement (DPLR) propagates non-spiky frame-level pseudo-labels by combining the ground-truth gloss sequence labels with the predicted sequence.
arXiv Detail & Related papers (2023-03-21T11:42:57Z)
- StepNet: Spatial-temporal Part-aware Network for Isolated Sign Language Recognition [33.44126628779347]
We propose a new framework called Spatial-temporal Part-aware network (StepNet) based on RGB parts.
Part-level Spatial Modeling automatically captures the appearance-based properties, such as hands and faces, in the feature space.
Part-level Temporal Modeling implicitly mines the long-short term context to capture the relevant attributes over time.
arXiv Detail & Related papers (2022-12-25T05:24:08Z)
- Joint-bone Fusion Graph Convolutional Network for Semi-supervised Skeleton Action Recognition [65.78703941973183]
We propose a novel correlation-driven joint-bone fusion graph convolutional network (CD-JBF-GCN) as an encoder and use a pose prediction head as a decoder.
Specifically, the CD-JBF-GCN can explore the motion transmission between the joint stream and the bone stream.
The pose prediction based auto-encoder in the self-supervised training stage allows the network to learn motion representation from unlabeled data.
arXiv Detail & Related papers (2022-02-08T16:03:15Z)
- Multi-Modal Zero-Shot Sign Language Recognition [51.07720650677784]
We propose a multi-modal Zero-Shot Sign Language Recognition model.
A Transformer-based model along with a C3D model is used for hand detection and deep feature extraction.
A semantic space is used to map the visual features to the lingual embedding of the class labels.
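As a brief illustration of that semantic-space mapping, the sketch below projects visual features into a word-embedding space and scores classes by cosine similarity, so unseen classes can still be ranked. The module name and dimensions are hypothetical assumptions, not the paper's implementation.

```python
# Hedged sketch of the zero-shot classification idea (illustrative only):
# project visual features into a semantic space and score each class by
# similarity to its label embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ZeroShotHead(nn.Module):
    def __init__(self, visual_dim: int, semantic_dim: int):
        super().__init__()
        self.project = nn.Linear(visual_dim, semantic_dim)

    def forward(self, visual_feat: torch.Tensor, class_embeddings: torch.Tensor):
        # visual_feat: (N, visual_dim); class_embeddings: (num_classes, semantic_dim)
        v = F.normalize(self.project(visual_feat), dim=-1)
        c = F.normalize(class_embeddings, dim=-1)
        return v @ c.t()  # (N, num_classes) cosine similarities
```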
arXiv Detail & Related papers (2021-09-02T09:10:39Z)
- Skeleton Based Sign Language Recognition Using Whole-body Keypoints [71.97020373520922]
Sign language is used by deaf or speech impaired people to communicate.
Skeleton-based recognition is becoming popular because it can be further ensembled with RGB-D based methods to achieve state-of-the-art performance.
Inspired by the recent development of whole-body pose estimation (Jin et al., 2020), we propose recognizing sign language based on the whole-body key points and features.
arXiv Detail & Related papers (2021-03-16T03:38:17Z)
- Pose-based Sign Language Recognition using GCN and BERT [0.0]
Word-level sign language recognition (WSLR) is the first important step towards understanding and interpreting sign language.
Recognizing signs from videos is a challenging task, as the meaning of a word depends on a combination of subtle body motions, hand configurations, and other movements.
Recent pose-based architectures for WSLR either model both the spatial and temporal dependencies among the poses in different frames simultaneously or only model the temporal information without fully utilizing the spatial information.
We tackle the problem of WSLR using a novel pose-based approach, which captures spatial and temporal information separately and performs late fusion.
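As a rough illustration of that separation, the sketch below models spatial structure with a simple per-joint encoder and temporal structure with a small Transformer encoder, then fuses the two streams' scores. All names and sizes are assumptions, not the paper's GCN+BERT architecture.

```python
# Illustrative sketch only (not the paper's code): spatial and temporal
# information modeled by separate streams, with class scores fused at the end.
import torch
import torch.nn as nn


class SpatialStream(nn.Module):
    """Pools joint features within and across frames -- a crude stand-in for the spatial stream."""
    def __init__(self, joint_dim: int, num_classes: int):
        super().__init__()
        self.proj = nn.Linear(joint_dim, 128)
        self.head = nn.Linear(128, num_classes)

    def forward(self, poses):                            # (N, T, V, C)
        feat = self.proj(poses).mean(dim=(1, 2))         # average over frames and joints
        return self.head(feat)


class TemporalStream(nn.Module):
    """Transformer encoder over per-frame pose vectors -- the BERT-like temporal stream."""
    def __init__(self, joint_dim: int, num_joints: int, num_classes: int):
        super().__init__()
        d_model = 128
        self.embed = nn.Linear(num_joints * joint_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, poses):                            # (N, T, V, C)
        n, t, v, c = poses.shape
        x = self.embed(poses.reshape(n, t, v * c))
        x = self.encoder(x).mean(dim=1)                  # average over frames
        return self.head(x)


def fuse(spatial_logits, temporal_logits, alpha=0.5):
    """Late fusion: weighted sum of the two streams' softmax scores."""
    return alpha * spatial_logits.softmax(-1) + (1 - alpha) * temporal_logits.softmax(-1)
```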
arXiv Detail & Related papers (2020-12-01T19:10:50Z)
- Fully Convolutional Networks for Continuous Sign Language Recognition [83.85895472824221]
Continuous sign language recognition is a challenging task that requires learning on both spatial and temporal dimensions.
We propose a fully convolutional network (FCN) for online SLR to concurrently learn spatial and temporal features from weakly annotated video sequences.
arXiv Detail & Related papers (2020-07-24T08:16:37Z)