Score-level Multi Cue Fusion for Sign Language Recognition
- URL: http://arxiv.org/abs/2009.14139v1
- Date: Tue, 29 Sep 2020 16:32:51 GMT
- Title: Score-level Multi Cue Fusion for Sign Language Recognition
- Authors: Çağrı Gökçe, Oğulcan Özdemir, Ahmet Alp Kındıroğlu, and Lale Akarun
- Abstract summary: We propose a more straightforward approach to training cue models for Sign Language Recognition.
We compare the performance of 3D Convolutional Neural Network (CNN) models specializing in the dominant hand, hands, face, and upper body regions.
Our experimental results have shown the effectiveness of mixed convolutional models.
- Score: 2.064612766965483
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sign Languages are expressed through hand and upper body gestures as well as
facial expressions. Therefore, Sign Language Recognition (SLR) needs to focus
on all such cues. Previous work uses hand-crafted mechanisms or network
aggregation to extract the different cue features and increase SLR performance.
This is slow and involves complicated architectures. We propose a more
straightforward approach that focuses on training separate cue models
specializing in the dominant hand, hands, face, and upper body regions. We
compare the performance of 3D Convolutional Neural Network (CNN) models
specializing in these regions, combine them through score-level fusion, and
also evaluate a weighted fusion alternative. Our experimental results have shown the effectiveness
of mixed convolutional models. Their fusion yields up to 19% accuracy
improvement over the baseline using the full upper body. Furthermore, we
include a discussion for fusion settings, which can help future work on Sign
Language Translation (SLT).
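To make the score-level fusion concrete, below is a minimal Python sketch of combining per-cue class scores with plain and weighted averaging. The function name, cue names, and toy numbers are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def fuse_scores(cue_scores, weights=None):
    """Score-level fusion: average (optionally weighted) per-cue class scores.

    cue_scores: dict mapping cue name -> (num_classes,) softmax score vector
    weights:    dict mapping cue name -> scalar weight; uniform if None
    Returns the fused score vector and the predicted class index.
    """
    cues = list(cue_scores)
    if weights is None:
        weights = {c: 1.0 for c in cues}        # plain (unweighted) fusion
    total = sum(weights[c] for c in cues)
    fused = sum(weights[c] * np.asarray(cue_scores[c]) for c in cues) / total
    return fused, int(np.argmax(fused))

# Illustrative usage with four hypothetical cue models (3-class toy example).
scores = {
    "dominant_hand": np.array([0.70, 0.20, 0.10]),
    "both_hands":    np.array([0.55, 0.30, 0.15]),
    "face":          np.array([0.30, 0.50, 0.20]),
    "upper_body":    np.array([0.40, 0.35, 0.25]),
}
fused, pred = fuse_scores(scores, weights={"dominant_hand": 2.0, "both_hands": 1.5,
                                           "face": 1.0, "upper_body": 1.0})
print(fused, pred)
```

The weighted variant simply biases the average toward the more reliable cues (e.g., the dominant hand); with uniform weights it reduces to plain score averaging.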
Related papers
- Sign Language Sense Disambiguation [0.0]
This project explores methods to enhance translation of German Sign Language, specifically focusing on homonyms.
We approach the improvement by training transformer-based models on various body-part representations to shift the focus onto the respective body part.
The results show that focusing on the mouth improves performance in small-dataset settings, while shifting the focus to the hands yields better results in larger-dataset settings.
arXiv Detail & Related papers (2024-09-13T12:36:52Z)
- Multi-Stream Keypoint Attention Network for Sign Language Recognition and Translation [3.976851945232775]
Current approaches for sign language recognition rely on RGB video inputs, which are vulnerable to fluctuations in the background.
We propose a multi-stream keypoint attention network to depict a sequence of keypoints produced by a readily available keypoint estimator.
We carry out comprehensive experiments on well-known benchmarks like Phoenix-2014, Phoenix-2014T, and CSL-Daily to showcase the efficacy of our methodology.
arXiv Detail & Related papers (2024-05-09T10:58:37Z)
- SignDiff: Diffusion Models for American Sign Language Production [23.82668888574089]
We propose a dual-condition diffusion pre-training model named SignDiff that can generate human sign language speakers from a skeleton pose.
We also propose a new method for American Sign Language Production (ASLP), which can generate ASL skeletal pose videos from text input.
arXiv Detail & Related papers (2023-08-30T15:14:56Z)
- MMLatch: Bottom-up Top-down Fusion for Multimodal Sentiment Analysis [84.7287684402508]
Current deep learning approaches for multimodal fusion rely on bottom-up fusion of high and mid-level latent modality representations.
Models of human perception highlight the importance of top-down fusion, where high-level representations affect the way sensory inputs are perceived.
We propose a neural architecture that captures top-down cross-modal interactions, using a feedback mechanism in the forward pass during network training.
arXiv Detail & Related papers (2022-01-24T17:48:04Z)
- Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble [71.97020373520922]
Sign language is commonly used by deaf or mute people to communicate.
We propose a novel Multi-modal Framework with a Global Ensemble Model (GEM) for isolated Sign Language Recognition (SLR).
Our proposed SAM-SLR-v2 framework is exceedingly effective and achieves state-of-the-art performance with significant margins.
arXiv Detail & Related papers (2021-10-12T16:57:18Z)
- Multi-Modal Zero-Shot Sign Language Recognition [51.07720650677784]
We propose a multi-modal Zero-Shot Sign Language Recognition model.
A Transformer-based model along with a C3D model is used for hand detection and deep feature extraction.
A semantic space is used to map the visual features to the lingual embedding of the class labels.
arXiv Detail & Related papers (2021-09-02T09:10:39Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
- Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition [80.446770909975]
Linguistic knowledge is of great benefit to scene text recognition.
How to effectively model linguistic rules in end-to-end deep networks remains a research challenge.
We propose an autonomous, bidirectional and iterative ABINet for scene text recognition.
arXiv Detail & Related papers (2021-03-11T06:47:45Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
- FineHand: Learning Hand Shapes for American Sign Language Recognition [16.862375555609667]
We present an approach for effective learning of hand shape embeddings, which are discriminative for ASL gestures.
For hand shape recognition, the method uses a mix of manually labelled hand shapes and high-confidence predictions (see the sketch after this list) to train a deep convolutional neural network (CNN).
We will demonstrate that higher quality hand shape models can significantly improve the accuracy of final video gesture classification.
arXiv Detail & Related papers (2020-03-04T23:32:08Z)
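As a rough illustration of the label-mixing idea in the FineHand entry above (combining manual annotations with high-confidence model predictions), here is a small Python sketch; the confidence threshold, array layout, and function name are assumptions for exposition, not the paper's actual procedure.

```python
import numpy as np

def mix_labels(manual_labels, model_probs, threshold=0.9):
    """Combine manual annotations with high-confidence model predictions.

    manual_labels: (N,) int array; -1 where no manual label is available
    model_probs:   (N, num_classes) softmax outputs of the current hand-shape CNN
    threshold:     confidence above which an unlabeled sample is pseudo-labeled
    Returns indices and labels of the samples to use for the next training round.
    """
    pseudo = model_probs.argmax(axis=1)            # model's predicted class per sample
    conf = model_probs.max(axis=1)                 # confidence of that prediction
    labels = manual_labels.copy()
    use_pseudo = (manual_labels < 0) & (conf >= threshold)
    labels[use_pseudo] = pseudo[use_pseudo]        # fill gaps with confident predictions
    keep = (manual_labels >= 0) | use_pseudo       # drop low-confidence unlabeled samples
    return np.nonzero(keep)[0], labels[keep]

# Toy usage: two manual labels, one confident prediction, one discarded sample.
manual = np.array([2, -1, -1, 0])
probs = np.array([[0.1, 0.1, 0.8],
                  [0.95, 0.03, 0.02],
                  [0.4, 0.3, 0.3],
                  [0.7, 0.2, 0.1]])
idx, lab = mix_labels(manual, probs)
print(idx, lab)
```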