Continuous Sign Language Recognition with Adapted Conformer via Unsupervised Pretraining
- URL: http://arxiv.org/abs/2405.12018v1
- Date: Mon, 20 May 2024 13:40:52 GMT
- Title: Continuous Sign Language Recognition with Adapted Conformer via Unsupervised Pretraining
- Authors: Neena Aloysius, Geetha M, Prema Nedungadi,
- Abstract summary: State-of-the-art Conformer model for Speech Recognition is adapted for continuous sign language recognition.
This marks the first instance of employing Conformer for a vision-based task.
Unsupervised pretraining is conducted on a curated sign language dataset.
- Score: 0.6144680854063939
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conventional Deep Learning frameworks for continuous sign language recognition (CSLR) are comprised of a single or multi-modal feature extractor, a sequence-learning module, and a decoder for outputting the glosses. The sequence learning module is a crucial part wherein transformers have demonstrated their efficacy in the sequence-to-sequence tasks. Analyzing the research progress in the field of Natural Language Processing and Speech Recognition, a rapid introduction of various transformer variants is observed. However, in the realm of sign language, experimentation in the sequence learning component is limited. In this work, the state-of-the-art Conformer model for Speech Recognition is adapted for CSLR and the proposed model is termed ConSignformer. This marks the first instance of employing Conformer for a vision-based task. ConSignformer has bimodal pipeline of CNN as feature extractor and Conformer for sequence learning. For improved context learning we also introduce Cross-Modal Relative Attention (CMRA). By incorporating CMRA into the model, it becomes more adept at learning and utilizing complex relationships within the data. To further enhance the Conformer model, unsupervised pretraining called Regressional Feature Extraction is conducted on a curated sign language dataset. The pretrained Conformer is then fine-tuned for the downstream recognition task. The experimental results confirm the effectiveness of the adopted pretraining strategy and demonstrate how CMRA contributes to the recognition process. Remarkably, leveraging a Conformer-based backbone, our model achieves state-of-the-art performance on the benchmark datasets: PHOENIX-2014 and PHOENIX-2014T.
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z) - A Transformer Model for Boundary Detection in Continuous Sign Language [55.05986614979846]
The Transformer model is employed for both Isolated Sign Language Recognition and Continuous Sign Language Recognition.
The training process involves using isolated sign videos, where hand keypoint features extracted from the input video are enriched.
The trained model, coupled with a post-processing method, is then applied to detect isolated sign boundaries within continuous sign videos.
arXiv Detail & Related papers (2024-02-22T17:25:01Z) - FLIP: Fine-grained Alignment between ID-based Models and Pretrained Language Models for CTR Prediction [49.510163437116645]
Click-through rate (CTR) prediction plays as a core function module in personalized online services.
Traditional ID-based models for CTR prediction take as inputs the one-hot encoded ID features of tabular modality.
Pretrained Language Models(PLMs) has given rise to another paradigm, which takes as inputs the sentences of textual modality.
We propose to conduct Fine-grained feature-level ALignment between ID-based Models and Pretrained Language Models(FLIP) for CTR prediction.
arXiv Detail & Related papers (2023-10-30T11:25:03Z) - Scalable Learning of Latent Language Structure With Logical Offline
Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z) - SignBERT+: Hand-model-aware Self-supervised Pre-training for Sign
Language Understanding [132.78015553111234]
Hand gesture serves as a crucial role during the expression of sign language.
Current deep learning based methods for sign language understanding (SLU) are prone to over-fitting due to insufficient sign data resource.
We propose the first self-supervised pre-trainable SignBERT+ framework with model-aware hand prior incorporated.
arXiv Detail & Related papers (2023-05-08T17:16:38Z) - Improving Continuous Sign Language Recognition with Consistency
Constraints and Signer Removal [24.537234147678113]
We propose three auxiliary tasks to enhance the CSLR backbones.
A keypoint-guided spatial attention module is developed to enforce the visual module.
A sentence embedding consistency constraint is imposed between the visual and sequential modules.
Our model achieves state-of-the-art or competitive performance on five benchmarks.
arXiv Detail & Related papers (2022-12-26T06:38:34Z) - Hierarchical Conditional End-to-End ASR with CTC and Multi-Granular
Subword Units [19.668440671541546]
In end-to-end automatic speech recognition, a model is expected to implicitly learn representations suitable for recognizing a word-level sequence.
We propose a hierarchical conditional model that is based on connectionist temporal classification ( CTC)
Experimental results on LibriSpeech-100h, 960h and TEDLIUM2 demonstrate that the proposed model improves over a standard CTC-based model.
arXiv Detail & Related papers (2021-10-08T13:15:58Z) - COCO-LM: Correcting and Contrasting Text Sequences for Language Model
Pretraining [59.169836983883656]
COCO-LM is a new self-supervised learning framework that pretrains Language Models by COrrecting challenging errors and COntrasting text sequences.
COCO-LM employs an auxiliary language model to mask-and-predict tokens in original text sequences.
Our analyses reveal that COCO-LM's advantages come from its challenging training signals, more contextualized token representations, and regularized sequence representations.
arXiv Detail & Related papers (2021-02-16T22:24:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.