Training Strategies for Isolated Sign Language Recognition
- URL: http://arxiv.org/abs/2412.11553v1
- Date: Mon, 16 Dec 2024 08:37:58 GMT
- Title: Training Strategies for Isolated Sign Language Recognition
- Authors: Karina Kvanchiani, Roman Kraynov, Elizaveta Petrova, Petr Surovcev, Aleksandr Nagaev, Alexander Kapitanov,
- Abstract summary: This paper introduces a comprehensive model training pipeline for Isolated Sign Language Recognition.
The constructed pipeline incorporates carefully selected image and video augmentations to tackle the challenges of low data quality and varying sign speeds.
We achieve a state-of-the-art result on the WLASL and Slovo benchmarks with 1.63% and 14.12% improvements compared to the previous best solution.
- Score: 72.27323884094953
- License:
- Abstract: This paper introduces a comprehensive model training pipeline for Isolated Sign Language Recognition (ISLR) designed to accommodate the distinctive characteristics and constraints of the Sign Language (SL) domain. The constructed pipeline incorporates carefully selected image and video augmentations to tackle the challenges of low data quality and varying sign speeds. Including an additional regression head combined with IoU-balanced classification loss enhances the model's awareness of the gesture and simplifies capturing temporal information. Extensive experiments demonstrate that the developed training pipeline easily adapts to different datasets and architectures. Additionally, the ablation study shows that each proposed component expands the potential to consider ISLR task specifics. The presented strategies improve recognition performance on a broad set of ISLR benchmarks. Moreover, we achieved a state-of-the-art result on the WLASL and Slovo benchmarks with 1.63% and 14.12% improvements compared to the previous best solution, respectively.
Related papers
- Graph Structure Refinement with Energy-based Contrastive Learning [56.957793274727514]
We introduce an unsupervised method based on a joint of generative training and discriminative training to learn graph structure and representation.
We propose an Energy-based Contrastive Learning (ECL) guided Graph Structure Refinement (GSR) framework, denoted as ECL-GSR.
ECL-GSR achieves faster training with fewer samples and memories against the leading baseline, highlighting its simplicity and efficiency in downstream tasks.
arXiv Detail & Related papers (2024-12-20T04:05:09Z) - ACTRESS: Active Retraining for Semi-supervised Visual Grounding [52.08834188447851]
A previous study, RefTeacher, makes the first attempt to tackle this task by adopting the teacher-student framework to provide pseudo confidence supervision and attention-based supervision.
This approach is incompatible with current state-of-the-art visual grounding models, which follow the Transformer-based pipeline.
Our paper proposes the ACTive REtraining approach for Semi-Supervised Visual Grounding, abbreviated as ACTRESS.
arXiv Detail & Related papers (2024-07-03T16:33:31Z) - REAL: Representation Enhanced Analytic Learning for Exemplar-free Class-incremental Learning [12.197327462627912]
We propose a representation enhanced analytic learning (REAL) for Exemplar-free class-incremental learning (EFCIL)
The REAL constructs a dual-stream base pretraining (DS-BPT) and a representation enhancing distillation (RED) process to enhance the representation of the extractor.
Our method addresses the issue of insufficient discriminability in representations of unseen data caused by a frozen backbone in the existing AL-based CIL.
arXiv Detail & Related papers (2024-03-20T11:48:10Z) - RESMatch: Referring Expression Segmentation in a Semi-Supervised Manner [16.280644319404946]
Referring expression segmentation (RES) is a task that involves localizing specific instance-level objects based on free-form linguistic descriptions.
This paper introduces RESMatch, the first semi-supervised learning (SSL) approach for RES, aimed at reducing reliance on exhaustive data annotation.
arXiv Detail & Related papers (2024-02-08T11:40:50Z) - Self-Supervised Video Transformers for Isolated Sign Language
Recognition [19.72944125318495]
We consider four recently introduced transformer-based approaches to self-supervised learning from videos, and four pre-training data regimes.
MaskFeat achieves performance superior to pose-based and supervised video models, with a top-1 accuracy of 79.02% on gloss-based WLASL2000.
arXiv Detail & Related papers (2023-09-02T03:00:03Z) - Exploring the Integration of Speech Separation and Recognition with
Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate in reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z) - FeaRLESS: Feature Refinement Loss for Ensembling Self-Supervised
Learning Features in Robust End-to-end Speech Recognition [34.40924909515384]
We propose to investigate effectiveness of diverse SSLR combinations using various fusion methods within end-to-end (E2E) ASR models.
We show that the proposed 'FeaRLESS learning features' perform better than systems without the proposed feature refinement loss for both the WSJ and Fearless Steps Challenge (FSC) corpora.
arXiv Detail & Related papers (2022-06-30T06:39:40Z) - Multi-Augmentation for Efficient Visual Representation Learning for
Self-supervised Pre-training [1.3733988835863333]
We propose Multi-Augmentations for Self-Supervised Learning (MA-SSRL), which fully searched for various augmentation policies to build the entire pipeline.
MA-SSRL successfully learns the invariant feature representation and presents an efficient, effective, and adaptable data augmentation pipeline for self-supervised pre-training.
arXiv Detail & Related papers (2022-05-24T04:18:39Z) - Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z) - Adaptive Adversarial Logits Pairing [65.51670200266913]
An adversarial training solution Adversarial Logits Pairing (ALP) tends to rely on fewer high-contribution features compared with vulnerable ones.
Motivated by these observations, we design an Adaptive Adversarial Logits Pairing (AALP) solution by modifying the training process and training target of ALP.
AALP consists of an adaptive feature optimization module with Guided Dropout to systematically pursue fewer high-contribution features.
arXiv Detail & Related papers (2020-05-25T03:12:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.