Training Strategies for Isolated Sign Language Recognition
- URL: http://arxiv.org/abs/2412.11553v2
- Date: Mon, 12 May 2025 19:55:49 GMT
- Title: Training Strategies for Isolated Sign Language Recognition
- Authors: Karina Kvanchiani, Roman Kraynov, Elizaveta Petrova, Petr Surovcev, Aleksandr Nagaev, Alexander Kapitanov
- Abstract summary: This paper introduces a comprehensive model training pipeline for Isolated Sign Language Recognition. The constructed pipeline incorporates carefully selected image and video augmentations to tackle the challenges of low data quality and varying sign speeds.
- Score: 72.27323884094953
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Accurate recognition and interpretation of sign language are crucial for enhancing communication accessibility for deaf and hard of hearing individuals. However, current approaches to Isolated Sign Language Recognition (ISLR) often face challenges such as low data quality and variability in gesturing speed. This paper introduces a comprehensive model training pipeline for ISLR designed to accommodate the distinctive characteristics and constraints of the Sign Language (SL) domain. The constructed pipeline incorporates carefully selected image and video augmentations to tackle the challenges of low data quality and varying sign speeds. Including an additional regression head combined with an IoU-balanced classification loss enhances the model's awareness of the gesture and simplifies the capture of temporal information. Extensive experiments demonstrate that the developed training pipeline adapts easily to different datasets and architectures. Additionally, the ablation study shows that each proposed component expands the pipeline's ability to address ISLR task specifics. The presented strategies enhance recognition performance across various ISLR benchmarks and achieve state-of-the-art results on the WLASL and Slovo datasets.
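The abstract names two concrete mechanisms: an auxiliary regression head for the gesture's temporal boundaries and an IoU-balanced classification loss. A minimal PyTorch sketch of how such heads and loss could be wired up is given below; the module names, the sigmoid boundary parameterization, and the exact IoU re-weighting are assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ISLRHead(nn.Module):
    """Classification head plus a temporal-boundary regression head
    on top of a video backbone's pooled features."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.cls_head = nn.Linear(feat_dim, num_classes)
        # Predicts normalized (start, end) of the gesture within the clip.
        self.reg_head = nn.Linear(feat_dim, 2)

    def forward(self, feats: torch.Tensor):
        return self.cls_head(feats), self.reg_head(feats).sigmoid()

def temporal_iou(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """1-D IoU between predicted and ground-truth (start, end) segments."""
    inter = (torch.min(pred[:, 1], target[:, 1])
             - torch.max(pred[:, 0], target[:, 0])).clamp(min=0)
    union = (pred[:, 1] - pred[:, 0]) + (target[:, 1] - target[:, 0]) - inter
    return inter / union.clamp(min=1e-6)

def iou_balanced_loss(logits, labels, pred_seg, gt_seg, reg_weight=1.0):
    """Per-sample cross-entropy re-weighted by temporal IoU, plus an L1
    regression term for the gesture boundaries (an assumed formulation)."""
    iou = temporal_iou(pred_seg, gt_seg).detach()
    ce = F.cross_entropy(logits, labels, reduction="none")
    cls_loss = (ce * iou).sum() / iou.sum().clamp(min=1e-6)
    reg_loss = F.l1_loss(pred_seg, gt_seg)
    return cls_loss + reg_weight * reg_loss
```

Detaching the IoU before weighting keeps the classification weighting from back-propagating into the regression head, which receives gradients only through the L1 term, so poorly localized samples simply contribute less to the classification loss.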
Related papers
- An Empirical Study of Federated Prompt Learning for Vision Language Model [50.73746120012352]
This paper systematically investigates behavioral differences between language prompt learning and vision prompt learning. We conduct experiments to evaluate the impact of various FL and prompt configurations, such as client scale, aggregation strategies, and prompt length. We explore strategies for enhancing prompt learning in complex scenarios where label skew and domain shift coexist.
arXiv Detail & Related papers (2025-05-29T03:09:15Z) - SSLR: A Semi-Supervised Learning Method for Isolated Sign Language Recognition [2.409285779772107]
Sign language recognition systems aim to recognize sign gestures and translate them into spoken language. One of the main challenges in SLR is the scarcity of annotated datasets. We propose a semi-supervised learning approach for SLR, employing a pseudo-label method to annotate unlabeled samples.
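As a rough illustration of the pseudo-label method mentioned above, the following confidence-thresholded training step is a generic sketch, not the paper's exact SSLR recipe; `model`, `optimizer`, and the threshold value are assumed placeholders.

```python
import torch
import torch.nn.functional as F

def pseudo_label_step(model, optimizer, labeled_batch, unlabeled_batch,
                      threshold=0.95, unlabeled_weight=1.0):
    """One semi-supervised step: supervised loss on labeled clips plus a
    pseudo-label loss on confident unlabeled predictions."""
    x_l, y_l = labeled_batch
    x_u = unlabeled_batch

    # Generate pseudo-labels from the model's own confident predictions.
    with torch.no_grad():
        probs = F.softmax(model(x_u), dim=-1)
        conf, pseudo_y = probs.max(dim=-1)
        mask = conf >= threshold  # keep only confident samples

    loss = F.cross_entropy(model(x_l), y_l)
    if mask.any():
        loss = loss + unlabeled_weight * F.cross_entropy(
            model(x_u[mask]), pseudo_y[mask])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```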
arXiv Detail & Related papers (2025-04-23T11:59:52Z) - Underlying Semantic Diffusion for Effective and Efficient In-Context Learning [113.4003355229632]
Underlying Semantic Diffusion (US-Diffusion) is an enhanced diffusion model that boosts underlying semantics learning, computational efficiency, and in-context learning capabilities.
We present a Feedback-Aided Learning (FAL) framework, which leverages feedback signals to guide the model in capturing semantic details.
We also propose a plug-and-play Efficient Sampling Strategy (ESS) for dense sampling at time steps with high-noise levels.
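The summary mentions dense sampling at time steps with high noise levels. One simple way to realize such a bias during diffusion training is to oversample the high-noise end of the schedule, as sketched below; the split point and probabilities are illustrative assumptions, not the paper's ESS.

```python
import torch

def sample_timesteps(batch_size, num_steps=1000, high_noise_frac=0.5,
                     high_noise_prob=0.8, device="cpu"):
    """Draw diffusion training timesteps with extra density in the
    high-noise region of the schedule."""
    # Timesteps above this index are treated as "high noise".
    split = int(num_steps * (1 - high_noise_frac))
    use_high = torch.rand(batch_size, device=device) < high_noise_prob
    low = torch.randint(0, split, (batch_size,), device=device)
    high = torch.randint(split, num_steps, (batch_size,), device=device)
    return torch.where(use_high, high, low)
```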
arXiv Detail & Related papers (2025-03-06T03:06:22Z) - IncSAR: A Dual Fusion Incremental Learning Framework for SAR Target Recognition [7.9330990800767385]
Models' tendency to forget old knowledge when learning new tasks, known as catastrophic forgetting, remains an open challenge.
In this paper, an incremental learning framework, called IncSAR, is proposed to mitigate catastrophic forgetting in SAR target recognition.
IncSAR comprises a Vision Transformer (ViT) and a custom-designed Convolutional Neural Network (CNN) in individual branches combined through a late-fusion strategy.
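A late-fusion design of this kind can be sketched in a few lines of PyTorch; the constructor arguments and concatenation-based fusion below are assumptions rather than IncSAR's exact architecture.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Two-branch recognizer: features from a ViT branch and a CNN branch
    are computed independently and merged only at the classifier."""
    def __init__(self, vit: nn.Module, cnn: nn.Module,
                 vit_dim: int, cnn_dim: int, num_classes: int):
        super().__init__()
        self.vit, self.cnn = vit, cnn
        self.classifier = nn.Linear(vit_dim + cnn_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Late fusion: concatenate branch features, then classify once.
        f = torch.cat([self.vit(x), self.cnn(x)], dim=-1)
        return self.classifier(f)
```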
arXiv Detail & Related papers (2024-10-08T08:49:47Z) - Context-Aware Predictive Coding: A Representation Learning Framework for WiFi Sensing [0.0]
WiFi sensing is an emerging technology that utilizes wireless signals for various sensing applications.
In this paper, we introduce a novel SSL framework called Context-Aware Predictive Coding (CAPC).
CAPC effectively learns from unlabelled data and adapts to diverse environments.
Our evaluations demonstrate that CAPC not only outperforms other SSL methods and supervised approaches, but also achieves superior generalization capabilities.
arXiv Detail & Related papers (2024-09-16T17:59:49Z) - Content-decoupled Contrastive Learning-based Implicit Degradation Modeling for Blind Image Super-Resolution [33.16889233975723]
Implicit degradation modeling-based blind super-resolution (SR) has attracted increasing attention in the community.
We propose a new Content-decoupled Contrastive Learning-based blind image super-resolution (CdCL) framework.
arXiv Detail & Related papers (2024-08-10T04:51:43Z) - ACTRESS: Active Retraining for Semi-supervised Visual Grounding [52.08834188447851]
A previous study, RefTeacher, makes the first attempt to tackle this task by adopting the teacher-student framework to provide pseudo confidence supervision and attention-based supervision.
This approach is incompatible with current state-of-the-art visual grounding models, which follow the Transformer-based pipeline.
Our paper proposes the ACTive REtraining approach for Semi-Supervised Visual Grounding, abbreviated as ACTRESS.
arXiv Detail & Related papers (2024-07-03T16:33:31Z) - Enhancing Brazilian Sign Language Recognition through Skeleton Image Representation [2.6311088262657907]
This work proposes an Isolated Sign Language Recognition (ISLR) approach where body, hands, and facial landmarks are extracted throughout time and encoded as 2-D images.
We show that our method surpasses the state-of-the-art in terms of performance metrics on two widely recognized datasets in Brazilian Sign Language (LIBRAS).
In addition to being more accurate, our method is more time-efficient and easier to train due to its reliance on a simpler network architecture and solely RGB data as input.
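The encoding of landmarks over time as 2-D images can be illustrated with a small NumPy sketch; the (time x keypoint) layout and channel assignment below are one common choice and may differ from the paper's exact representation.

```python
import numpy as np

def landmarks_to_image(landmarks: np.ndarray) -> np.ndarray:
    """Encode a (T, K, 2) sequence of normalized landmark coordinates as a
    T x K RGB-like image: rows index time, columns index keypoints, and
    the first two channels store x and y."""
    T, K, _ = landmarks.shape
    img = np.zeros((T, K, 3), dtype=np.uint8)
    # Coordinates assumed normalized to [0, 1]; map them to [0, 255].
    img[..., 0] = np.clip(landmarks[..., 0] * 255, 0, 255).astype(np.uint8)
    img[..., 1] = np.clip(landmarks[..., 1] * 255, 0, 255).astype(np.uint8)
    return img

# Example: 64 frames, 75 keypoints (body + hands + face), (x, y) each.
frames = np.random.rand(64, 75, 2)
image = landmarks_to_image(frames)  # shape (64, 75, 3), feedable to a 2-D CNN
```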
arXiv Detail & Related papers (2024-04-29T23:21:17Z) - REAL: Representation Enhanced Analytic Learning for Exemplar-free Class-incremental Learning [12.197327462627912]
We propose representation enhanced analytic learning (REAL) for exemplar-free class-incremental learning (EFCIL).
The REAL constructs a dual-stream base pretraining (DS-BPT) and a representation enhancing distillation (RED) process to enhance the representation of the extractor.
Our method addresses the insufficient discriminability of unseen-data representations caused by the frozen backbone in existing analytic learning (AL)-based CIL methods.
arXiv Detail & Related papers (2024-03-20T11:48:10Z) - RESMatch: Referring Expression Segmentation in a Semi-Supervised Manner [16.280644319404946]
Referring expression segmentation (RES) is a task that involves localizing specific instance-level objects based on free-form linguistic descriptions.
This paper introduces RESMatch, the first semi-supervised learning (SSL) approach for RES, aimed at reducing reliance on exhaustive data annotation.
arXiv Detail & Related papers (2024-02-08T11:40:50Z) - SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by
Visual-Textual Contrastive Learning [51.800031281177105]
SignVTCL is a continuous sign language recognition framework enhanced by visual-textual contrastive learning.
It integrates multi-modal data (video, keypoints, and optical flow) simultaneously to train a unified visual backbone.
It achieves state-of-the-art results compared with previous methods.
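The visual-textual contrastive objective can be sketched as a standard symmetric InfoNCE between video and gloss/text embeddings; this CLIP-style formulation is an assumption, not SignVTCL's exact loss.

```python
import torch
import torch.nn.functional as F

def visual_textual_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: matched video-text pairs are pulled together,
    all other pairs in the batch are pushed apart."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature          # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))
```

In practice the temperature is often a learnable parameter rather than the fixed value assumed here.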
arXiv Detail & Related papers (2024-01-22T11:04:55Z) - Self-Supervised Video Transformers for Isolated Sign Language
Recognition [19.72944125318495]
We consider four recently introduced transformer-based approaches to self-supervised learning from videos, and four pre-training data regimes.
MaskFeat achieves performance superior to pose-based and supervised video models, with a top-1 accuracy of 79.02% on gloss-based WLASL2000.
arXiv Detail & Related papers (2023-09-02T03:00:03Z) - Exploring the Integration of Speech Separation and Recognition with
Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z) - Accelerating exploration and representation learning with offline
pre-training [52.6912479800592]
We show that exploration and representation learning can be improved by separately learning two different models from a single offline dataset.
We show that learning a state representation using noise-contrastive estimation and a model of auxiliary reward can significantly improve the sample efficiency on the challenging NetHack benchmark.
arXiv Detail & Related papers (2023-03-31T18:03:30Z) - FeaRLESS: Feature Refinement Loss for Ensembling Self-Supervised
Learning Features in Robust End-to-end Speech Recognition [34.40924909515384]
We investigate the effectiveness of diverse SSLR combinations using various fusion methods within end-to-end (E2E) ASR models.
We show that the proposed 'FeaRLESS learning features' perform better than systems without the proposed feature refinement loss for both the WSJ and Fearless Steps Challenge (FSC) corpora.
arXiv Detail & Related papers (2022-06-30T06:39:40Z) - Multi-Augmentation for Efficient Visual Representation Learning for
Self-supervised Pre-training [1.3733988835863333]
We propose Multi-Augmentations for Self-Supervised Learning (MA-SSRL), which searches over various augmentation policies to build the entire pipeline.
MA-SSRL successfully learns the invariant feature representation and presents an efficient, effective, and adaptable data augmentation pipeline for self-supervised pre-training.
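For illustration, a fixed multi-augmentation view generator in the SimCLR style might look as follows; MA-SSRL searches its policy automatically, so this particular transform list is only an assumed example.

```python
from torchvision import transforms

# A fixed multi-augmentation pipeline; a searched policy would replace
# this hand-picked list of transforms.
multi_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

def two_views(img):
    """Self-supervised pre-training consumes two augmented views per image."""
    return multi_augment(img), multi_augment(img)
```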
arXiv Detail & Related papers (2022-05-24T04:18:39Z) - Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z) - Improved Speech Emotion Recognition using Transfer Learning and
Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
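Spectrogram augmentation of the kind mentioned here is often implemented as SpecAugment-style time and frequency masking; the sketch below shows one such masking pass, with mask widths chosen as illustrative assumptions.

```python
import torch

def spec_augment(spec: torch.Tensor, freq_mask=8, time_mask=20):
    """SpecAugment-style masking of a (freq, time) spectrogram: zero out
    one random frequency band and one random time span."""
    spec = spec.clone()
    n_freq, n_time = spec.shape
    f0 = torch.randint(0, max(1, n_freq - freq_mask), (1,)).item()
    t0 = torch.randint(0, max(1, n_time - time_mask), (1,)).item()
    spec[f0:f0 + freq_mask, :] = 0.0
    spec[:, t0:t0 + time_mask] = 0.0
    return spec
```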
arXiv Detail & Related papers (2021-08-05T10:39:39Z) - Adaptive Adversarial Logits Pairing [65.51670200266913]
Adversarial Logits Pairing (ALP), an adversarial training solution, tends to rely on fewer high-contribution features than vulnerable models do.
Motivated by these observations, we design an Adaptive Adversarial Logits Pairing (AALP) solution by modifying the training process and training target of ALP.
AALP consists of an adaptive feature optimization module with Guided Dropout to systematically pursue fewer high-contribution features.
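The base logits-pairing objective that AALP modifies can be sketched as follows; the pairing weight and the omission of AALP's adaptive components (Guided Dropout, adaptive weighting) are deliberate simplifications.

```python
import torch.nn.functional as F

def logits_pairing_loss(model, x_clean, x_adv, y, pair_weight=0.5):
    """Adversarial Logits Pairing: cross-entropy on adversarial examples
    plus an L2 penalty pulling clean and adversarial logits together."""
    logits_clean = model(x_clean)
    logits_adv = model(x_adv)
    ce = F.cross_entropy(logits_adv, y)
    pairing = F.mse_loss(logits_adv, logits_clean)
    return ce + pair_weight * pairing
```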
arXiv Detail & Related papers (2020-05-25T03:12:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.