Fine-Tuning Video Transformers for Word-Level Bangla Sign Language: A Comparative Analysis for Classification Tasks
- URL: http://arxiv.org/abs/2506.04367v1
- Date: Wed, 04 Jun 2025 18:29:32 GMT
- Title: Fine-Tuning Video Transformers for Word-Level Bangla Sign Language: A Comparative Analysis for Classification Tasks
- Authors: Jubayer Ahmed Bhuiyan Shawon, Hasan Mahmud, Kamrul Hasan
- Abstract summary: Sign Language Recognition involves the automatic identification and classification of sign gestures from images or video. In Bangladesh, Bangla Sign Language serves as the primary mode of communication for many individuals with hearing impairments. This study fine-tunes state-of-the-art video transformer architectures on a small-scale BdSL dataset.
- Score: 1.633119622546771
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Sign Language Recognition (SLR) involves the automatic identification and classification of sign gestures from images or video, converting them into text or speech to improve accessibility for the hearing-impaired community. In Bangladesh, Bangla Sign Language (BdSL) serves as the primary mode of communication for many individuals with hearing impairments. This study fine-tunes state-of-the-art video transformer architectures -- VideoMAE, ViViT, and TimeSformer -- on BdSLW60 (arXiv:2402.08635), a small-scale BdSL dataset with 60 frequent signs. We standardized the videos to 30 FPS, resulting in 9,307 user trial clips. To evaluate scalability and robustness, the models were also fine-tuned on BdSLW401 (arXiv:2503.02360), a large-scale dataset with 401 sign classes. Additionally, we benchmark performance against public datasets, including LSA64 and WLASL. Data augmentation techniques such as random cropping, horizontal flipping, and short-side scaling were applied to improve model robustness. To ensure balanced evaluation across folds during model selection, we employed 10-fold stratified cross-validation on the training set, while signer-independent evaluation was carried out using held-out test data from unseen users U4 and U8. Results show that video transformer models significantly outperform traditional machine learning and deep learning approaches. Performance is influenced by factors such as dataset size, video quality, frame distribution, frame rate, and model architecture. Among the models, the VideoMAE variant (MCG-NJU/videomae-base-finetuned-kinetics) achieved the highest accuracies of 95.5% on the frame rate corrected BdSLW60 dataset and 81.04% on the front-facing signs of BdSLW401 -- demonstrating strong potential for scalable and accurate BdSL recognition.
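A minimal sketch of how such a fine-tuning run can be assembled with the Hugging Face Transformers and PyTorchVideo APIs is shown below. This is an illustrative reconstruction, not the authors' released code: the MCG-NJU/videomae-base-finetuned-kinetics checkpoint and the augmentations (short-side scaling, random cropping, horizontal flipping) follow the abstract, while DummySignClips, the placeholder label names, and the training hyperparameters are assumptions made only to keep the example self-contained and runnable.

```python
# Sketch of the fine-tuning setup described in the abstract: a VideoMAE
# checkpoint adapted to 60 BdSL word classes with random cropping,
# horizontal flipping, and short-side scaling. All names and hyperparameters
# below are illustrative stand-ins, not the paper's exact configuration.
import torch
from pytorchvideo.transforms import (ApplyTransformToKey, Normalize,
                                     RandomShortSideScale,
                                     UniformTemporalSubsample)
from torchvision.transforms import (Compose, Lambda, RandomCrop,
                                    RandomHorizontalFlip)
from transformers import (Trainer, TrainingArguments,
                          VideoMAEForVideoClassification,
                          VideoMAEImageProcessor)

CHECKPOINT = "MCG-NJU/videomae-base-finetuned-kinetics"  # best-performing variant in the paper
labels = [f"sign_{i:02d}" for i in range(60)]            # placeholder names for the 60 BdSLW60 classes

processor = VideoMAEImageProcessor.from_pretrained(CHECKPOINT)
model = VideoMAEForVideoClassification.from_pretrained(
    CHECKPOINT,
    num_labels=len(labels),
    label2id={l: i for i, l in enumerate(labels)},
    id2label={i: l for i, l in enumerate(labels)},
    ignore_mismatched_sizes=True,  # swap the 400-way Kinetics head for a 60-way head
)

# Augmentations named in the abstract: short-side scaling, random crop, horizontal flip.
train_transform = ApplyTransformToKey(
    key="video",
    transform=Compose([
        UniformTemporalSubsample(model.config.num_frames),  # 16 frames for videomae-base
        Lambda(lambda x: x / 255.0),
        Normalize(processor.image_mean, processor.image_std),
        RandomShortSideScale(min_size=256, max_size=320),
        RandomCrop(224),
        RandomHorizontalFlip(p=0.5),
    ]),
)

class DummySignClips(torch.utils.data.Dataset):
    """Stand-in for the BdSLW60 trials; replace with a real labelled video dataset."""
    def __init__(self, n=8):
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, idx):
        clip = torch.randint(0, 256, (3, 32, 256, 256)).float()  # random (C, T, H, W) clip
        return {"video": train_transform({"video": clip})["video"],
                "label": idx % len(labels)}

def collate_fn(examples):
    # PyTorchVideo yields (C, T, H, W); VideoMAE expects pixel_values of shape (B, T, C, H, W).
    pixel_values = torch.stack([ex["video"].permute(1, 0, 2, 3) for ex in examples])
    return {"pixel_values": pixel_values,
            "labels": torch.tensor([ex["label"] for ex in examples])}

args = TrainingArguments(
    output_dir="videomae-bdslw60",
    per_device_train_batch_size=2,
    learning_rate=5e-5,
    num_train_epochs=1,            # illustrative; the paper's schedule may differ
    remove_unused_columns=False,   # keep the raw video tensors for collate_fn
)
Trainer(model=model, args=args, data_collator=collate_fn,
        train_dataset=DummySignClips()).train()
```

A real experiment would replace DummySignClips with a loader over the 30 FPS-standardized BdSLW60 trials, wrap training in the 10-fold stratified cross-validation described above, and hold out signers U4 and U8 for signer-independent testing.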
Related papers
- LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale [35.58838734226919]
We propose a novel streaming training approach that densely interleaves the ASR words and video frames according to their timestamps. Compared to previous studies in vision-language representation with ASR, our method naturally fits the streaming characteristics of ASR. Experiments show our final LiveCC-7B-Instruct model can surpass advanced 72B models in commentary quality, even when working in real-time mode.
arXiv Detail & Related papers (2025-04-22T16:52:09Z) - SEVERE++: Evaluating Benchmark Sensitivity in Generalization of Video Representation Learning [78.44705665291741]
We present a comprehensive evaluation of modern video self-supervised models. We focus on generalization across four key downstream factors: domain shift, sample efficiency, action granularity, and task diversity. Our analysis shows that, despite architectural advances, transformer-based models remain sensitive to downstream conditions.
arXiv Detail & Related papers (2025-04-08T06:00:28Z) - DeepSound-V1: Start to Think Step-by-Step in the Audio Generation from Videos [4.452513686760606]
We propose a framework for audio generation from videos, leveraging the internal chain-of-thought (CoT) of a multi-modal large language model (MLLM). A corresponding multi-modal reasoning dataset is constructed to facilitate the learning of initial reasoning in audio generation. In experiments, we demonstrate the effectiveness of the proposed framework in reducing misalignment (voice-over) in generated audio.
arXiv Detail & Related papers (2025-03-28T07:56:19Z) - BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding [51.49345400300556]
Large video-language models (VLMs) have demonstrated promising progress in various video understanding tasks. Traditional approaches, such as uniform frame sampling, often allocate resources to irrelevant content. We introduce BOLT, a method to BOost Large VLMs without additional Training, through a comprehensive study of frame selection strategies.
arXiv Detail & Related papers (2025-03-27T13:18:40Z) - SignSpeak: Open-Source Time Series Classification for ASL Translation [0.12499537119440243]
We propose a low-cost, real-time ASL-to-speech translation glove and an exhaustive training dataset of sign language patterns.
We benchmarked this dataset with supervised learning models such as LSTMs, GRUs, and Transformers; our best model achieved 92% accuracy.
Our open-source dataset, models and glove designs provide an accurate and efficient ASL translator while maintaining cost-effectiveness.
arXiv Detail & Related papers (2024-06-27T17:58:54Z) - BdSLW60: A Word-Level Bangla Sign Language Dataset [3.8631510994883254]
We create a comprehensive BdSL word-level dataset named BdSLW60 in an unconstrained and natural setting.
The dataset encompasses 60 Bangla sign words, comprising 9,307 video trials provided by 18 signers under the supervision of a sign language professional.
We benchmark BdSLW60 with a Support Vector Machine (SVM), reaching test accuracy up to 67.6%, and an attention-based bi-LSTM, reaching test accuracy up to 75.1%.
arXiv Detail & Related papers (2024-02-13T18:02:58Z) - Perception Test: A Diagnostic Benchmark for Multimodal Video Models [78.64546291816117]
We propose a novel multimodal video benchmark to evaluate the perception and reasoning skills of pre-trained multimodal models.
The Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities.
The benchmark probes pre-trained models for their transfer capabilities in zero-shot, few-shot, or limited fine-tuning regimes.
arXiv Detail & Related papers (2023-05-23T07:54:37Z) - Revisiting Classifier: Transferring Vision-Language Models for Video Recognition [102.93524173258487]
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research.
In this study, we focus on transferring knowledge for video classification tasks.
We utilize a well-pretrained language model to generate good semantic targets for efficient transfer learning.
arXiv Detail & Related papers (2022-07-04T10:00:47Z) - CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms.
Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner.
Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
arXiv Detail & Related papers (2022-06-29T15:22:01Z) - VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text [60.97904439526213]
Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks.
We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.
arXiv Detail & Related papers (2021-04-22T17:07:41Z) - AUTSL: A Large Scale Multi-modal Turkish Sign Language Dataset and Baseline Methods [6.320141734801679]
We present a new large-scale multi-modal Turkish Sign Language dataset (AUTSL) with a benchmark.
Our dataset consists of 226 signs performed by 43 different signers and 38,336 isolated sign video samples.
We trained several deep learning based models and provide empirical evaluations using the benchmark.
arXiv Detail & Related papers (2020-08-03T15:12:05Z)