A Signer-Invariant Conformer and Multi-Scale Fusion Transformer for Continuous Sign Language Recognition
- URL: http://arxiv.org/abs/2508.09372v1
- Date: Tue, 12 Aug 2025 21:59:53 GMT
- Title: A Signer-Invariant Conformer and Multi-Scale Fusion Transformer for Continuous Sign Language Recognition
- Authors: Md Rezwanul Haque, Md. Milon Islam, S M Taslim Uddin Raju, Fakhri Karray
- Abstract summary: Continuous Sign Language Recognition faces multiple challenges, including inter-signer variability and poor generalization to novel sentence structures. For the Signer-Independent (SI) challenge, we propose a Signer-Invariant Conformer that combines convolutions with multi-head self-attention. For the Unseen-Sentences (US) task, we designed a Multi-Scale Fusion Transformer with a novel dual-path temporal encoder. The proposed conformer architecture achieves a Word Error Rate (WER) of 13.07% on the SI challenge, a reduction of 13.53% from the state-of-the-art.
- Score: 1.3499500088995462
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Continuous Sign Language Recognition (CSLR) faces multiple challenges, including significant inter-signer variability and poor generalization to novel sentence structures. Traditional solutions frequently fail to handle these issues efficiently. To overcome these constraints, we propose a dual-architecture framework. For the Signer-Independent (SI) challenge, we propose a Signer-Invariant Conformer that combines convolutions with multi-head self-attention to learn robust, signer-agnostic representations from pose-based skeletal keypoints. For the Unseen-Sentences (US) task, we designed a Multi-Scale Fusion Transformer with a novel dual-path temporal encoder that captures fine-grained posture dynamics, enabling the model to comprehend novel grammatical compositions. Experiments on the challenging Isharah-1000 dataset establish a new standard for both CSLR benchmarks. The proposed conformer architecture achieves a Word Error Rate (WER) of 13.07% on the SI challenge, a reduction of 13.53% from the state-of-the-art. On the US task, the transformer model scores a WER of 47.78%, surpassing previous work. In the SignEval 2025 CSLR challenge, our team placed 2nd in the US task and 4th in the SI task, demonstrating the performance of these models. The findings validate our key hypothesis: developing task-specific networks designed for the particular challenges of CSLR leads to considerable performance improvements and establishes a new baseline for further research. The source code is available at: https://github.com/rezwanh001/MSLR-Pose86K-CSLR-Isharah.
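The abstract reports results as Word Error Rate (WER). As a rough illustration of how that metric is commonly computed, the sketch below uses the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a predicted and a reference gloss sequence, divided by the reference length. It is not taken from the authors' repository; the function names and the example gloss strings are hypothetical, and the challenge's official scorer may differ in details such as tokenization.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitution, deletion, insertion)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER for one sentence: edit distance divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def corpus_wer(pairs):
    """Corpus-level WER: total edits over total reference words."""
    total_edits = 0
    total_words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.split()
        total_edits += edit_distance(ref_words, hypothesis.split())
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references must contain at least one word")
    return total_edits / total_words


# Hypothetical gloss sequences, for illustration only.
print(word_error_rate("HELLO MY NAME AHMED", "HELLO NAME AHMED"))  # 0.25
```

Under this definition a WER of 13.07% corresponds to roughly 13 word-level edits per 100 reference words.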
Related papers
- UniWeTok: An Unified Binary Tokenizer with Codebook Size $\mathit{2^{128}}$ for Unified Multimodal Large Language Model [50.68870074090426]
We introduce UniWeTok, a unified discrete tokenizer for Unified Multimodal Large Language Models.
For the training framework, we introduce Pre-Post Distillation and a Generative-Aware Prior to enhance the semantic extraction and generative prior of the discrete tokens.
We propose a three-stage training framework designed to enhance UniWeTok's adaptability across various image resolutions and perception-sensitive scenarios.
arXiv Detail & Related papers (2026-02-15T15:07:19Z)
- MaDiS: Taming Masked Diffusion Language Models for Sign Language Generation [78.75809158246723]
We present MaDiS, a masked-diffusion-based language model for SLG that captures bidirectional and supports efficient parallel multi-token generation.
We also introduce a tri-level cross-modal pretraining scheme that jointly learns from token-, latent-Hearing, and 3D-space objectives.
MaDiS achieves superior performance across multiple metrics, including DTW error and two newly introduced metrics, SiBLEU and SiCLIP, while reducing inference latency by nearly 30%.
arXiv Detail & Related papers (2026-01-27T13:06:47Z)
- Reasoning-Aware Prompt Orchestration: A Foundation Model for Multi-Agent Language Model Coordination [0.0]
We present a theoretically-grounded framework for dynamic prompt orchestration that enhances reasoning across multiple specialized agents.
This framework addresses three core challenges: logical consistency preservation during agent transitions, reasoning-aware prompt adaptation, and scalable coordination of distributed inference.
Experimental results on 1,000 synthetic multi-agent conversations demonstrate a 42% reduction in reasoning latency, a 23% improvement in logical consistency measured by ROUGE-L score, and an 89% success rate for task completion without context loss.
arXiv Detail & Related papers (2025-09-30T22:33:01Z)
- MultiStream-LLM: Bridging Modalities for Robust Sign Language Translation [19.197388907510746]
We introduce MultiStream-LLM, a modular framework designed to overcome limitations in automated sign language translation.
Our method establishes a new state-of-the-art on the How2Sign benchmark with a BLEU-4 score of 23.5 and 73.2% letter accuracy on the ChicagoFSWildPlus fingerspelling dataset.
arXiv Detail & Related papers (2025-08-20T17:44:47Z)
- Small transformer architectures for task switching [2.7195102129095003]
It is non-trivial to conceive small-scale applications for which attention-based architectures outperform traditional approaches.
We show that standard transformers cannot solve a basic task switching reference model.
We show that transformers, long short-term memory recurrent networks (LSTM), and plain multi-layer perceptrons (MLPs) achieve similar, but only modest prediction accuracies.
arXiv Detail & Related papers (2025-08-06T14:01:05Z)
- CSLRConformer: A Data-Centric Conformer Approach for Continuous Arabic Sign Language Recognition on the Isharah Datase [0.0]
This paper addresses the challenge of signer-independent recognition to advance the capabilities of Continuous Sign Language Recognition systems.
A data-centric methodology is proposed, centered on systematic feature engineering, a robust preprocessing pipeline, and an optimized model architecture.
The architecture adapts the hybrid CNN-Transformer design of the Conformer model, leveraging its capacity to model local temporal dependencies and global sequence context.
arXiv Detail & Related papers (2025-08-03T14:58:50Z)
- Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition [16.68658893305642]
Handwritten Mathematical Expression Recognition (HMER) remains a persistent challenge in Optical Character Recognition (OCR).
We introduce Uni-MuMER, which fully fine-tunes a vision-language model for the HMER task without modifying its architecture.
Our method integrates three data-driven tasks: Tree-Aware Chain-of-Thought (Tree-CoT) for structured spatial reasoning, Error-Driven Learning (EDL) for reducing confusion among visually similar characters, and Symbol Counting (SC) for improving recognition consistency in long expressions.
arXiv Detail & Related papers (2025-05-29T15:41:00Z)
- Stack Transformer Based Spatial-Temporal Attention Model for Dynamic Sign Language and Fingerspelling Recognition [1.949837893170278]
Hand gesture-based Sign Language Recognition serves as a crucial bridge between deaf and non-deaf individuals.
We propose the Sequential Spatio-Temporal Attention Network (SSTAN), a novel Transformer-based architecture.
We validated our model through extensive experiments on diverse, large-scale datasets.
arXiv Detail & Related papers (2025-03-21T04:57:18Z)
- A Transformer Model for Boundary Detection in Continuous Sign Language [55.05986614979846]
The Transformer model is employed for both Isolated Sign Language Recognition and Continuous Sign Language Recognition.
The training process involves using isolated sign videos, where hand keypoint features extracted from the input video are enriched.
The trained model, coupled with a post-processing method, is then applied to detect isolated sign boundaries within continuous sign videos.
arXiv Detail & Related papers (2024-02-22T17:25:01Z)
- An Empirical Study of Multimodal Model Merging [148.48412442848795]
Model merging is a technique that fuses multiple models trained on different tasks to generate a multi-task solution.
We conduct our study for a novel goal where we can merge vision, language, and cross-modal transformers of a modality-specific architecture.
We propose two metrics that assess the distance between weights to be merged and can serve as an indicator of the merging outcomes.
arXiv Detail & Related papers (2023-04-28T15:43:21Z)
- Relaxed Attention for Transformer Models [29.896876421216373]
In this paper, we explore relaxed attention, a simple and easy-to-implement smoothing of the attention weights.
We show that relaxed attention provides regularization when applied to the self-attention layers in the encoder.
We demonstrate the benefit of relaxed attention across several tasks with clear improvement in combination with recent benchmark approaches.
arXiv Detail & Related papers (2022-09-20T14:10:28Z)
- Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $\times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
- Switchable Representation Learning Framework with Self-compatibility [50.48336074436792]
We propose a Switchable representation learning Framework with Self-Compatibility (SFSC).
SFSC generates a series of compatible sub-models with different capacities through one training process.
SFSC achieves state-of-the-art performance on the evaluated datasets.
arXiv Detail & Related papers (2022-06-16T16:46:32Z)
- Shunted Self-Attention via Multi-Scale Token Aggregation [124.16925784748601]
Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks.
We propose shunted self-attention (SSA) that allows ViTs to model the attentions at hybrid scales per attention layer.
The SSA-based transformer achieves 84.0% Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet.
arXiv Detail & Related papers (2021-11-30T08:08:47Z)
- Inducing Transformer's Compositional Generalization Ability via Auxiliary Sequence Prediction Tasks [86.10875837475783]
Systematic compositionality is an essential mechanism in human language, allowing the recombination of known parts to create novel expressions.
Existing neural models have been shown to lack this basic ability in learning symbolic structures.
We propose two auxiliary sequence prediction tasks that track the progress of function and argument semantics.
arXiv Detail & Related papers (2021-09-30T16:41:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.