Towards Online Sign Language Recognition and Translation
- URL: http://arxiv.org/abs/2401.05336v1
- Date: Wed, 10 Jan 2024 18:59:53 GMT
- Title: Towards Online Sign Language Recognition and Translation
- Authors: Ronglai Zuo, Fangyun Wei, Brian Mak
- Abstract summary: We develop a sign language dictionary encompassing all glosses present in a target sign language dataset.
We train an isolated sign language recognition model on augmented signs using both conventional classification loss and our novel saliency loss.
Our online recognition model can be extended to boost the performance of any offline model.
- Score: 41.85360877354916
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The objective of sign language recognition is to bridge the communication gap
between the deaf and the hearing. Numerous previous works train their models
using the well-established connectionist temporal classification (CTC) loss.
During the inference stage, the CTC-based models typically take the entire sign
video as input to make predictions. This type of inference scheme is referred
to as offline recognition. In contrast, while mature speech recognition systems
can efficiently recognize spoken words on the fly, sign language recognition
still falls short due to the lack of practical online solutions. In this work,
we take the first step towards filling this gap. Our approach comprises three
phases: 1) developing a sign language dictionary encompassing all glosses
present in a target sign language dataset; 2) training an isolated sign
language recognition model on augmented signs using both conventional
classification loss and our novel saliency loss; 3) employing a sliding window
approach on the input sign sequence and feeding each sign clip to the
well-optimized model for online recognition. Furthermore, our online
recognition model can be extended to boost the performance of any offline
model, and to support online translation by appending a gloss-to-text network
onto the recognition model. By integrating our online framework with the
previously best-performing offline model, TwoStream-SLR, we achieve new
state-of-the-art performance on three benchmarks: Phoenix-2014, Phoenix-2014T,
and CSL-Daily. Code and models will be available at
https://github.com/FangyunWei/SLRT
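
As a rough illustration of the third phase, the sketch below slides a fixed-length window over an incoming frame stream and feeds each clip to a trained isolated sign language recognition (ISLR) model. The names islr_model and gloss_vocab, the window length, stride, blank/background class, and the duplicate-collapsing post-processing are all illustrative assumptions, not the authors' exact implementation; the repository linked above is authoritative.

    # Illustrative sketch (not the authors' code): sliding-window online recognition
    # with a trained isolated sign language recognition (ISLR) model.
    import torch

    def online_recognize(frames, islr_model, gloss_vocab,
                         window_size=16, stride=8, blank_id=0):
        """Classify each fixed-length clip of an incoming frame stream and
        post-process the per-window predictions into a gloss sequence."""
        islr_model.eval()
        window_preds = []
        with torch.no_grad():
            for start in range(0, max(1, frames.shape[0] - window_size + 1), stride):
                clip = frames[start:start + window_size]        # (T, C, H, W)
                logits = islr_model(clip.unsqueeze(0))           # (1, num_glosses)
                window_preds.append(int(logits.argmax(dim=-1)))

        # Collapse consecutive duplicates and drop the assumed blank/background
        # class so overlapping windows covering the same sign yield one gloss.
        glosses = []
        prev = blank_id
        for gid in window_preds:
            if gid != blank_id and gid != prev:
                glosses.append(gloss_vocab[gid])
            prev = gid
        return glosses

For online translation, a gloss-to-text network could then consume the running gloss sequence produced by this loop, as described in the abstract.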
Related papers
- MS2SL: Multimodal Spoken Data-Driven Continuous Sign Language Production [93.32354378820648]
We propose a unified framework for continuous sign language production, easing communication between sign and non-sign language users.
A sequence diffusion model, utilizing embeddings extracted from text or speech, is crafted to generate sign predictions step by step.
Experiments on How2Sign and PHOENIX14T datasets demonstrate that our model achieves competitive performance in sign language production.
arXiv Detail & Related papers (2024-07-04T13:53:50Z) - Continuous Sign Language Recognition with Adapted Conformer via Unsupervised Pretraining [0.6144680854063939]
The state-of-the-art Conformer model for speech recognition is adapted for continuous sign language recognition.
This marks the first instance of employing a Conformer for a vision-based task.
Unsupervised pretraining is conducted on a curated sign language dataset.
arXiv Detail & Related papers (2024-05-20T13:40:52Z) - Improving Continuous Sign Language Recognition with Adapted Image Models [9.366498095041814]
Large-scale vision-language models (e.g., CLIP) have shown impressive generalization performance over a series of downstream tasks.
To adapt these large vision-language models to continuous sign language recognition with high efficiency, we propose a novel strategy (AdaptSign).
AdaptSign is able to demonstrate superior performance across a series of CSLR benchmarks including PHOENIX14, PHOENIX14-T, CSL-Daily and CSL compared to existing methods.
arXiv Detail & Related papers (2024-04-12T03:43:37Z) - A Transformer Model for Boundary Detection in Continuous Sign Language [55.05986614979846]
The Transformer model is employed for both Isolated Sign Language Recognition and Continuous Sign Language Recognition.
Training uses isolated sign videos, in which hand keypoint features extracted from the input video are enriched.
The trained model, coupled with a post-processing method, is then applied to detect isolated sign boundaries within continuous sign videos.
arXiv Detail & Related papers (2024-02-22T17:25:01Z) - APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain of up to 6.03% over MaPLe (SOTA) on novel classes across 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z) - FLIP: Fine-grained Alignment between ID-based Models and Pretrained Language Models for CTR Prediction [49.510163437116645]
Click-through rate (CTR) prediction serves as a core function module in personalized online services.
Traditional ID-based models for CTR prediction take the one-hot encoded ID features of the tabular modality as inputs.
Pretrained Language Models (PLMs) have given rise to another paradigm, which takes sentences of the textual modality as inputs.
We propose to conduct Fine-grained feature-level ALignment between ID-based Models and Pretrained Language Models (FLIP) for CTR prediction.
arXiv Detail & Related papers (2023-10-30T11:25:03Z) - Improving Continuous Sign Language Recognition with Cross-Lingual Signs [29.077175863743484]
We study the feasibility of utilizing multilingual sign language corpora to facilitate continuous sign language recognition.
We first build two sign language dictionaries containing isolated signs that appear in two datasets.
Then we identify the sign-to-sign mappings between two sign languages via a well-optimized isolated sign language recognition model.
arXiv Detail & Related papers (2023-08-21T15:58:47Z) - Fine-tuning of sign language recognition models: a technical report [0.0]
We focus on investigating two questions: how fine-tuning on datasets from other sign languages helps improve sign recognition quality, and whether sign recognition is possible in real time without using a GPU.
We provide code for reproducing model training experiments, converting models to ONNX format, and inference for real-time gesture recognition.
arXiv Detail & Related papers (2023-02-15T14:36:18Z) - End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented Transformer (Conformer).
In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
arXiv Detail & Related papers (2021-02-12T18:00:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.