SimulSLT: End-to-End Simultaneous Sign Language Translation
- URL: http://arxiv.org/abs/2112.04228v1
- Date: Wed, 8 Dec 2021 11:04:52 GMT
- Title: SimulSLT: End-to-End Simultaneous Sign Language Translation
- Authors: Aoxiong Yin, Zhou Zhao, Jinglin Liu, Weike Jin, Meng Zhang, Xingshan Zeng, Xiaofei He
- Abstract summary: Existing sign language translation methods need to read the entire video before starting to translate.
We propose SimulSLT, the first end-to-end simultaneous sign language translation model.
SimulSLT achieves BLEU scores that exceed those of the latest end-to-end non-simultaneous sign language translation model.
- Score: 55.54237194555432
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sign language translation, a technology of profound social significance, has attracted growing interest from researchers in recent years. However, existing sign language translation methods need to read the entire video before starting to translate, which leads to high inference latency and limits their application in real-life scenarios. To solve this problem, we propose SimulSLT, the first end-to-end simultaneous sign language translation model, which can translate sign language videos into target text concurrently. SimulSLT is composed of a text decoder, a boundary predictor, and a masked encoder. We 1) use the wait-k strategy for simultaneous translation; 2) design a novel boundary predictor based on an integrate-and-fire module that outputs gloss boundaries, which model the correspondence between the sign language video and the gloss; and 3) propose an innovative re-encode method that helps the model obtain richer contextual information by allowing the existing video features to interact fully. Experimental results on the RWTH-PHOENIX-Weather 2014T dataset show that SimulSLT achieves BLEU scores exceeding those of the latest end-to-end non-simultaneous sign language translation model while maintaining low latency, which demonstrates the effectiveness of our method.
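The two scheduling ideas above can be made concrete with a short sketch. The following is a minimal, hypothetical rendering in Python, not the authors' released code: `fire_boundaries` mimics the integrate-and-fire boundary predictor by accumulating per-frame weights and emitting a gloss boundary each time the accumulator crosses a threshold, and `wait_k_translate` implements the wait-k policy, emitting one target token per source unit once k gloss units have been read. All names, the threshold of 1.0, and the dummy decoder are assumptions.

```python
# Minimal sketch (not the authors' code) of SimulSLT's two scheduling ideas:
# an integrate-and-fire boundary predictor and the wait-k policy.

def fire_boundaries(alphas, threshold=1.0):
    """Integrate-and-fire: accumulate per-frame weights and record the
    frame index each time the running sum crosses the threshold."""
    acc, boundaries = 0.0, []
    for t, a in enumerate(alphas):
        acc += a
        if acc >= threshold:
            boundaries.append(t)   # a gloss unit ends at frame t
            acc -= threshold       # keep the residual for the next unit
    return boundaries

def wait_k_translate(units, k, predict_token):
    """Wait-k policy: lag k source units behind, then emit one target
    token per additional READ; keep decoding after the source ends
    until the decoder produces "<eos>"."""
    read, emitted = [], []
    for unit in units:                     # READ one gloss unit
        read.append(unit)
        if len(read) >= k:                 # WRITE once the lag is k
            emitted.append(predict_token(read, emitted))
    while not emitted or emitted[-1] != "<eos>":
        emitted.append(predict_token(read, emitted))
    return emitted[:-1]                    # drop the "<eos>" marker

# Toy demo: frame weights yield two gloss boundaries; a dummy decoder
# simply copies each gloss unit as its "translation".
print(fire_boundaries([0.4, 0.5, 0.3, 0.9]))   # -> [2, 3]
copy = lambda read, out: read[len(out)] if len(out) < len(read) else "<eos>"
print(wait_k_translate(["HELLO", "WORLD"], k=1, predict_token=copy))
# -> ['HELLO', 'WORLD']
```

In the real model the per-frame weights come from the masked encoder and the decoder is a transformer conditioned on the re-encoded features; the sketch only shows the control flow.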
Related papers
- Gloss2Text: Sign Language Gloss translation using LLMs and Semantically Aware Label Smoothing [21.183453511034767]
We propose several advances by leveraging pre-trained large language models (LLMs), data augmentation, and a novel label-smoothing loss function (a sketch of the smoothing idea follows this entry).
Our approach surpasses state-of-the-art performance in Gloss2Text translation.
arXiv Detail & Related papers (2024-07-01T15:46:45Z)
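The Gloss2Text abstract does not specify the loss, so the following is only a plausible sketch of semantically aware label smoothing: rather than spreading the smoothing mass uniformly over the vocabulary, spread it in proportion to each label's semantic similarity to the gold label. The similarity matrix `sim` and the mass `epsilon` are assumptions, not the paper's formulation.

```python
import numpy as np

def semantic_label_smoothing(gold, sim, epsilon=0.1):
    """Build a smoothed target distribution over a vocabulary of size V.
    gold:    index of the reference token
    sim:     (V, V) nonnegative similarity matrix, e.g. cosine similarity
             of label embeddings clipped at zero (assumed given)
    epsilon: total probability mass moved off the gold label
    """
    weights = sim[gold].copy()
    weights[gold] = 0.0                 # smoothing mass goes to other labels
    if weights.sum() == 0.0:            # degenerate row: fall back to uniform
        weights = np.ones_like(weights)
        weights[gold] = 0.0
    target = epsilon * weights / weights.sum()
    target[gold] = 1.0 - epsilon
    return target   # train with cross-entropy: -(target * log_probs).sum()
```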
- T2S-GPT: Dynamic Vector Quantization for Autoregressive Sign Language Production from Text [59.57676466961787]
We propose a novel dynamic vector quantization (DVA-VAE) model that can adjust the encoding length based on the information density in sign language (the underlying quantization step is sketched after this entry).
Experiments conducted on the PHOENIX14T dataset demonstrate the effectiveness of our proposed method.
We propose a new large German sign language dataset, PHOENIX-News, which contains 486 hours of sign language videos, audio, and transcription texts.
arXiv Detail & Related papers (2024-06-11T10:06:53Z)
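The summary above does not explain how DVA-VAE chooses its encoding length, so no attempt is made to reproduce that here; the sketch below shows only the standard vector-quantization step such a model builds on: nearest-codebook lookup with a straight-through gradient. Shapes and names are assumptions.

```python
import torch

def vector_quantize(z, codebook):
    """Standard VQ step: snap each latent vector to its nearest codebook
    entry. z: (N, D) encoder outputs; codebook: (K, D) learned codes.
    Returns quantized latents (with a straight-through gradient) and the
    chosen code indices."""
    dists = torch.cdist(z, codebook)        # (N, K) pairwise distances
    idx = dists.argmin(dim=1)               # nearest code per latent
    q = codebook[idx]                       # (N, D) quantized latents
    return z + (q - z).detach(), idx        # straight-through estimator
```

A dynamic scheme in the spirit of DVA-VAE would additionally decide how many codes to spend per segment based on its information density.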
- Unsupervised Sign Language Translation and Generation [72.01216288379072]
We introduce an unsupervised sign language translation and generation network (USLNet).
USLNet learns from abundant single-modality (text and video) data without parallel sign language data.
We propose a sliding window method to address the issue of aligning variable-length text with video sequences (a generic sketch follows this entry).
arXiv Detail & Related papers (2024-02-12T15:39:05Z)
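USLNet's sliding window method is only named in the summary, so the following is a generic (assumed) realization: slide a fixed-size window over the video features and, for each text-segment embedding, keep the best-matching span by cosine similarity. Window size, stride, and the scoring are all assumptions.

```python
import numpy as np

def sliding_window_align(text_vecs, video_vecs, win=16, stride=8):
    """For each text-segment embedding (rows of text_vecs, shape (S, D)),
    return the (start, end) frame span of the video window (video_vecs,
    shape (T, D)) whose mean feature is most cosine-similar to it."""
    starts = list(range(0, max(1, len(video_vecs) - win + 1), stride))
    windows = np.stack([video_vecs[s:s + win].mean(axis=0) for s in starts])
    windows /= np.linalg.norm(windows, axis=1, keepdims=True) + 1e-8
    spans = []
    for t in text_vecs:
        t = t / (np.linalg.norm(t) + 1e-8)
        s = starts[int((windows @ t).argmax())]
        spans.append((s, min(s + win, len(video_vecs))))
    return spans
```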
- Is context all you need? Scaling Neural Sign Language Translation to Large Domains of Discourse [34.70927441846784]
Sign Language Translation (SLT) is a challenging task that aims to generate spoken language sentences from sign language videos.
We propose a novel multi-modal transformer architecture that tackles the translation task in a context-aware manner, as a human would.
We report significant improvements over state-of-the-art translation performance by using contextual information, nearly doubling the reported BLEU-4 scores of baseline approaches.
arXiv Detail & Related papers (2023-08-18T15:27:22Z)
- Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining [56.26550923909137]
Gloss-Free Sign Language Translation (SLT) is a challenging task due to its cross-domain nature.
We propose a novel gloss-free SLT method based on Visual-Language Pretraining (GFSLT-VLP).
Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training with masked self-supervised learning to create pre-tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual Encoder and Text Decoder from the first stage (the contrastive objective in stage (i) is sketched after this entry).
arXiv Detail & Related papers (2023-07-27T10:59:18Z)
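Stage (i) above pairs visual and textual representations contrastively, in the style of CLIP. A minimal sketch of the symmetric InfoNCE objective commonly used for such pretraining follows; it is a standard formulation, not necessarily the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(vis, txt, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired
    embeddings. vis, txt: (B, D) embeddings of matched video/text pairs;
    the i-th video matches the i-th text, so targets lie on the diagonal."""
    vis = F.normalize(vis, dim=1)
    txt = F.normalize(txt, dim=1)
    logits = vis @ txt.t() / temperature          # (B, B) similarities
    labels = torch.arange(vis.size(0))            # diagonal = positives
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
```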
- Cross-modality Data Augmentation for End-to-End Sign Language Translation [66.46877279084083]
End-to-end sign language translation (SLT) aims to convert sign language videos into spoken language texts directly without intermediate representations.
It has been a challenging task due to the modality gap between sign videos and texts and the scarcity of labeled data.
We propose a novel Cross-modality Data Augmentation (XmDA) framework to transfer the powerful gloss-to-text translation capabilities to end-to-end sign language translation.
arXiv Detail & Related papers (2023-05-18T16:34:18Z)
- SLTUNET: A Simple Unified Model for Sign Language Translation [40.93099095994472]
We propose a simple unified neural model designed to support multiple sign-to-gloss, gloss-to-text and sign-to-text translation tasks (a task-tag sketch follows this entry).
Jointly modeling different tasks endows SLTUNET with the capability to explore the cross-task relatedness that could help narrow the modality gap.
We show in experiments that SLTUNET achieves competitive and even state-of-the-art performance on PHOENIX-2014T and CSL-Daily.
arXiv Detail & Related papers (2023-05-02T20:41:59Z)
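The SLTUNET abstract does not say how the unified model tells the three tasks apart; a common recipe for such multi-task seq2seq models, assumed here for illustration, is to prepend a task tag to each source sequence and share all parameters across tasks.

```python
# Hypothetical multi-task batching for a unified SLT model: one shared
# seq2seq network, with a task tag telling it which mapping to perform.
TASKS = {"sign2gloss", "gloss2text", "sign2text"}

def make_example(task, source, target):
    """Wrap one training pair; the tag is prepended to the source so a
    single shared model can learn all three mappings."""
    assert task in TASKS, f"unknown task: {task}"
    return {"source": [f"<{task}>"] + list(source), "target": target}

batch = [
    make_example("gloss2text", ["MORGEN", "REGEN"], "Morgen regnet es."),
    make_example("sign2gloss", ["feat_t0", "feat_t1"], ["MORGEN", "REGEN"]),
]
```

Mixing such examples in one batch is what lets the shared model exploit the cross-task relatedness mentioned above.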
- A Simple Multi-Modality Transfer Learning Baseline for Sign Language Translation [54.29679610921429]
Existing sign language datasets contain only about 10K-20K pairs of sign videos, gloss annotations and texts.
Data is thus a bottleneck for training effective sign language translation models.
This simple baseline surpasses the previous state-of-the-art results on two sign language translation benchmarks.
arXiv Detail & Related papers (2022-03-08T18:59:56Z)