Unsupervised Sign Language Translation and Generation
- URL: http://arxiv.org/abs/2402.07726v1
- Date: Mon, 12 Feb 2024 15:39:05 GMT
- Title: Unsupervised Sign Language Translation and Generation
- Authors: Zhengsheng Guo, Zhiwei He, Wenxiang Jiao, Xing Wang, Rui Wang, Kehai
Chen, Zhaopeng Tu, Yong Xu, Min Zhang
- Abstract summary: We introduce an unsupervised sign language translation and generation network (USLNet).
USLNet learns from abundant single-modality (text and video) data without parallel sign language data.
We propose a sliding window method to address the issue of aligning variable-length text with video sequences.
- Score: 72.01216288379072
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motivated by the success of unsupervised neural machine translation (UNMT),
we introduce an unsupervised sign language translation and generation network
(USLNet), which learns from abundant single-modality (text and video) data
without parallel sign language data. USLNet comprises two main components:
single-modality reconstruction modules (text and video) that rebuild the input
from its noisy version in the same modality, and cross-modality back-translation
modules (text-video-text and video-text-video) that reconstruct the input from
its noisy version in the other modality via a back-translation procedure. Unlike
the single-modality back-translation procedure in text-based UNMT, USLNet faces
a cross-modality discrepancy in feature representation: the lengths and feature
dimensions of text and video sequences do not match. We propose a sliding window
method to address the issue of aligning
variable-length text with video sequences. To our knowledge, USLNet is the
first unsupervised sign language translation and generation model capable of
generating both natural language text and sign language video in a unified
manner. Experimental results on the BBC-Oxford Sign Language dataset (BOBSL)
and Open-Domain American Sign Language dataset (OpenASL) reveal that USLNet
achieves competitive results compared to supervised baseline models, indicating
its effectiveness in sign language translation and generation.
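The abstract does not give implementation details of the sliding window alignment, but a minimal sketch can illustrate the underlying idea: mapping a long sequence of video frame features onto a shorter sequence of text-token slots so that the two modalities can be compared token by token. The function name `window_align`, the mean-pooling choice, and the roughly 50%-overlap heuristic below are illustrative assumptions, not USLNet's actual method.

```python
# Hypothetical sketch of sliding-window alignment between variable-length
# video features and a shorter text sequence. Names and pooling strategy
# are assumptions for illustration, not the paper's implementation.
import torch


def window_align(video_feats: torch.Tensor, text_len: int) -> torch.Tensor:
    """Pool video features into `text_len` slots using overlapping windows.

    video_feats: (num_frames, feat_dim) frame-level video features.
    text_len:    number of text tokens to align to.
    Returns:     (text_len, feat_dim) pooled features, one slot per token.
    """
    num_frames, _ = video_feats.shape
    # Stride so that the windows roughly cover the whole clip;
    # window size of 2 * stride gives ~50% overlap between neighbours.
    stride = max(1, num_frames // text_len)
    window = min(num_frames, 2 * stride)
    pooled = []
    for i in range(text_len):
        start = min(i * stride, num_frames - 1)
        end = min(start + window, num_frames)
        pooled.append(video_feats[start:end].mean(dim=0))
    return torch.stack(pooled)  # (text_len, feat_dim)


# Example: 120 video frames aligned to a 15-token sentence.
video_feats = torch.randn(120, 512)
aligned = window_align(video_feats, text_len=15)
print(aligned.shape)  # torch.Size([15, 512])
```

In USLNet such an alignment would feed the cross-modality back-translation modules (e.g. pooled video features standing in for pseudo-text when reconstructing the original input); the sketch covers only the length-matching step.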
Related papers
- Is context all you need? Scaling Neural Sign Language Translation to Large Domains of Discourse [34.70927441846784]
Sign Language Translation (SLT) is a challenging task that aims to generate spoken language sentences from sign language videos.
We propose a novel multi-modal transformer architecture that tackles the translation task in a context-aware manner, as a human would.
We report significant improvements on state-of-the-art translation performance using contextual information, nearly doubling the reported BLEU-4 scores of baseline approaches.
arXiv Detail & Related papers (2023-08-18T15:27:22Z)
- Cross-modality Data Augmentation for End-to-End Sign Language Translation [66.46877279084083]
End-to-end sign language translation (SLT) aims to convert sign language videos into spoken language texts directly without intermediate representations.
It has been a challenging task due to the modality gap between sign videos and texts and the scarcity of labeled data.
We propose a novel Cross-modality Data Augmentation (XmDA) framework to transfer the powerful gloss-to-text translation capabilities to end-to-end sign language translation.
arXiv Detail & Related papers (2023-05-18T16:34:18Z)
- SLTUNET: A Simple Unified Model for Sign Language Translation [40.93099095994472]
We propose a simple unified neural model designed to support multiple sign-to-gloss, gloss-to-text and sign-to-text translation tasks.
Jointly modeling different tasks endows SLTUNET with the capability to explore the cross-task relatedness that could help narrow the modality gap.
We show in experiments that SLTUNET achieves competitive and even state-of-the-art performance on PHOENIX-2014T and CSL-Daily.
arXiv Detail & Related papers (2023-05-02T20:41:59Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- SimulSLT: End-to-End Simultaneous Sign Language Translation [55.54237194555432]
Existing sign language translation methods need to read the entire video before starting translation.
We propose SimulSLT, the first end-to-end simultaneous sign language translation model.
SimulSLT achieves BLEU scores that exceed those of the latest end-to-end non-simultaneous sign language translation model.
arXiv Detail & Related papers (2021-12-08T11:04:52Z)
- VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z)
- Unsupervised Multimodal Video-to-Video Translation via Self-Supervised Learning [92.17835753226333]
We propose a novel unsupervised video-to-video translation model.
Our model decomposes the style and the content using the specialized UV-decoder structure.
Our model can produce photo-realistic videos in a multimodal way.
arXiv Detail & Related papers (2020-04-14T13:44:30Z)