Conditional Variational Autoencoder for Sign Language Translation with
Cross-Modal Alignment
- URL: http://arxiv.org/abs/2312.15645v1
- Date: Mon, 25 Dec 2023 08:20:40 GMT
- Title: Conditional Variational Autoencoder for Sign Language Translation with
Cross-Modal Alignment
- Authors: Rui Zhao, Liang Zhang, Biao Fu, Cong Hu, Jinsong Su, Yidong Chen
- Abstract summary: Sign language translation (SLT) aims to convert continuous sign language videos into textual sentences.
We propose a novel framework based on a Conditional Variational Autoencoder for SLT (CV-SLT).
CV-SLT consists of two paths with two Kullback-Leibler divergences to regularize the outputs of the encoder and decoder.
- Score: 33.96363443363547
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sign language translation (SLT) aims to convert continuous sign language
videos into textual sentences. As a typical multi-modal task, SLT faces an
inherent modality gap between sign language videos and spoken language text,
which makes cross-modal alignment between the visual and textual modalities
crucial. However, previous studies tend to rely on an intermediate sign gloss
representation to help alleviate the cross-modal problem, thereby neglecting the
alignment across modalities, which may lead to compromised results. To address
this issue, we propose a novel framework based on a Conditional Variational
Autoencoder for SLT (CV-SLT) that facilitates direct and sufficient cross-modal
alignment between sign language videos and spoken language text. Specifically,
our CV-SLT consists of two paths with two Kullback-Leibler (KL) divergences to
regularize the outputs of the encoder and decoder, respectively. In the prior
path, the model solely relies on visual information to predict the target text;
whereas in the posterior path, it simultaneously encodes visual information and
textual knowledge to reconstruct the target text. The first KL divergence
optimizes the conditional variational autoencoder and regularizes the encoder
outputs, while the second KL divergence performs a self-distillation from the
posterior path to the prior path, ensuring the consistency of decoder outputs.
We further enhance the integration of textual information into the posterior path
by employing a shared Attention Residual Gaussian Distribution (ARGD), which
considers the textual information in the posterior path as a residual component
relative to the prior path. Extensive experiments conducted on public datasets
(PHOENIX14T and CSL-daily) demonstrate the effectiveness of our framework,
achieving new state-of-the-art results while significantly alleviating the
cross-modal representation discrepancy.
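To make the two-path objective concrete, the following is a minimal PyTorch-style sketch of how such a training loss could be assembled. It is an illustration only: the module choices (GRU stand-ins for the attention-based encoder/decoder), the mean pooling, the residual parameterization standing in for the shared ARGD, and all names and hyper-parameters are assumptions of this sketch, not the authors' implementation; only the overall structure (prior path, posterior path, a latent KL term, and a decoder-output self-distillation KL) follows the abstract.

```python
# Minimal, illustrative sketch of a two-path CVAE training objective in the
# spirit of CV-SLT. Module choices (GRU stand-ins, mean pooling), names, and
# hyper-parameters are assumptions of this sketch, not the authors' model.
import torch
import torch.nn.functional as F
from torch import nn


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over the latent dimension."""
    return 0.5 * (
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0
    ).sum(-1)


class CVSLTSketch(nn.Module):
    def __init__(self, d_model=512, latent_dim=64, vocab_size=8000):
        super().__init__()
        self.visual_enc = nn.GRU(d_model, d_model, batch_first=True)  # stand-in visual encoder
        self.text_enc = nn.GRU(d_model, d_model, batch_first=True)    # stand-in textual encoder
        self.prior_head = nn.Linear(d_model, 2 * latent_dim)          # (mu_p, logvar_p) from video only
        self.residual_head = nn.Linear(2 * d_model, 2 * latent_dim)   # residual from video + text
        self.decoder = nn.GRU(2 * d_model + latent_dim, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def decode(self, v_ctx, z, tgt_emb):
        # condition the decoder on pooled visual context and a latent sample
        cond = torch.cat([v_ctx, z], dim=-1)
        cond = cond.unsqueeze(1).expand(-1, tgt_emb.size(1), -1)
        h, _ = self.decoder(torch.cat([tgt_emb, cond], dim=-1))
        return self.out(h)  # (B, T, vocab)

    def forward(self, video_feats, tgt_emb, tgt_ids, pad_id=0):
        # prior path: visual information only -> p(z | v)
        v_h, _ = self.visual_enc(video_feats)
        v_pool = v_h.mean(dim=1)
        mu_p, logvar_p = self.prior_head(v_pool).chunk(2, dim=-1)

        # posterior path: visual + textual information -> q(z | v, t),
        # parameterised as a residual on top of the prior (a crude stand-in
        # for the shared Attention Residual Gaussian Distribution)
        t_h, _ = self.text_enc(tgt_emb)
        t_pool = t_h.mean(dim=1)
        d_mu, d_logvar = self.residual_head(
            torch.cat([v_pool, t_pool], dim=-1)
        ).chunk(2, dim=-1)
        mu_q, logvar_q = mu_p + d_mu, logvar_p + d_logvar

        # reparameterised samples for both paths
        z_q = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()
        z_p = mu_p + torch.randn_like(mu_p) * (0.5 * logvar_p).exp()

        logits_post = self.decode(v_pool, z_q, tgt_emb)   # reconstruction (posterior path)
        logits_prior = self.decode(v_pool, z_p, tgt_emb)  # prediction (prior path)

        # reconstruction loss on the posterior path (teacher forcing assumed)
        ce = F.cross_entropy(logits_post.transpose(1, 2), tgt_ids, ignore_index=pad_id)
        # KL 1: regularises the encoder-side latent distributions (CVAE term)
        kl_latent = gaussian_kl(mu_q, logvar_q, mu_p, logvar_p).mean()
        # KL 2: self-distillation from the posterior to the prior decoder outputs
        kl_decoder = F.kl_div(
            F.log_softmax(logits_prior, dim=-1),
            F.softmax(logits_post.detach(), dim=-1),
            reduction="batchmean",
        )
        return ce + kl_latent + kl_decoder  # term weights omitted in this sketch
```

In the actual CV-SLT the encoder and decoder are attention-based and the ARGD forms the posterior as a residual through shared attention; the sketch above only mirrors the loss structure described in the abstract.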
Related papers
- SignAttention: On the Interpretability of Transformer Models for Sign Language Translation [2.079808290618441]
This paper presents the first comprehensive interpretability analysis of a Transformer-based Sign Language Translation model.
We examine the attention mechanisms within the model to understand how it processes and aligns visual input with sequential glosses.
This work contributes to a deeper understanding of SLT models, paving the way for the development of more transparent and reliable translation systems.
arXiv Detail & Related papers (2024-10-18T14:38:37Z)
- Unsupervised Sign Language Translation and Generation [72.01216288379072]
We introduce an unsupervised sign language translation and generation network (USLNet)
USLNet learns from abundant single-modality (text and video) data without parallel sign language data.
We propose a sliding window method to address the issues of aligning variable-length text with video sequences.
arXiv Detail & Related papers (2024-02-12T15:39:05Z)
- Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining [56.26550923909137]
Gloss-Free Sign Language Translation (SLT) is a challenging task due to its cross-domain nature.
We propose a novel Gloss-Free SLT framework based on Visual-Language Pretraining (GFSLT-VLP).
Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training with masked self-supervised learning to create pre-tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained visual encoder and text decoder from the first stage.
arXiv Detail & Related papers (2023-07-27T10:59:18Z)
- Cross-modality Data Augmentation for End-to-End Sign Language Translation [66.46877279084083]
End-to-end sign language translation (SLT) aims to convert sign language videos into spoken language texts directly without intermediate representations.
It has been a challenging task due to the modality gap between sign videos and texts and the data scarcity of labeled data.
We propose a novel Cross-modality Data Augmentation (XmDA) framework to transfer the powerful gloss-to-text translation capabilities to end-to-end sign language translation.
arXiv Detail & Related papers (2023-05-18T16:34:18Z)
- Levenshtein OCR [20.48454415635795]
A novel scene text recognizer based on Vision-Language Transformer (VLT) is presented.
Inspired by Levenshtein Transformer in the area of NLP, the proposed method explores an alternative way for automatically transcribing textual content from cropped natural images.
arXiv Detail & Related papers (2022-09-08T06:46:50Z)
- VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix [59.25846149124199]
This paper proposes a data augmentation method, namely cross-modal CutMix.
CMC transforms natural sentences from the textual view into a multi-modal view.
By attaching cross-modal noise to uni-modal data, it guides models to learn token-level interactions across modalities for better denoising (a toy sketch of this idea follows this list).
arXiv Detail & Related papers (2022-06-17T17:56:47Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled key word prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
- Visual-aware Attention Dual-stream Decoder for Video Captioning [12.139806877591212]
The attention mechanism in current video captioning methods learns to assign a weight to each frame, dynamically guiding the decoder.
This may not explicitly model the correlation and temporal coherence of the visual features extracted from the sequence of frames.
We propose a new Visual-aware Attention (VA) model, which unifies changes of temporal sequence frames with the words at the previous moment.
The effectiveness of the proposed Visual-aware Attention Dual-stream Decoder (VADD) is demonstrated.
arXiv Detail & Related papers (2021-10-16T14:08:20Z)
- Cross Modification Attention Based Deliberation Model for Image Captioning [11.897899189552318]
We propose a universal two-pass decoding framework for image captioning.
A single-pass decoding based model first generates a draft caption according to an input image.
A Deliberation Model then performs the polishing process to refine the draft caption to a better image description.
arXiv Detail & Related papers (2021-09-17T08:38:08Z) - Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments have been conducted on the MS-COCO dataset demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
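The cross-modal CutMix entry above (VLMixer) describes a token-level mixing scheme; as an illustration only, the toy sketch below swaps a random subset of text-token embeddings for associated image-patch embeddings, producing the kind of multi-modal sequence on which a denoising objective could then operate. The function name, the mix ratio, and the assumption that grounded tokens and their patch embeddings are given are placeholders for this sketch, not the paper's actual procedure.

```python
# Toy illustration of a cross-modal CutMix-style mixing step: replace some
# text-token embeddings with visually grounded patch embeddings. All details
# (mix ratio, how grounded tokens are found) are assumptions of this sketch.
import torch


def cross_modal_cutmix(token_emb, grounded_idx, patch_emb, mix_ratio=0.25):
    """
    token_emb:    (T, d) embeddings of a tokenized sentence
    grounded_idx: indices of tokens that have an associated visual patch
    patch_emb:    (len(grounded_idx), d) embeddings of those patches
    Returns the mixed sequence and a mask marking the swapped positions.
    """
    mixed = token_emb.clone()
    swapped = torch.zeros(token_emb.size(0), dtype=torch.bool)
    if len(grounded_idx) == 0:
        return mixed, swapped
    # randomly pick a subset of grounded tokens to replace
    n_swap = max(1, int(mix_ratio * len(grounded_idx)))
    choice = torch.randperm(len(grounded_idx))[:n_swap]
    for j in choice.tolist():
        pos = grounded_idx[j]
        mixed[pos] = patch_emb[j]  # text token -> image patch ("cross-modal noise")
        swapped[pos] = True
    return mixed, swapped


# usage with dummy tensors: the mixed sequence would then be fed to a
# masked/denoising objective to learn token-level cross-modal interactions
tokens = torch.randn(12, 768)              # dummy sentence embeddings
grounded = [2, 5, 9]                       # tokens with matched patches (assumed)
patches = torch.randn(len(grounded), 768)  # dummy patch embeddings
mixed_seq, swap_mask = cross_modal_cutmix(tokens, grounded, patches)
```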