A Token-level Contrastive Framework for Sign Language Translation
- URL: http://arxiv.org/abs/2204.04916v3
- Date: Tue, 21 Mar 2023 12:58:01 GMT
- Title: A Token-level Contrastive Framework for Sign Language Translation
- Authors: Biao Fu, Peigen Ye, Liang Zhang, Pei Yu, Cong Hu, Yidong Chen, Xiaodong Shi
- Abstract summary: Sign Language Translation is a promising technology to bridge the communication gap between the deaf and the hearing people.
We propose ConSLT, a novel token-level Contrastive learning framework for Sign Language Translation.
- Score: 9.185037439012952
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sign Language Translation (SLT) is a promising technology to bridge the
communication gap between the deaf and the hearing people. Recently,
researchers have adopted Neural Machine Translation (NMT) methods, which
usually require a large-scale corpus for training, to achieve SLT. However, the
publicly available SLT corpus is very limited, which causes the collapse of the
token representations and the inaccuracy of the generated tokens. To alleviate
this issue, we propose ConSLT, a novel token-level \textbf{Con}trastive
learning framework for \textbf{S}ign \textbf{L}anguage \textbf{T}ranslation,
which learns effective token representations by incorporating token-level
contrastive learning into the SLT decoding process. Concretely, ConSLT treats
each token and its counterpart generated by different dropout masks as positive
pairs during decoding, and then randomly samples $K$ tokens in the vocabulary
that are not in the current sentence to construct negative examples. We conduct
comprehensive experiments on two benchmarks (PHOENIX14T and CSL-Daily) for both
end-to-end and cascaded settings. The experimental results demonstrate that
ConSLT can achieve better translation quality than the strong baselines.
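The pairing scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only states that positives come from two dropout masks and that $K$ negatives are sampled from vocabulary tokens absent from the current sentence. The InfoNCE-style loss, the cosine similarity, the temperature value, the use of embedding-table rows as negative representations, and all function names below are assumptions made for illustration.

```python
import math
import random


def sample_negatives(vocab_size, sentence_ids, k, rng):
    """Sample k distinct token ids that do not occur in the current sentence."""
    excluded = set(sentence_ids)
    candidates = [t for t in range(vocab_size) if t not in excluded]
    return rng.sample(candidates, k)


def dropout(vec, p, rng):
    """Inverted dropout: zero each component with probability p, rescale the rest."""
    return [0.0 if rng.random() < p else x / (1.0 - p) for x in vec]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def token_contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (assumed form): pull the positive close, push negatives away."""
    pos_logit = cosine(anchor, positive) / temperature
    logits = [pos_logit] + [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over the positive and all negatives.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - pos_logit


# Toy usage with a hypothetical random embedding table standing in for decoder states.
rng = random.Random(0)
vocab_size, dim, k = 50, 16, 5
table = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]
sentence = [3, 17, 8, 42]

losses = []
for token_id in sentence:
    view_a = dropout(table[token_id], 0.1, rng)  # first dropout mask
    view_b = dropout(table[token_id], 0.1, rng)  # second dropout mask
    negative_ids = sample_negatives(vocab_size, sentence, k, rng)
    negatives = [table[i] for i in negative_ids]
    losses.append(token_contrastive_loss(view_a, view_b, negatives))

mean_loss = sum(losses) / len(losses)
print(mean_loss)
```

In a real SLT model the two views would come from two forward passes of the decoder with dropout enabled, and this term would be added to the usual translation loss; how the terms are weighted is not stated in the abstract.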
Related papers
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder.
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
arXiv Detail & Related papers (2024-07-07T15:16:19Z)
- Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining [56.26550923909137]
Gloss-Free Sign Language Translation (SLT) is a challenging task due to its cross-domain nature.
We propose a novel Gloss-Free SLT based on Visual-Language Pretraining (GFSLT-)
Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training with masked self-supervised learning to create pre-tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual and Text Decoder from
arXiv Detail & Related papers (2023-07-27T10:59:18Z)
- Dual-Alignment Pre-training for Cross-lingual Sentence Embedding [79.98111074307657]
We propose a dual-alignment pre-training (DAP) framework for cross-lingual sentence embedding.
We introduce a novel representation translation learning (RTL) task, where the model learns to use one-side contextualized token representation to reconstruct its translation counterpart.
Our approach can significantly improve sentence embedding.
arXiv Detail & Related papers (2023-05-16T03:53:30Z)
- VECO 2.0: Cross-lingual Language Model Pre-training with Multi-granularity Contrastive Learning [56.47303426167584]
We propose a cross-lingual pre-trained model VECO2.0 based on contrastive learning with multi-granularity alignments.
Specifically, the sequence-to-sequence alignment is induced to maximize the similarity of the parallel pairs and minimize that of the non-parallel pairs.
Token-to-token alignment is integrated to bridge the gap between synonymous tokens excavated via the thesaurus dictionary from the other unpaired tokens in a bilingual instance.
arXiv Detail & Related papers (2023-04-17T12:23:41Z)
- Weighted Sampling for Masked Language Modeling [12.25238763907731]
We propose two simple and effective Weighted Sampling strategies for masking tokens based on the token frequency and training loss.
We apply these two strategies to BERT and obtain Weighted-Sampled BERT (WSBERT).
arXiv Detail & Related papers (2023-02-28T01:07:39Z)
- Improving Sign Language Translation with Monolingual Data by Sign Back-Translation [105.83166521438463]
We propose a sign back-translation (SignBT) approach, which incorporates massive spoken language texts into sign training.
With a text-to-gloss translation model, we first back-translate the monolingual text to its gloss sequence.
Then, the paired sign sequence is generated by splicing pieces from an estimated gloss-to-sign bank at the feature level.
arXiv Detail & Related papers (2021-05-26T08:49:30Z)
- "Listen, Understand and Translate": Triple Supervision Decouples End-to-end Speech-to-text Translation [49.610188741500274]
An end-to-end speech-to-text translation (ST) system takes audio in a source language and outputs the text in a target language.
Existing methods are limited by the amount of parallel corpus.
We build a system to fully utilize signals in a parallel ST corpus.
arXiv Detail & Related papers (2020-09-21T09:19:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.