Gloss-Free End-to-End Sign Language Translation
- URL: http://arxiv.org/abs/2305.12876v2
- Date: Sat, 27 May 2023 16:43:18 GMT
- Title: Gloss-Free End-to-End Sign Language Translation
- Authors: Kezhou Lin, Xiaohan Wang, Linchao Zhu, Ke Sun, Bang Zhang, Yi Yang
- Abstract summary: We design the Gloss-Free End-to-end sign language translation framework (GloFE).
Our method improves the performance of SLT in the gloss-free setting by exploiting the shared underlying semantics of signs and the corresponding spoken translation.
We obtained state-of-the-art results on large-scale datasets, including OpenASL and How2Sign.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we tackle the problem of sign language translation (SLT)
without gloss annotations. Although intermediate representations such as
glosses have proven effective, gloss annotations are hard to acquire,
especially in large quantities. This limits the domain coverage of translation datasets, thus
handicapping real-world applications. To mitigate this problem, we design the
Gloss-Free End-to-end sign language translation framework (GloFE). Our method
improves the performance of SLT in the gloss-free setting by exploiting the
shared underlying semantics of signs and the corresponding spoken translation.
Common concepts are extracted from the text and used as a weak form of
intermediate representation. The global embedding of these concepts is used as
a query for cross-attention to find the corresponding information within the
learned visual features. In a contrastive manner, we encourage the similarity
of query results between samples that contain the same concepts and decrease
it for samples that do not. We obtained state-of-the-art results on large-scale datasets,
including OpenASL and How2Sign. The code and model will be available at
https://github.com/HenryLittle/GloFE.
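The mechanism above is concrete enough to sketch. Below is a minimal PyTorch reading of the two pieces the abstract describes: global concept embeddings used as cross-attention queries over the learned visual features, and an inter-sample contrastive objective on the query results. This is an illustration of the idea, not the released GloFE code; the module names, tensor shapes, and the InfoNCE-style form of the loss are all assumptions.

```python
# A minimal sketch of the conceptual-anchor mechanism, assuming an
# InfoNCE-style contrastive term; not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConceptAnchorAttention(nn.Module):
    """Global embeddings of text-derived concepts (e.g. verbs and nouns
    extracted from the spoken translation) act as queries in cross-attention
    over the visual features learned from the sign video."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, concept_emb: torch.Tensor, visual_feats: torch.Tensor):
        # concept_emb:  (B, C, D) one global embedding per concept
        # visual_feats: (B, T, D) frame-level features from the visual encoder
        # Each concept query attends to the video to locate matching content.
        attended, _ = self.cross_attn(concept_emb, visual_feats, visual_feats)
        return attended  # (B, C, D): one "query result" per concept


def inter_sample_contrastive_loss(query_results, concept_ids, temperature=0.07):
    """Encourages query results for the same concept to agree across samples
    and pushes apart results for different concepts.

    query_results: (N, D) concept query results flattened across the batch
    concept_ids:   (N,)   integer id of the concept behind each result
    """
    z = F.normalize(query_results, dim=-1)
    sim = z @ z.t() / temperature  # (N, N) pairwise similarities
    # Positives: same concept at another position; the self-pair is excluded.
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos_mask = (concept_ids.unsqueeze(0) == concept_ids.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(self_mask, float("-inf")), dim=1, keepdim=True)
    # Mean log-likelihood of positives, over rows that have at least one.
    has_pos = pos_mask.any(dim=1)
    loss = -(log_prob * pos_mask)[has_pos].sum(dim=1) / pos_mask[has_pos].sum(dim=1)
    return loss.mean()
```

In training, a term like this would sit alongside the usual translation cross-entropy loss, with the attended concept features steering the visual encoder toward the sign segments that realize each concept.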
Related papers
- Improving Gloss-free Sign Language Translation by Reducing Representation Density [38.24463842418624]
Gloss-free sign language translation (SLT) aims to develop well-performing SLT systems without requiring costly gloss annotations.
We identify a representation density problem that can bottleneck the performance of gloss-free SLT.
We introduce a contrastive learning strategy, namely SignCL, which encourages gloss-free models to learn more discriminative feature representation.
arXiv Detail & Related papers (2024-05-23T08:32:58Z)
- Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining [56.26550923909137]
Gloss-Free Sign Language Translation (SLT) is a challenging task due to its cross-domain nature.
We propose a novel Gloss-Free SLT approach based on Visual-Language Pretraining (GFSLT-VLP).
Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training with masked self-supervised learning to create pre-tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual Encoder and Text Decoder from the first stage.
arXiv Detail & Related papers (2023-07-27T10:59:18Z)
- Gloss Attention for Gloss-free Sign Language Translation [60.633146518820325]
We show how gloss annotations make sign language translation easier.
We then propose gloss attention, which enables the model to keep its attention within video segments that have the same semantics locally.
Experimental results on multiple large-scale sign language datasets show that our proposed GASLT model significantly outperforms existing methods.
arXiv Detail & Related papers (2023-07-14T14:07:55Z)
- Cross-modality Data Augmentation for End-to-End Sign Language Translation [66.46877279084083]
End-to-end sign language translation (SLT) aims to convert sign language videos into spoken language texts directly without intermediate representations.
It has been a challenging task due to the modality gap between sign videos and texts and the scarcity of labeled data.
We propose a novel Cross-modality Data Augmentation (XmDA) framework to transfer the powerful gloss-to-text translation capabilities to end-to-end sign language translation.
arXiv Detail & Related papers (2023-05-18T16:34:18Z)
- Considerations for meaningful sign language machine translation based on glosses [6.422262171968398]
In machine translation (MT), sign language translation based on glosses is a prominent approach.
We find that limitations of glosses in general and limitations of specific datasets are not discussed in a transparent manner.
We put forward concrete recommendations for future research on gloss translation.
arXiv Detail & Related papers (2022-11-28T15:51:58Z)
- A Simple Multi-Modality Transfer Learning Baseline for Sign Language Translation [54.29679610921429]
Existing sign language datasets contain only about 10K-20K pairs of sign videos, gloss annotations and texts.
Data is thus a bottleneck for training effective sign language translation models.
The proposed simple multi-modality transfer learning baseline surpasses the previous state-of-the-art results on two sign language translation benchmarks.
arXiv Detail & Related papers (2022-03-08T18:59:56Z)
- Improving Sign Language Translation with Monolingual Data by Sign Back-Translation [105.83166521438463]
We propose a sign back-translation (SignBT) approach, which incorporates massive spoken language texts into sign training.
With a text-to-gloss translation model, we first back-translate the monolingual text to its gloss sequence.
Then, the paired sign sequence is generated by splicing pieces from an estimated gloss-to-sign bank at the feature level.
arXiv Detail & Related papers (2021-05-26T08:49:30Z)
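The SignBT recipe in the last entry is also concrete enough to sketch. Below is a schematic Python rendering of its two steps: back-translate monolingual text to glosses, then splice per-gloss feature clips into a pseudo sign sequence. The `text_to_gloss` model and `gloss_to_sign_bank` here are hypothetical stand-ins for the components the paper estimates from the parallel corpus.

```python
# A schematic sketch of the sign back-translation pipeline; the callables
# and the bank layout are assumptions, not the authors' code.
from typing import Callable, Dict, List, Tuple

import torch


def sign_back_translate(
    monolingual_texts: List[str],
    text_to_gloss: Callable[[str], List[str]],    # text-to-gloss translation model
    gloss_to_sign_bank: Dict[str, torch.Tensor],  # gloss -> (T_g, D) feature clip
) -> List[Tuple[torch.Tensor, str]]:
    """Builds pseudo (sign features, text) pairs from monolingual text:
    (1) back-translate each sentence into a gloss sequence, then
    (2) splice per-gloss feature clips from the bank into one sequence."""
    pseudo_pairs = []
    for text in monolingual_texts:
        glosses = text_to_gloss(text)
        # Keep only glosses with a clip in the bank; a real system would
        # back off to similar glosses rather than silently dropping them.
        clips = [gloss_to_sign_bank[g] for g in glosses if g in gloss_to_sign_bank]
        if clips:
            pseudo_sign = torch.cat(clips, dim=0)  # (sum of T_g, D), spliced
            pseudo_pairs.append((pseudo_sign, text))
    return pseudo_pairs
```

The resulting pseudo pairs then augment the genuine parallel data when training the sign-to-text model.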