Improving Gloss-free Sign Language Translation by Reducing Representation Density
- URL: http://arxiv.org/abs/2405.14312v2
- Date: Mon, 28 Oct 2024 19:33:30 GMT
- Title: Improving Gloss-free Sign Language Translation by Reducing Representation Density
- Authors: Jinhui Ye, Xing Wang, Wenxiang Jiao, Junwei Liang, Hui Xiong
- Abstract summary: Gloss-free sign language translation (SLT) aims to develop well-performing SLT systems without the costly gloss annotations.
We identify a representation density problem that can bottleneck the performance of gloss-free SLT.
We introduce a contrastive learning strategy, namely SignCL, which encourages gloss-free models to learn more discriminative feature representations.
- Abstract: Gloss-free sign language translation (SLT) aims to develop well-performing SLT systems without the costly gloss annotations, but it currently still lags significantly behind gloss-based approaches. In this paper, we identify a representation density problem that can bottleneck the performance of gloss-free SLT. Specifically, the representation density problem refers to the tendency of visual representations of semantically distinct sign gestures to be closely packed together in feature space, which makes gloss-free methods struggle to distinguish different sign gestures and suffer a sharp performance drop. To address the representation density problem, we introduce a simple but effective contrastive learning strategy, namely SignCL, which encourages gloss-free models to learn more discriminative feature representations in a self-supervised manner. Our experiments demonstrate that the proposed SignCL significantly reduces representation density and improves performance across various translation frameworks. Specifically, SignCL improves the BLEU scores of the Sign Language Transformer and GFSLT-VLP on the CSL-Daily dataset by 39% and 46%, respectively, without any increase in model parameters. Compared to Sign2GPT, a state-of-the-art method based on large-scale pre-trained vision and language models, SignCL achieves better performance with only 35% of its parameters. Implementation and checkpoints are available at https://github.com/JinhuiYE/SignCL.
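For intuition, here is a minimal sketch of what a density-reducing contrastive objective over frame features could look like. The function name, the offset heuristics, and the margin value are illustrative assumptions, not the exact SignCL formulation from the paper (see the repository above for that); the sketch only captures the pull-together/push-apart idea.

```python
import torch
import torch.nn.functional as F

def sign_contrastive_loss(frame_feats: torch.Tensor,
                          pos_offset: int = 1,
                          neg_offset: int = 16,
                          margin: float = 0.5) -> torch.Tensor:
    """Margin-based contrastive loss over a (T, D) sequence of frame
    features from one sign video. Frames `pos_offset` apart are treated
    as positives (likely the same sign); frames `neg_offset` apart are
    treated as negatives (likely different signs)."""
    feats = F.normalize(frame_feats, dim=-1)
    T = feats.size(0)
    if T <= neg_offset:
        return feats.new_zeros(())  # clip too short to sample negatives

    anchors = feats[: T - neg_offset]
    positives = feats[pos_offset : T - neg_offset + pos_offset]
    negatives = feats[neg_offset:]

    pos_dist = (anchors - positives).pow(2).sum(-1)  # pull positives together
    neg_dist = (anchors - negatives).pow(2).sum(-1)  # push negatives apart
    return (pos_dist + F.relu(margin - neg_dist)).mean()
```

Used as an auxiliary term alongside the translation loss, a pull/push objective of this kind spreads out the representations of nearby but semantically distinct gestures, which is the density reduction the abstract describes.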
Related papers
- DiffSLT: Enhancing Diversity in Sign Language Translation via Diffusion Model [9.452839238264286]
We propose DiffSLT, a novel gloss-free sign language translation framework.
DiffSLT transforms random noise into the target latent representation, conditioned on the visual features of the input video (a generic denoising loop of this kind is sketched after this entry).
We also introduce DiffSLT-P, a DiffSLT variant that conditions on pseudo-glosses and visual features, providing key textual guidance and reducing the modality gap.
arXiv Detail & Related papers (2024-11-26T09:26:36Z)
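As a rough illustration of the DiffSLT idea above, here is a generic DDPM-style sampling loop that turns Gaussian noise into a latent vector conditioned on video features. The `denoiser` callable, the noise schedule, and all dimensions are assumptions for illustration rather than the authors' architecture; DiffSLT-P would additionally condition on pseudo-gloss features.

```python
import torch

@torch.no_grad()
def sample_text_latent(denoiser, video_feats, steps=50, latent_dim=512):
    """DDPM-style ancestral sampling sketch: start from Gaussian noise and
    iteratively denoise it into a target text latent, conditioned on
    `video_feats`. `denoiser(z_t, t, cond)` is assumed to predict the
    noise component of z_t."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    z = torch.randn(video_feats.size(0), latent_dim)  # pure random noise
    for t in reversed(range(steps)):
        eps = denoiser(z, t, video_feats)              # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        mean = (z - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(z) if t > 0 else torch.zeros_like(z)
        z = mean + torch.sqrt(betas[t]) * noise        # one reverse step
    return z  # decoded into the spoken-language sentence downstream
```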
- C$^2$RL: Content and Context Representation Learning for Gloss-free Sign Language Translation and Retrieval [37.12863427950066]
We introduce an innovative pretraining paradigm for gloss-free sign language representation learning (SLRL), called C$^2$RL.
C$^2$RL improves the BLEU-4 score by +5.3 on P14T, +10.6 on CSL-Daily, +6.2 on OpenASL, and +1.3 on How2Sign.
It also boosts the R@1 score by +8.3 on P14T, +14.4 on CSL-Daily, and +5.9 on How2Sign.
arXiv Detail & Related papers (2024-08-19T12:42:10Z)
- Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining [56.26550923909137]
Gloss-Free Sign Language Translation (SLT) is a challenging task due to its cross-domain nature.
We propose a novel Gloss-Free SLT method based on Visual-Language Pretraining (GFSLT-VLP).
Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training with masked self-supervised learning to create pre-tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual Encoder and Text Decoder from the first stage (a rough sketch of the stage-(i) objective follows this entry).
arXiv Detail & Related papers (2023-07-27T10:59:18Z)
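As a hedged sketch of how the two stage-(i) pre-tasks described above could be combined, the snippet below adds a CLIP-style video-text contrastive term to a masked-sentence reconstruction term. The function name, tensor shapes, and the equal weighting of the two terms are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def vlp_pretrain_loss(visual_emb, text_emb, mlm_logits, mlm_labels,
                      temperature=0.07):
    """Stage-(i) style objective: CLIP-like contrastive alignment of
    (B, D) video/text embeddings plus masked-sentence restoration.
    mlm_logits: (B, L, V) token logits; mlm_labels: (B, L) with -100
    at unmasked positions."""
    v = F.normalize(visual_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                   # (B, B) similarities
    targets = torch.arange(v.size(0), device=v.device)
    clip_loss = (F.cross_entropy(logits, targets) +
                 F.cross_entropy(logits.T, targets)) / 2

    mlm_loss = F.cross_entropy(mlm_logits.flatten(0, 1),
                               mlm_labels.flatten(),
                               ignore_index=-100)    # restore masked words
    return clip_loss + mlm_loss
```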
- Gloss Attention for Gloss-free Sign Language Translation [60.633146518820325]
We show how gloss annotations make sign language translation easier.
We then propose gloss attention, which enables the model to keep its attention within video segments that have the same semantics locally (a loose illustration of this locality idea is sketched after this entry).
Experimental results on multiple large-scale sign language datasets show that our proposed GASLT model significantly outperforms existing methods.
arXiv Detail & Related papers (2023-07-14T14:07:55Z)
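The actual gloss attention learns where to attend; purely to illustrate the locality intuition, here is a fixed-window attention mask. The function name and window size are assumptions.

```python
import torch

def local_window_mask(seq_len: int, window: int = 8) -> torch.Tensor:
    """Boolean mask that keeps each frame's attention inside a local
    temporal window, mimicking the intuition that neighbouring frames
    tend to share the same sign semantics. True = masked out."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() > window

# Usage with PyTorch attention (True in attn_mask means "may attend"):
# out = torch.nn.functional.scaled_dot_product_attention(
#     q, k, v, attn_mask=~local_window_mask(q.size(-2)))
```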
- Gloss-Free End-to-End Sign Language Translation [59.28829048788345]
We design the Gloss-Free End-to-end sign language translation framework (GloFE).
Our method improves the performance of SLT in the gloss-free setting by exploiting the shared underlying semantics of signs and the corresponding spoken translation.
We obtained state-of-the-art results on large-scale datasets, including OpenASL and How2Sign.
arXiv Detail & Related papers (2023-05-22T09:57:43Z)
- Alleviating Over-smoothing for Unsupervised Sentence Representation [96.19497378628594]
We present a simple method named Self-Contrastive Learning (SSCL) to alleviate the over-smoothing of sentence representations.
Our proposed method is quite simple and can be easily extended to various state-of-the-art models for performance boosting.
arXiv Detail & Related papers (2023-05-09T11:00:02Z)
- Natural Language-Assisted Sign Language Recognition [28.64871971445024]
We propose the Natural Language-Assisted Sign Language Recognition framework.
It exploits semantic information contained in glosses (sign labels) to mitigate the problem of visually indistinguishable signs (VISigns) in sign languages.
Our method achieves state-of-the-art performance on three widely-adopted benchmarks: MSASL, WLASL, and NMFs-CSL.
arXiv Detail & Related papers (2023-03-21T17:59:57Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining, which replaces region regression and classification with cross-modality region contrastive learning (see the sketch after this entry).
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
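To make cross-modality region contrastive learning concrete, here is the same InfoNCE pattern as the earlier sketches, applied at region granularity. The pairing scheme (row i of each side describing the same region) and the temperature are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def region_text_infonce(region_feats, text_feats, temperature=0.1):
    """InfoNCE over matched region/text feature pairs. Both inputs are
    (N, D); row i of each side describes the same image region. Matched
    pairs are positives, all other rows in the batch act as negatives."""
    r = F.normalize(region_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = r @ t.T / temperature
    targets = torch.arange(r.size(0), device=r.device)
    return F.cross_entropy(logits, targets)
```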
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.