Contrastive Pretraining with Dual Visual Encoders for Gloss-Free Sign Language Translation
- URL: http://arxiv.org/abs/2507.10306v1
- Date: Mon, 14 Jul 2025 14:09:36 GMT
- Title: Contrastive Pretraining with Dual Visual Encoders for Gloss-Free Sign Language Translation
- Authors: Ozge Mercanoglu Sincan, Richard Bowden
- Abstract summary: Sign Language Translation (SLT) aims to convert sign language videos into spoken or written text. We propose a two-phase, dual visual encoder framework for gloss-free SLT.
- Score: 33.48154010885497
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Sign Language Translation (SLT) aims to convert sign language videos into spoken or written text. While early systems relied on gloss annotations as an intermediate supervision, such annotations are costly to obtain and often fail to capture the full complexity of continuous signing. In this work, we propose a two-phase, dual visual encoder framework for gloss-free SLT, leveraging contrastive visual-language pretraining. During pretraining, our approach employs two complementary visual backbones whose outputs are jointly aligned with each other and with sentence-level text embeddings via a contrastive objective. During the downstream SLT task, we fuse the visual features and input them into an encoder-decoder model. On the Phoenix-2014T benchmark, our dual encoder architecture consistently outperforms its single stream variants and achieves the highest BLEU-4 score among existing gloss-free SLT approaches.
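As a rough illustration of the pretraining objective described in the abstract, the sketch below aligns two visual streams with each other and with sentence-level text embeddings using a symmetric InfoNCE-style contrastive loss. The module names, projection heads, temporal pooling, and loss form are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(x, y, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings paired by index."""
    x, y = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature                       # (B, B) similarity matrix
    targets = torch.arange(x.size(0), device=x.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

class DualEncoderPretraining(nn.Module):
    """Two complementary visual backbones aligned with each other and with text (illustrative)."""
    def __init__(self, backbone_a, backbone_b, text_encoder, dim_a, dim_b, dim_t, dim=512):
        super().__init__()
        self.backbone_a, self.backbone_b = backbone_a, backbone_b
        self.text_encoder = text_encoder
        self.proj_a = nn.Linear(dim_a, dim)
        self.proj_b = nn.Linear(dim_b, dim)
        self.proj_t = nn.Linear(dim_t, dim)

    def forward(self, video, text_ids):
        # Each backbone is assumed to return per-frame features (B, T, dim_*); mean-pool over time.
        z_a = self.proj_a(self.backbone_a(video).mean(dim=1))
        z_b = self.proj_b(self.backbone_b(video).mean(dim=1))
        z_t = self.proj_t(self.text_encoder(text_ids))      # sentence-level text embedding (B, dim_t)
        # Align the two visual streams with each other, and each of them with the text.
        return info_nce(z_a, z_b) + info_nce(z_a, z_t) + info_nce(z_b, z_t)
```

For the downstream SLT stage, the abstract states that the two visual feature streams are fused and fed to an encoder-decoder translation model; concatenation followed by a linear projection would be one plausible way to realise that fusion.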
Related papers
- SAGE: Segment-Aware Gloss-Free Encoding for Token-Efficient Sign Language Translation [29.79050316749927]
We propose a segment-aware visual tokenization framework to convert continuous video into discrete, sign-informed visual tokens. This reduces input sequence length by up to 50% compared to prior methods, resulting in up to 2.67x lower memory usage. Our approach notably exceeds the performance of state-of-the-art methods on the PHOENIX14T benchmark.
arXiv Detail & Related papers (2025-07-12T12:18:34Z)
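To make the token-efficiency idea in the SAGE entry above concrete, here is a minimal, hypothetical sketch of segment-aware pooling: per-frame features are collapsed into one visual token per detected sign segment. The segment boundaries and mean-pooling are illustrative choices, not necessarily SAGE's actual tokenizer.

```python
import torch

def segment_pool(frame_features, boundaries):
    """Pool per-frame features into one token per sign segment (illustrative).

    frame_features: (T, D) tensor of per-frame visual features.
    boundaries: list of (start, end) frame indices, one pair per detected segment.
    Returns: (num_segments, D) tensor of segment-level visual tokens.
    """
    tokens = [frame_features[s:e].mean(dim=0) for s, e in boundaries]
    return torch.stack(tokens, dim=0)

# Example: 100 frames collapsed into 4 segment tokens, a much shorter input sequence.
feats = torch.randn(100, 512)
segments = [(0, 30), (30, 55), (55, 80), (80, 100)]
print(segment_pool(feats, segments).shape)  # torch.Size([4, 512])
```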
- VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models [57.2662376527586]
VScan is a two-stage visual token reduction framework. It addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model. VScan achieves a 2.91x speedup in prefilling and a 10x reduction in FLOPs, while retaining 95.4% of the original performance.
arXiv Detail & Related papers (2025-05-28T17:59:08Z)
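As a loose illustration of similarity-based token merging during visual encoding (mentioned in the VScan entry above), the sketch below greedily averages the most redundant adjacent token pair until a target length is reached. The merging rule is a common heuristic chosen for clarity and is not claimed to be VScan's exact procedure.

```python
import torch
import torch.nn.functional as F

def merge_similar_tokens(tokens, keep_ratio=0.5):
    """Greedily merge each most-similar adjacent token pair until only
    keep_ratio of the original tokens remain (illustrative heuristic)."""
    n_keep = max(1, int(tokens.size(0) * keep_ratio))
    while tokens.size(0) > n_keep:
        sims = F.cosine_similarity(tokens[:-1], tokens[1:], dim=-1)  # adjacent-pair similarity
        i = int(sims.argmax())                                       # most redundant pair
        merged = (tokens[i] + tokens[i + 1]) / 2
        tokens = torch.cat([tokens[:i], merged.unsqueeze(0), tokens[i + 2:]], dim=0)
    return tokens

visual_tokens = torch.randn(576, 1024)            # e.g. a ViT patch-token sequence
print(merge_similar_tokens(visual_tokens).shape)  # torch.Size([288, 1024])
```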
- Bridging Sign and Spoken Languages: Pseudo Gloss Generation for Sign Language Translation [48.20483623444857]
Sign Language Translation aims to map sign language videos to spoken language text. A common approach relies on gloss annotations as an intermediate representation. We propose a gloss-free pseudo gloss generation framework that eliminates the need for human-annotated glosses.
arXiv Detail & Related papers (2025-05-21T12:19:55Z)
- CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions [31.624782806591682]
We introduce two simple yet effective designs to better leverage richly described synthetic captions.
First, we observe a strong inverse effect in learning with synthetic captions.
Second, we incorporate an autoregressive captioner to mimic the recaptioning process.
arXiv Detail & Related papers (2024-11-25T18:49:02Z)
- A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision [74.972172804514]
We introduce a multi-task Transformer model, CSLR2, that ingests a signing sequence and produces outputs in a joint embedding space shared between signed language and spoken language text.
New dataset annotations provide continuous sign-level annotations for six hours of test videos, and will be made publicly available.
Our model significantly outperforms the previous state of the art on both tasks.
arXiv Detail & Related papers (2024-05-16T17:19:06Z)
- SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by Visual-Textual Contrastive Learning [51.800031281177105]
SignVTCL is a continuous sign language recognition framework enhanced by visual-textual contrastive learning.
It integrates multi-modal data (video, keypoints, and optical flow) simultaneously to train a unified visual backbone.
It achieves state-of-the-art results compared with previous methods.
arXiv Detail & Related papers (2024-01-22T11:04:55Z)
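A small, assumption-laden sketch of how the three modality streams named in the SignVTCL entry above (video, keypoints, optical flow) could be projected to a shared width and fused into a single visual representation; the projection sizes and concatenation-based fusion are placeholders, not the paper's actual backbone.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Project three modality streams to a shared width and fuse them (illustrative)."""
    def __init__(self, dims, hidden=512):
        super().__init__()
        # One projection per modality: video, keypoints, optical flow.
        self.projs = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        self.fuse = nn.Linear(3 * hidden, hidden)

    def forward(self, video_feat, keypoint_feat, flow_feat):
        projected = [p(x) for p, x in zip(self.projs, (video_feat, keypoint_feat, flow_feat))]
        return self.fuse(torch.cat(projected, dim=-1))   # (B, T, hidden)

fusion = MultiModalFusion(dims=(1024, 256, 512))
out = fusion(torch.randn(2, 64, 1024), torch.randn(2, 64, 256), torch.randn(2, 64, 512))
print(out.shape)  # torch.Size([2, 64, 512])
```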
- Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining [56.26550923909137]
Gloss-Free Sign Language Translation (SLT) is a challenging task due to its cross-domain nature.
We propose a novel Gloss-Free SLT approach based on Visual-Language Pretraining (GFSLT-VLP).
Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training with masked self-supervised learning to create pre-tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual Encoder and Text Decoder from the first stage.
arXiv Detail & Related papers (2023-07-27T10:59:18Z)
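The second stage described in the entry above inherits weights from the pretraining stage. The sketch below shows one plausible way to wire that up; the Transformer sizes, the checkpoint filenames, and the GlossFreeSLT wrapper are hypothetical and only illustrate the parameter-inheritance pattern.

```python
import torch
import torch.nn as nn

# Stage (i) is assumed to have trained a visual encoder and a text decoder with
# contrastive and masked-reconstruction objectives; stage (ii) reuses their weights.
visual_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=6)
text_decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=6)

# Hypothetical checkpoints produced by the pretraining stage.
visual_encoder.load_state_dict(torch.load("vlp_visual_encoder.pt"))
text_decoder.load_state_dict(torch.load("vlp_text_decoder.pt"))

class GlossFreeSLT(nn.Module):
    """End-to-end encoder-decoder translation model built from the pretrained parts."""
    def __init__(self, encoder, decoder, vocab_size=30000, dim=512):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        self.embed = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, visual_tokens, target_ids):
        memory = self.encoder(visual_tokens)                   # (B, T, dim) visual sequence
        hidden = self.decoder(self.embed(target_ids), memory)  # (B, L, dim) decoder states
        return self.lm_head(hidden)                            # token logits

model = GlossFreeSLT(visual_encoder, text_decoder)
```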
- Two-Stream Network for Sign Language Recognition and Translation [38.43767031555092]
We introduce a dual visual encoder containing two separate streams to model both the raw videos and the keypoint sequences.
The resulting model is called TwoStream-SLR, which is competent for sign language recognition.
TwoStream-SLR is extended to a sign language translation model, TwoStream-SLT, by simply attaching an extra translation network.
arXiv Detail & Related papers (2022-11-02T17:59:58Z)
- Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network [99.03895740754402]
We propose a two-stream decoupled encoder-decoder design, in which the cross-modal encoder and decoder form two separate, decoupled streams.
We further propose a scheduled sampling strategy that mitigates the discrepancy between training and inference by pretraining the encoder-decoder in a two-pass manner.
arXiv Detail & Related papers (2021-01-27T17:36:57Z)
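To make the two-pass idea in the entry above concrete, here is a generic sketch of two-pass scheduled sampling for an encoder-decoder; the decoder interface and the per-token replacement rule are simplifying assumptions rather than the paper's exact procedure.

```python
import torch

def scheduled_sampling_step(decoder, memory, target_ids, sample_prob):
    """Two-pass scheduled sampling for an encoder-decoder (illustrative).

    decoder(input_ids, memory) -> logits of shape (B, L, vocab).
    target_ids: (B, L) ground-truth tokens, with position 0 being BOS.
    sample_prob: probability of feeding a model prediction instead of ground truth.
    """
    # Pass 1: teacher forcing, only to obtain the model's own predictions.
    with torch.no_grad():
        logits = decoder(target_ids, memory)
        pred_next = logits.argmax(dim=-1)        # prediction of token t+1 at position t

    # Pass 2: feed a mix of ground-truth tokens and first-pass predictions,
    # so training-time decoder inputs better match inference-time inputs.
    mixed = target_ids.clone()
    replace = torch.rand(target_ids[:, 1:].shape, device=target_ids.device) < sample_prob
    mixed[:, 1:] = torch.where(replace, pred_next[:, :-1], target_ids[:, 1:])
    return decoder(mixed, memory)                # the training loss is computed on these logits
```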
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.