Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing
- URL: http://arxiv.org/abs/2512.08094v1
- Date: Mon, 08 Dec 2025 23:07:48 GMT
- Title: Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing
- Authors: Zifan Jiang, Youngjoon Jang, Liliane Momeni, Gül Varol, Sarah Ebling, Andrew Zisserman
- Abstract summary: Segment, Embed, and Align (SEA) provides a single framework that works across multiple languages and domains. SEA leverages two pretrained models: the first to segment a video frame sequence into individual signs and the second to embed the video clip of each sign into a shared latent space with text. Alignment is performed with a lightweight dynamic programming procedure that runs efficiently on CPUs within a minute, even for hour-long episodes.
- Score: 60.9289697082021
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of this work is to develop a universal approach for aligning subtitles (i.e., spoken language text with corresponding timestamps) to continuous sign language videos. Prior approaches typically rely on end-to-end training tied to a specific language or dataset, which limits their generality. In contrast, our method Segment, Embed, and Align (SEA) provides a single framework that works across multiple languages and domains. SEA leverages two pretrained models: the first to segment a video frame sequence into individual signs and the second to embed the video clip of each sign into a shared latent space with text. Alignment is subsequently performed with a lightweight dynamic programming procedure that runs efficiently on CPUs within a minute, even for hour-long episodes. SEA is flexible and can adapt to a wide range of scenarios, utilizing resources from small lexicons to large continuous corpora. Experiments on four sign language datasets demonstrate state-of-the-art alignment performance, highlighting the potential of SEA to generate high-quality parallel data for advancing sign language processing. SEA's code and models are openly available.
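To make the recipe concrete, below is a minimal sketch of the kind of monotonic dynamic-programming alignment the abstract describes: each segmented sign is assigned to one subtitle so that cumulative sign-text similarity is maximised while temporal order is preserved. The cosine scoring, array shapes, and function name are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def align_signs_to_subtitles(sign_emb, sub_emb):
    """Assign each segmented sign to one subtitle with a monotonic DP.

    sign_emb: (S, D) array, one embedding per sign segment, in temporal order.
    sub_emb:  (T, D) array, one embedding per subtitle, in temporal order.
    Returns a length-S list giving the subtitle index of each sign.
    Simplifying assumptions: S >= T, the first sign starts in the first
    subtitle, and the last sign ends in the last subtitle.
    """
    # Cosine similarity between every sign and every subtitle.
    sign_n = sign_emb / np.linalg.norm(sign_emb, axis=1, keepdims=True)
    sub_n = sub_emb / np.linalg.norm(sub_emb, axis=1, keepdims=True)
    sim = sign_n @ sub_n.T                              # (S, T)

    num_signs, num_subs = sim.shape
    dp = np.full((num_signs, num_subs), -np.inf)        # best cumulative score
    back = np.zeros((num_signs, num_subs), dtype=int)   # previous subtitle index
    dp[0, 0] = sim[0, 0]
    for i in range(1, num_signs):
        for j in range(num_subs):
            # A sign either stays in the same subtitle as the previous sign
            # or advances to the next one (monotonic, no jumping back).
            stay = dp[i - 1, j]
            advance = dp[i - 1, j - 1] if j > 0 else -np.inf
            if stay >= advance:
                dp[i, j], back[i, j] = stay + sim[i, j], j
            else:
                dp[i, j], back[i, j] = advance + sim[i, j], j - 1

    # Trace back the best monotonic assignment.
    assignment = [num_subs - 1]
    for i in range(num_signs - 1, 0, -1):
        assignment.append(back[i, assignment[-1]])
    return assignment[::-1]
```

Subtitle timestamps would then be read off from the frame boundaries of the signs assigned to each subtitle.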
Related papers
- Lost in Translation, Found in Embeddings: Sign Language Translation and Alignment [84.39962912136525]
We develop a model for sign language understanding that performs sign language translation (SLT) and sign-subtitle alignment (SSA). Our approach is built upon three components: (i) a lightweight visual backbone that captures manual and non-manual cues from human keypoints and lip-region images; (ii) a Sliding Perceiver mapping network that aggregates consecutive visual features into word-level embeddings; and (iii) a multi-task scalable training strategy that jointly optimises SLT and SSA.
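As a rough illustration of aggregating consecutive visual features into word-level embeddings, the sketch below pools a sliding window of frame features with a learned query and cross-attention; the window size, dimensions, and layer choices are assumptions for illustration and do not reproduce the paper's Sliding Perceiver.

```python
import torch
import torch.nn as nn

class SlidingWindowAggregator(nn.Module):
    """Pool each sliding window of frame features into one word-level embedding
    via cross-attention with a single learned query (illustrative sketch)."""

    def __init__(self, dim=512, window=16, stride=8, heads=8):
        super().__init__()
        self.window, self.stride = window, stride
        self.query = nn.Parameter(torch.randn(1, 1, dim))           # learned query token
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frame_feats):
        # frame_feats: (T, dim) visual features for one video; assumes T >= window.
        outputs = []
        for start in range(0, frame_feats.size(0) - self.window + 1, self.stride):
            chunk = frame_feats[start:start + self.window].unsqueeze(0)  # (1, window, dim)
            word_emb, _ = self.attn(self.query, chunk, chunk)            # (1, 1, dim)
            outputs.append(word_emb.squeeze(0))
        return torch.cat(outputs, dim=0)                                 # (num_windows, dim)
```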
arXiv Detail & Related papers (2025-12-08T21:05:46Z)
- Seeing, Signing, and Saying: A Vision-Language Model-Assisted Pipeline for Sign Language Data Acquisition and Curation from Social Media [14.920401718793643]
Vision Language Models (VLMs) have demonstrated strong capabilities as evaluators and real-time assistants. We introduce the first automated annotation and filtering framework that utilizes VLMs to reduce reliance on manual effort while preserving data quality. Our method is applied to TikTok videos across eight sign languages and to the already curated YouTube-SL-25 dataset in German Sign Language.
arXiv Detail & Related papers (2025-10-29T11:29:56Z)
- InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models [37.43195217391341]
In this paper, we define the union of referring segmentation and reasoning segmentation at both the image and video levels as Instructed Visual Segmentation (IVS). Specifically, we employ an object-aware video perceiver to extract temporal and object information from reference frames, facilitating comprehensive video understanding. By leveraging multi-task and end-to-end training, InstructSeg demonstrates superior performance across diverse image and video segmentation tasks.
arXiv Detail & Related papers (2024-12-18T16:20:40Z)
- SignCLIP: Connecting Text and Sign Language by Contrastive Learning [39.72545568965546]
SignCLIP is an efficient method of learning useful visual representations for sign language processing from large-scale, multilingual video-text pairs.
We pretrain SignCLIP on Spreadthesign, a prominent sign language dictionary consisting of 500 thousand video clips in up to 44 sign languages.
We analyze the latent space formed by the spoken language text and sign language poses, which provides additional linguistic insights.
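For reference, the contrastive objective named in the title is typically a symmetric CLIP-style InfoNCE loss over paired video and text embeddings; the sketch below is a generic version of that loss, not SignCLIP's released implementation, and the temperature is a placeholder value.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired sign-video and text embeddings.

    video_emb, text_emb: (B, D) tensors for B video-text pairs.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature          # (B, B) similarities
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    # Matched pairs lie on the diagonal; both retrieval directions are supervised.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)
```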
arXiv Detail & Related papers (2024-07-01T13:17:35Z)
- T2S-GPT: Dynamic Vector Quantization for Autoregressive Sign Language Production from Text [59.57676466961787]
We propose a novel dynamic vector quantization (DVA-VAE) model that can adjust the encoding length based on the information density in sign language.
Experiments conducted on the PHOENIX14T dataset demonstrate the effectiveness of our proposed method.
We propose a new large German sign language dataset, PHOENIX-News, which contains 486 hours of sign language videos, audio, and transcription texts.
arXiv Detail & Related papers (2024-06-11T10:06:53Z)
- OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens.
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate that such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z)
- VidLA: Video-Language Alignment at Scale [48.665918882615195]
We propose VidLA, an approach for video-language alignment at scale.
Our proposed approach surpasses state-of-the-art methods on multiple retrieval benchmarks.
arXiv Detail & Related papers (2024-03-21T22:36:24Z)
- Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z)
- Aligning Subtitles in Sign Language Videos [80.20961722170655]
We train on manually annotated alignments covering over 15K subtitles that span 17.7 hours of video.
We use BERT subtitle embeddings and CNN video representations learned for sign recognition to encode the two signals.
Our model outputs frame-level predictions, i.e., for each video frame, whether it belongs to the queried subtitle or not.
arXiv Detail & Related papers (2021-05-06T17:59:36Z)
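The frame-level formulation in the entry above can be pictured as a per-frame binary classifier conditioned on the queried subtitle; the toy module below scores every frame as inside or outside that subtitle, with layer sizes and names that are hypothetical rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class FrameSubtitleAligner(nn.Module):
    """Toy per-frame classifier: given frame features and one subtitle (query)
    embedding, score each frame as belonging to that subtitle or not."""

    def __init__(self, video_dim=1024, text_dim=768, hidden=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        self.classifier = nn.Linear(2 * hidden, 1)

    def forward(self, frame_feats, subtitle_emb):
        # frame_feats: (T, video_dim) video features; subtitle_emb: (text_dim,) query.
        v = self.video_proj(frame_feats)                     # (T, hidden)
        t = self.text_proj(subtitle_emb).expand_as(v)        # broadcast the query to every frame
        logits = self.classifier(torch.cat([v, t], dim=-1))  # (T, 1)
        return logits.squeeze(-1)                            # per-frame in/out logits
```

Thresholding these per-frame logits yields the set of frames assigned to the queried subtitle, from which adjusted start and end timestamps can be read off.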