Aligning Subtitles in Sign Language Videos
- URL: http://arxiv.org/abs/2105.02877v1
- Date: Thu, 6 May 2021 17:59:36 GMT
- Title: Aligning Subtitles in Sign Language Videos
- Authors: Hannah Bull, Triantafyllos Afouras, Gül Varol, Samuel Albanie,
Liliane Momeni, Andrew Zisserman
- Abstract summary: We train on manually annotated alignments covering over 15K subtitles that span 17.7 hours of video.
We use BERT subtitle embeddings and CNN video representations learned for sign recognition to encode the two signals.
Our model outputs frame-level predictions, i.e., for each video frame, whether it belongs to the queried subtitle or not.
- Score: 80.20961722170655
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of this work is to temporally align asynchronous subtitles in sign
language videos. In particular, we focus on sign-language interpreted TV
broadcast data comprising (i) a video of continuous signing, and (ii) subtitles
corresponding to the audio content. Previous work exploiting such
weakly-aligned data only considered finding keyword-sign correspondences,
whereas we aim to localise a complete subtitle text in continuous signing. We
propose a Transformer architecture tailored for this task, which we train on
manually annotated alignments covering over 15K subtitles that span 17.7 hours
of video. We use BERT subtitle embeddings and CNN video representations learned
for sign recognition to encode the two signals, which interact through a series
of attention layers. Our model outputs frame-level predictions, i.e., for each
video frame, whether it belongs to the queried subtitle or not. Through
extensive evaluations, we show substantial improvements over existing alignment
baselines that do not make use of subtitle text embeddings for learning. Our
automatic alignment model opens up possibilities for advancing machine
translation of sign languages by providing continuously synchronized
video-text data.
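To make the described setup concrete, below is a minimal, self-contained sketch of a model of this kind: pre-extracted BERT subtitle token embeddings and CNN frame features interact through Transformer attention layers, and a per-frame sigmoid head predicts whether each frame belongs to the queried subtitle. All module names, dimensions, the single joint encoder, and the interval post-processing heuristic are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch of subtitle-to-video alignment: BERT-style subtitle
# embeddings and per-frame CNN video features interact through attention
# layers, and the model emits a per-frame probability that the frame belongs
# to the queried subtitle. Details are assumptions, not the paper's code.
import torch
import torch.nn as nn


class SubtitleVideoAligner(nn.Module):
    def __init__(self, video_dim=1024, text_dim=768, d_model=512,
                 n_heads=8, n_layers=4):
        super().__init__()
        # Project pre-extracted features into a shared model dimension.
        self.video_proj = nn.Linear(video_dim, d_model)  # CNN sign features
        self.text_proj = nn.Linear(text_dim, d_model)    # BERT token embeddings
        # Attention layers in which the two signals interact: here a single
        # Transformer encoder over the concatenated (text + video) sequence.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Frame-level binary classifier: does this frame belong to the subtitle?
        self.frame_head = nn.Linear(d_model, 1)

    def forward(self, video_feats, subtitle_tokens):
        # video_feats:     (B, T_frames, video_dim) pre-extracted CNN features
        # subtitle_tokens: (B, T_tokens, text_dim) pre-extracted BERT embeddings
        v = self.video_proj(video_feats)
        t = self.text_proj(subtitle_tokens)
        x = torch.cat([t, v], dim=1)         # joint sequence: text then frames
        x = self.encoder(x)
        frame_states = x[:, t.shape[1]:, :]  # keep only the video positions
        return torch.sigmoid(self.frame_head(frame_states)).squeeze(-1)


def frames_to_interval(frame_probs, fps=25.0, threshold=0.5):
    """Turn per-frame probabilities into a (start, end) time in seconds by
    taking the longest run of frames above the threshold. This is a simple
    heuristic for illustration, not necessarily the paper's post-processing."""
    active = (frame_probs > threshold).tolist()
    best, best_span, cur_start = None, 0, None
    for i, a in enumerate(active + [False]):
        if a and cur_start is None:
            cur_start = i
        elif not a and cur_start is not None:
            if i - cur_start > best_span:
                best, best_span = (cur_start, i), i - cur_start
            cur_start = None
    if best is None:
        return None
    return best[0] / fps, best[1] / fps


if __name__ == "__main__":
    model = SubtitleVideoAligner()
    video = torch.randn(1, 200, 1024)   # 200 frames (8 s at 25 fps)
    subtitle = torch.randn(1, 12, 768)  # 12 BERT token embeddings
    probs = model(video, subtitle)[0]   # (200,) per-frame probabilities
    print(frames_to_interval(probs))
```

In training, such a head would typically be supervised with a per-frame binary cross-entropy loss against the manually annotated alignments; the interval conversion above is only one simple way to read off a subtitle time span from the frame-level output.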
Related papers
- Learning text-to-video retrieval from image captioning [59.81537951811595] (2024-04-26)
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
- HowToCaption: Prompting LLMs to Transform Video Annotations at Scale [72.69268311756082] (2023-10-07)
We propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale.
We introduce a prompting method that is able to take into account a longer text of subtitles, allowing us to capture the contextual information beyond one single sentence.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
- Gloss Alignment Using Word Embeddings [40.100782464872076] (2023-08-08)
We propose a method for aligning spottings with their corresponding subtitles using large spoken language models.
We quantitatively demonstrate the effectiveness of our method on the mDGS and BOBSL datasets.
- Temporal Perceiving Video-Language Pre-training [112.1790287726804] (2023-01-18)
This work introduces a novel text-video localization pre-text task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
- Direct Speech Translation for Automatic Subtitling [17.095483965591267] (2022-09-27)
We propose the first direct ST model for automatic subtitling that generates subtitles in the target language along with their timestamps with a single model.
Our experiments on 7 language pairs show that our approach outperforms a cascade system in the same data condition.
- Automatic dense annotation of large-vocabulary sign language videos [85.61513254261523] (2022-08-04)
We propose a simple, scalable framework to vastly increase the density of automatic annotations.
We make these annotations publicly available to support the sign language research community.
- Between Flexibility and Consistency: Joint Generation of Captions and Subtitles [13.58711830450618] (2021-07-13)
Speech translation (ST) has lately received growing interest for the generation of subtitles without the need for an intermediate source language transcription and timing.
In this work, we focus on ST models which generate consistent captions-subtitles in terms of structure and lexical content.
Our findings show that joint decoding leads to increased performance and consistency between the generated captions and subtitles while still allowing for sufficient flexibility to produce subtitles conforming to language-specific needs and norms.
- Read and Attend: Temporal Localisation in Sign Language Videos [84.30262812057994] (2021-03-30)
We train a Transformer model to ingest a continuous signing stream and output a sequence of written tokens.
We show that it acquires the ability to attend to a large vocabulary of sign instances in the input sequence, enabling their localisation.