Automatic dense annotation of large-vocabulary sign language videos
- URL: http://arxiv.org/abs/2208.02802v1
- Date: Thu, 4 Aug 2022 17:55:09 GMT
- Title: Automatic dense annotation of large-vocabulary sign language videos
- Authors: Liliane Momeni, Hannah Bull, K R Prajwal, Samuel Albanie, Gül Varol, Andrew Zisserman
- Abstract summary: We propose a simple, scalable framework to vastly increase the density of automatic annotations.
We make these annotations publicly available to support the sign language research community.
- Score: 85.61513254261523
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, sign language researchers have turned to sign language interpreted
TV broadcasts, comprising (i) a video of continuous signing and (ii) subtitles
corresponding to the audio content, as a readily available and large-scale
source of training data. One key challenge in the usability of such data is the
lack of sign annotations. Previous work exploiting such weakly-aligned data
only found sparse correspondences between keywords in the subtitle and
individual signs. In this work, we propose a simple, scalable framework to
vastly increase the density of automatic annotations. Our contributions are the
following: (1) we significantly improve previous annotation methods by making
use of synonyms and subtitle-signing alignment; (2) we show the value of
pseudo-labelling from a sign recognition model as a way of sign spotting; (3)
we propose a novel approach for increasing our annotations of known and unknown
classes based on in-domain exemplars; (4) on the BOBSL BSL sign language
corpus, we increase the number of confident automatic annotations from 670K to
5M. We make these annotations publicly available to support the sign language
research community.
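As a reading aid, the snippet below sketches two of the ideas mentioned in the abstract: expanding subtitle keywords with synonyms, and spotting signs via pseudo-labels from a sign recognition model inside subtitle-aligned windows. It is a minimal illustration under assumed data structures (per-frame class probabilities, a synonym dictionary, a fixed confidence threshold), not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of two ideas from the abstract:
# (1) expanding subtitle keywords with synonyms and (2) using per-frame
# pseudo-labels from a sign recognition model to spot signs inside the
# temporal window a subtitle has been aligned to. All names, thresholds
# and data structures are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Spotting:
    word: str      # vocabulary word (sign class) that was spotted
    frame: int     # frame index of the spotting
    score: float   # confidence of the pseudo-label

def spot_signs_in_subtitle(
    subtitle_words,        # list[str]: content words of one subtitle
    window,                # (start_frame, end_frame): aligned signing window
    frame_probs,           # dict[int, dict[str, float]]: per-frame class probs
    synonyms,              # dict[str, set[str]]: word -> synonym set
    threshold=0.7,         # min pseudo-label confidence (assumed value)
):
    """Return confident spottings whose pseudo-label matches a subtitle word
    or one of its synonyms, restricted to the subtitle-aligned window."""
    # Build the query set: subtitle words plus their synonyms.
    queries = set(subtitle_words)
    for w in subtitle_words:
        queries |= synonyms.get(w, set())

    spottings = []
    start, end = window
    for frame in range(start, end + 1):
        for word, prob in frame_probs.get(frame, {}).items():
            if word in queries and prob >= threshold:
                spottings.append(Spotting(word, frame, prob))
    return spottings
```

In the paper's framework these spottings are further expanded (e.g. to unknown classes) using in-domain exemplars; the sketch above only covers the synonym and pseudo-labelling steps.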
Related papers
- Gloss Alignment Using Word Embeddings [40.100782464872076]
We propose a method for aligning spottings with their corresponding subtitles using large spoken language models.
We quantitatively demonstrate the effectiveness of our method on the MeineDGS and BOBSL datasets.
arXiv Detail & Related papers (2023-08-08T13:26:53Z)
- Weakly-supervised Fingerspelling Recognition in British Sign Language Videos [85.61513254261523]
Previous fingerspelling recognition methods have not focused on British Sign Language (BSL).
In contrast to previous methods, our method only uses weak annotations from subtitles for training.
We propose a Transformer architecture adapted to this task, with a multiple-hypothesis CTC loss function to learn from alternative annotation possibilities (a generic sketch of such a loss appears after this list).
arXiv Detail & Related papers (2022-11-16T15:02:36Z)
- Sign Language Video Retrieval with Free-Form Textual Queries [19.29003565494735]
We introduce the task of sign language retrieval with free-form textual queries.
The objective is to find the signing video in the collection that best matches the written query.
We propose SPOT-ALIGN, a framework for interleaving iterative rounds of sign spotting and feature alignment to expand the scope and scale of available training data.
arXiv Detail & Related papers (2022-01-07T15:22:18Z)
- Read and Attend: Temporal Localisation in Sign Language Videos [84.30262812057994]
We train a Transformer model to ingest a continuous signing stream and output a sequence of written tokens.
We show that it acquires the ability to attend to a large vocabulary of sign instances in the input sequence, enabling their localisation.
arXiv Detail & Related papers (2021-03-30T16:39:53Z)
- Watch, read and lookup: learning to spot signs from multiple supervisors [99.50956498009094]
Given a video of an isolated sign, our task is to identify whether and where it has been signed in a continuous, co-articulated sign language video.
We train a model using multiple types of available supervision by: (1) watching existing sparsely labelled footage; (2) reading associated subtitles which provide additional weak-supervision; and (3) looking up words in visual sign language dictionaries.
These three tasks are integrated into a unified learning framework using the principles of Noise Contrastive Estimation and Multiple Instance Learning.
arXiv Detail & Related papers (2020-10-08T14:12:56Z)
- BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues [106.21067543021887]
We show how to use mouthing cues from signers to obtain high-quality annotations from video data.
The BSL-1K dataset is a collection of British Sign Language (BSL) signs of unprecedented scale.
arXiv Detail & Related papers (2020-07-23T16:59:01Z)
- Transferring Cross-domain Knowledge for Video Sign Language Recognition [103.9216648495958]
Word-level sign language recognition (WSLR) is a fundamental task in sign language interpretation.
We propose a novel method that learns domain-invariant visual concepts and fertilizes WSLR models by transferring knowledge of subtitled news sign to them.
arXiv Detail & Related papers (2020-03-08T03:05:21Z)
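The Weakly-supervised Fingerspelling Recognition entry above mentions a multiple-hypothesis CTC loss for learning from alternative annotation possibilities. The sketch below shows one generic way to realise such an objective: compute a standard CTC loss for each alternative target sequence and keep the lowest one. The shapes, the per-clip formulation, and the hard-minimum aggregation are assumptions for illustration, not the paper's exact loss.

```python
# Illustrative sketch of a "multiple-hypothesis" CTC objective: compute a
# standard CTC loss for each alternative annotation of a clip and keep the
# best (lowest) one. This is a generic construction, not the paper's
# implementation; shapes and the min-aggregation are assumptions.
import torch
import torch.nn.functional as F

def multi_hypothesis_ctc_loss(log_probs, hypotheses, input_length, blank=0):
    """
    log_probs:    (T, C) log-softmax outputs for one clip (T frames, C classes).
    hypotheses:   list of 1-D LongTensors, each an alternative target sequence.
    input_length: number of valid input frames (<= T).
    Returns the minimum CTC loss over the hypotheses.
    """
    losses = []
    for target in hypotheses:
        loss = F.ctc_loss(
            log_probs.unsqueeze(1),                 # (T, N=1, C)
            target.unsqueeze(0),                    # (N=1, S)
            torch.tensor([input_length]),           # input lengths
            torch.tensor([target.numel()]),         # target lengths
            blank=blank,
            reduction="none",
            zero_infinity=True,
        )
        losses.append(loss)
    return torch.stack(losses).min()
```

A softer alternative would aggregate the hypotheses with a logsumexp instead of a hard minimum, so that every hypothesis receives some gradient; the summary above does not specify which aggregation the paper uses.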
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.