BSL-1K: Scaling up co-articulated sign language recognition using
mouthing cues
- URL: http://arxiv.org/abs/2007.12131v2
- Date: Wed, 13 Oct 2021 17:13:42 GMT
- Title: BSL-1K: Scaling up co-articulated sign language recognition using
mouthing cues
- Authors: Samuel Albanie and Gül Varol and Liliane Momeni and Triantafyllos
Afouras and Joon Son Chung and Neil Fox and Andrew Zisserman
- Abstract summary: We show how to use mouthing cues from signers to obtain high-quality annotations from video data.
The BSL-1K dataset is a collection of British Sign Language (BSL) signs of unprecedented scale.
- Score: 106.21067543021887
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress in fine-grained gesture and action classification, and
machine translation, points to the possibility of automated sign language
recognition becoming a reality. A key stumbling block in making progress
towards this goal is a lack of appropriate training data, stemming from the
high complexity of sign annotation and a limited supply of qualified
annotators. In this work, we introduce a new scalable approach to data
collection for sign recognition in continuous videos. We make use of
weakly-aligned subtitles for broadcast footage together with a keyword spotting
method to automatically localise sign-instances for a vocabulary of 1,000 signs
in 1,000 hours of video. We make the following contributions: (1) We show how
to use mouthing cues from signers to obtain high-quality annotations from video
data - the result is the BSL-1K dataset, a collection of British Sign Language
(BSL) signs of unprecedented scale; (2) We show that we can use BSL-1K to train
strong sign recognition models for co-articulated signs in BSL and that these
models additionally form excellent pretraining for other sign languages and
benchmarks - we exceed the state of the art on both the MSASL and WLASL
benchmarks. Finally, (3) we propose new large-scale evaluation sets for the
tasks of sign recognition and sign spotting and provide baselines which we hope
will serve to stimulate research in this area.
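To make the annotation pipeline concrete, the sketch below illustrates the general idea of subtitle-anchored keyword spotting: a mouthing-based keyword spotter is queried only for vocabulary words that appear in a weakly-aligned subtitle, and high-confidence peaks are kept as automatic sign annotations. This is a minimal sketch, not the paper's implementation; the spotter interface, padding window and threshold are illustrative assumptions.

```python
# Minimal sketch of subtitle-anchored keyword spotting for sign localisation.
# Assumes a hypothetical `mouthing_keyword_spotter` that returns, for each video
# frame in a window, the probability that the signer is mouthing a given keyword.
# Names, the padding window and the threshold are illustrative, not the paper's
# exact components or values.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Subtitle:
    text: str     # subtitle line from the broadcast footage
    start: float  # start time in seconds (only weakly aligned to the signing)
    end: float    # end time in seconds


def spot_signs(video_frames: np.ndarray,
               subtitles: List[Subtitle],
               vocabulary: set,
               mouthing_keyword_spotter,
               fps: float = 25.0,
               pad: float = 4.0,
               threshold: float = 0.8):
    """Return (keyword, time, confidence) triples where a vocabulary sign is likely signed."""
    annotations = []
    for sub in subtitles:
        # Subtitles are only weakly aligned, so search a padded window around them.
        lo = max(0, int((sub.start - pad) * fps))
        hi = min(len(video_frames), int((sub.end + pad) * fps))
        window = video_frames[lo:hi]
        # Only query keywords that actually occur in this subtitle.
        keywords = {w.lower().strip('.,!?') for w in sub.text.split()} & vocabulary
        for kw in keywords:
            probs = mouthing_keyword_spotter(window, kw)  # per-frame probabilities
            peak = int(np.argmax(probs))
            if probs[peak] >= threshold:
                annotations.append((kw, (lo + peak) / fps, float(probs[peak])))
    return annotations
```

Keeping only high-confidence peaks trades annotation density for precision, which is what makes this kind of automatic labelling usable as training data.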
Related papers
- Improving Continuous Sign Language Recognition with Cross-Lingual Signs [29.077175863743484]
We study the feasibility of utilizing multilingual sign language corpora to facilitate continuous sign language recognition.
We first build two sign language dictionaries containing isolated signs that appear in two datasets.
Then we identify the sign-to-sign mappings between two sign languages via a well-optimized isolated sign language recognition model.
arXiv Detail & Related papers (2023-08-21T15:58:47Z)
- Weakly-supervised Fingerspelling Recognition in British Sign Language Videos [85.61513254261523]
Previous fingerspelling recognition methods have not focused on British Sign Language (BSL).
In contrast to previous methods, our method only uses weak annotations from subtitles for training.
We propose a Transformer architecture adapted to this task, with a multiple-hypothesis CTC loss function to learn from alternative annotation possibilities.
arXiv Detail & Related papers (2022-11-16T15:02:36Z)
- Automatic dense annotation of large-vocabulary sign language videos [85.61513254261523]
We propose a simple, scalable framework to vastly increase the density of automatic annotations.
We make these annotations publicly available to support the sign language research community.
arXiv Detail & Related papers (2022-08-04T17:55:09Z)
- Scaling up sign spotting through sign language dictionaries [99.50956498009094]
The focus of this work is sign spotting - given a video of an isolated sign, our task is to identify whether and where it has been signed in a continuous, co-articulated sign language video.
We train a model using multiple types of available supervision by: (1) watching existing footage which is sparsely labelled using mouthing cues; (2) reading associated subtitles which provide additional translations of the signed content.
We validate the effectiveness of our approach on low-shot sign spotting benchmarks.
arXiv Detail & Related papers (2022-05-09T10:00:03Z)
- Read and Attend: Temporal Localisation in Sign Language Videos [84.30262812057994]
We train a Transformer model to ingest a continuous signing stream and output a sequence of written tokens.
We show that it acquires the ability to attend to a large vocabulary of sign instances in the input sequence, enabling their localisation.
arXiv Detail & Related papers (2021-03-30T16:39:53Z)
- Watch, read and lookup: learning to spot signs from multiple supervisors [99.50956498009094]
Given a video of an isolated sign, our task is to identify whether and where it has been signed in a continuous, co-articulated sign language video.
We train a model using multiple types of available supervision by: (1) watching existing sparsely labelled footage; (2) reading associated subtitles which provide additional weak-supervision; and (3) looking up words in visual sign language dictionaries.
These three tasks are integrated into a unified learning framework using the principles of Noise Contrastive Estimation and Multiple Instance Learning.
arXiv Detail & Related papers (2020-10-08T14:12:56Z)
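The last entry above combines Noise Contrastive Estimation with Multiple Instance Learning to match dictionary exemplars against continuous footage. Below is a minimal PyTorch sketch of a MIL-NCE-style objective in that spirit; the shapes, temperature and aggregation are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch of a MIL-NCE-style objective for sign spotting, in PyTorch.
# `bag` holds embeddings of candidate temporal windows from the continuous video
# (we do not know which window contains the sign -> multiple instance learning);
# `query` is the embedding of the isolated dictionary sign; `negatives` are
# windows from unrelated videos. Shapes and temperature are assumptions.
import torch
import torch.nn.functional as F


def mil_nce_loss(query: torch.Tensor,      # (d,) dictionary-sign embedding
                 bag: torch.Tensor,        # (m, d) candidate windows from the target video
                 negatives: torch.Tensor,  # (n, d) windows from other videos
                 temperature: float = 0.07) -> torch.Tensor:
    query = F.normalize(query, dim=-1)
    bag = F.normalize(bag, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_sim = bag @ query / temperature        # (m,) any window in the bag may match
    neg_sim = negatives @ query / temperature  # (n,)

    # MIL-NCE: the bag is positive if *at least one* window matches, so positives
    # are aggregated inside the softmax rather than committing to a single window.
    pos = torch.logsumexp(pos_sim, dim=0)
    denom = torch.logsumexp(torch.cat([pos_sim, neg_sim], dim=0), dim=0)
    return -(pos - denom)
```

Aggregating the positives inside the log-sum-exp lets any window in the bag account for the match, which is what makes this family of objectives tolerant to the unknown temporal location of the sign.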
This list is automatically generated from the titles and abstracts of the papers on this site.