Gloss Alignment Using Word Embeddings
- URL: http://arxiv.org/abs/2308.04248v1
- Date: Tue, 8 Aug 2023 13:26:53 GMT
- Title: Gloss Alignment Using Word Embeddings
- Authors: Harry Walsh, Ozge Mercanoglu Sincan, Ben Saunders, Richard Bowden
- Abstract summary: We propose a method for aligning spottings with their corresponding subtitles using large spoken language models.
We quantitatively demonstrate the effectiveness of our method on the MeineDGS (mDGS) and BBC-Oxford British Sign Language (BOBSL) datasets.
- Score: 40.100782464872076
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Capturing and annotating sign language datasets is a time-consuming and
costly process. Current datasets are orders of magnitude too small to
successfully train unconstrained sign language translation (SLT) models. As a result, research has
turned to TV broadcast content as a source of large-scale training data,
consisting of both the sign language interpreter and the associated audio
subtitle. However, lack of sign language annotation limits the usability of
this data and has led to the development of automatic annotation techniques
such as sign spotting. These spottings are aligned to the video rather than the
subtitle, which often results in a misalignment between the subtitle and
spotted signs. In this paper we propose a method for aligning spottings with
their corresponding subtitles using large spoken language models. Using a
single modality means our method is computationally inexpensive and can be
utilized in conjunction with existing alignment techniques. We quantitatively
demonstrate the effectiveness of our method on the MeineDGS (mDGS) and BBC-Oxford British Sign Language (BOBSL)
datasets, recovering up to a 33.22 BLEU-1 score in word alignment.
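The abstract does not spell out the matching procedure, so the following is a minimal sketch of how embedding-based gloss-to-subtitle alignment could look: each spotted gloss is compared against the words of the subtitles in a temporal window around the spotting and assigned to the best-scoring subtitle. The `embed` function, the candidate window, and the max-over-words scoring are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of embedding-based gloss-to-subtitle alignment.
# Assumes `embed(word)` returns a pretrained word vector (e.g. from a
# fastText-style model or a spoken language model's input embeddings).
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def align_spotting(gloss: str, candidate_subtitles: list[list[str]], embed) -> int:
    """Return the index of the candidate subtitle that best matches the gloss.

    `candidate_subtitles` holds tokenised subtitles from a temporal window
    around the spotting's video timestamp.
    """
    gloss_vec = embed(gloss.lower())
    scores = []
    for words in candidate_subtitles:
        # Score each subtitle by its best-matching word (a hypothetical choice;
        # averaging over words would be an equally plausible variant).
        scores.append(max(cosine(gloss_vec, embed(w.lower())) for w in words))
    return int(np.argmax(scores))
```

Because a matcher of this form operates on text alone, it is cheap to run and can be layered on top of video-based alignment methods, consistent with the abstract's single-modality claim.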
Related papers
- Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data [18.479220305684837]
Recent advances in image captioning allow us to pre-train high-quality video models without parallel video-text data.
We show that image captioning pseudolabels work better for pre-training than the existing HowTo100M ASR captions.
arXiv Detail & Related papers (2023-04-04T19:11:05Z)
- Automatic dense annotation of large-vocabulary sign language videos [85.61513254261523]
We propose a simple, scalable framework to vastly increase the density of automatic annotations.
We make these annotations publicly available to support the sign language research community.
arXiv Detail & Related papers (2022-08-04T17:55:09Z)
- Scaling up sign spotting through sign language dictionaries [99.50956498009094]
The focus of this work is sign spotting - given a video of an isolated sign, our task is to identify whether and where it has been signed in a continuous, co-articulated sign language video.
We train a model using multiple types of available supervision by: (1) watching existing footage which is sparsely labelled using mouthing cues; (2) reading associated subtitles which provide additional translations of the signed content.
We validate the effectiveness of our approach on low-shot sign spotting benchmarks.
arXiv Detail & Related papers (2022-05-09T10:00:03Z)
- Aligning Subtitles in Sign Language Videos [80.20961722170655]
We train on manually annotated alignments covering over 15K subtitles that span 17.7 hours of video.
We use BERT subtitle embeddings and CNN video representations learned for sign recognition to encode the two signals.
Our model outputs frame-level predictions, i.e., for each video frame, whether it belongs to the queried subtitle or not (a minimal sketch of this setup appears after this list).
arXiv Detail & Related papers (2021-05-06T17:59:36Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Watch, read and lookup: learning to spot signs from multiple supervisors [99.50956498009094]
Given a video of an isolated sign, our task is to identify whether and where it has been signed in a continuous, co-articulated sign language video.
We train a model using multiple types of available supervision by: (1) watching existing sparsely labelled footage; (2) reading associated subtitles which provide additional weak supervision; and (3) looking up words in visual sign language dictionaries.
These three tasks are integrated into a unified learning framework using the principles of Noise Contrastive Estimation and Multiple Instance Learning.
arXiv Detail & Related papers (2020-10-08T14:12:56Z)
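The "Aligning Subtitles in Sign Language Videos" entry above combines BERT subtitle embeddings with CNN video features and classifies every frame as belonging to the queried subtitle or not. The sketch below is an illustrative reconstruction of that setup, not the authors' code; the feature dimensions and the additive fusion are assumptions.

```python
# Illustrative sketch of frame-level subtitle alignment: fuse a subtitle
# embedding with per-frame video features and predict, per frame, whether
# the frame belongs to the queried subtitle. Dimensions and the fusion
# scheme are assumed, not taken from the paper.
import torch
import torch.nn as nn

class FrameSubtitleAligner(nn.Module):
    def __init__(self, sub_dim: int = 768, vid_dim: int = 1024, hidden: int = 512):
        super().__init__()
        self.sub_proj = nn.Linear(sub_dim, hidden)  # e.g. a BERT sentence embedding
        self.vid_proj = nn.Linear(vid_dim, hidden)  # e.g. CNN sign-recognition features
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(hidden, 1))  # per-frame logit

    def forward(self, subtitle_emb: torch.Tensor, frame_feats: torch.Tensor) -> torch.Tensor:
        # subtitle_emb: (B, sub_dim); frame_feats: (B, T, vid_dim)
        s = self.sub_proj(subtitle_emb).unsqueeze(1)   # (B, 1, hidden)
        v = self.vid_proj(frame_feats)                 # (B, T, hidden)
        logits = self.head(s + v).squeeze(-1)          # (B, T)
        return torch.sigmoid(logits)  # probability each frame belongs to the subtitle
```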