Sign Language Video Retrieval with Free-Form Textual Queries
- URL: http://arxiv.org/abs/2201.02495v1
- Date: Fri, 7 Jan 2022 15:22:18 GMT
- Title: Sign Language Video Retrieval with Free-Form Textual Queries
- Authors: Amanda Duarte, Samuel Albanie, Xavier Giró-i-Nieto, Gül Varol
- Abstract summary: We introduce the task of sign language retrieval with free-form textual queries.
The objective is to find the signing video in the collection that best matches the written query.
We propose SPOT-ALIGN, a framework for interleaving iterative rounds of sign spotting and feature alignment to expand the scope and scale of available training data.
- Score: 19.29003565494735
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Systems that can efficiently search collections of sign language videos have
been highlighted as a useful application of sign language technology. However,
the problem of searching videos beyond individual keywords has received limited
attention in the literature. To address this gap, in this work we introduce the
task of sign language retrieval with free-form textual queries: given a written
query (e.g., a sentence) and a large collection of sign language videos, the
objective is to find the signing video in the collection that best matches the
written query. We propose to tackle this task by learning cross-modal
embeddings on the recently introduced large-scale How2Sign dataset of American
Sign Language (ASL). We identify that a key bottleneck in the performance of
the system is the quality of the sign video embedding which suffers from a
scarcity of labeled training data. We, therefore, propose SPOT-ALIGN, a
framework for interleaving iterative rounds of sign spotting and feature
alignment to expand the scope and scale of available training data. We validate
the effectiveness of SPOT-ALIGN for learning a robust sign video embedding
through improvements in both sign recognition and the proposed video retrieval
task.
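
To make the retrieval setup concrete, here is a minimal sketch of cross-modal text-to-sign-video retrieval in PyTorch: both modalities are projected into a shared embedding space, trained with a symmetric InfoNCE-style contrastive loss, and videos are ranked by cosine similarity to the query. The module names, feature dimensions, temperature and loss choice are illustrative assumptions, not the SPOT-ALIGN implementation.

# Minimal sketch of cross-modal text-to-sign-video retrieval.
# All names, dimensions and the loss choice are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RetrievalModel(nn.Module):
    def __init__(self, text_dim=768, video_dim=1024, embed_dim=512):
        super().__init__()
        # Project pre-extracted text and sign-video features into a shared space.
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.video_proj = nn.Linear(video_dim, embed_dim)

    def forward(self, text_feats, video_feats):
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        return t, v

def contrastive_loss(t, v, temperature=0.07):
    # Symmetric InfoNCE-style loss over a batch of matched (text, video) pairs.
    logits = t @ v.T / temperature
    targets = torch.arange(t.size(0), device=t.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

def rank_videos(model, query_feat, collection_feats):
    # Rank the whole collection by cosine similarity to a single text query.
    with torch.no_grad():
        t, v = model(query_feat.unsqueeze(0), collection_feats)
        scores = (t @ v.T).squeeze(0)
    return scores.argsort(descending=True)

At test time, the top-ranked video is returned as the best match for the written query; recall@K over a held-out set is the standard way such a retrieval model is evaluated.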
Related papers
- SLVideo: A Sign Language Video Moment Retrieval Framework [6.782143030167946]
SLVideo is a video moment retrieval system for Sign Language videos.
It extracts embedding representations for the hand and face signs from video frames to capture the signs in their entirety.
A collection of eight hours of annotated Portuguese Sign Language videos is used as the dataset.
arXiv Detail & Related papers (2024-07-22T14:29:36Z)
- A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision [74.972172804514]
We introduce a multi-task Transformer model, CSLR2, that ingests a signing sequence and produces outputs in a joint embedding space shared by signed language and spoken language text.
New dataset annotations provide continuous sign-level labels for six hours of test videos, and will be made publicly available.
Our model significantly outperforms the previous state of the art on both tasks.
arXiv Detail & Related papers (2024-05-16T17:19:06Z)
- DiffSLVA: Harnessing Diffusion Models for Sign Language Video Anonymization [33.18321022815901]
We introduce DiffSLVA, a novel methodology for text-guided sign language video anonymization.
We develop a specialized module dedicated to capturing facial expressions, which are critical for conveying linguistic information in signed languages.
This innovative methodology makes possible, for the first time, sign language video anonymization that could be used for real-world applications.
arXiv Detail & Related papers (2023-11-27T18:26:19Z)
- CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning [38.83062453145388]
Sign language retrieval consists of two sub-tasks: text-to-sign-video (T2V) retrieval and sign-video-to-text (V2T) retrieval.
We take into account the linguistic properties of both sign languages and natural languages, and simultaneously identify the fine-grained cross-lingual mappings.
Our framework outperforms the pioneering method by large margins on various datasets.
arXiv Detail & Related papers (2023-03-22T17:59:59Z)
- Automatic dense annotation of large-vocabulary sign language videos [85.61513254261523]
We propose a simple, scalable framework to vastly increase the density of automatic annotations.
We make these annotations publicly available to support the sign language research community.
arXiv Detail & Related papers (2022-08-04T17:55:09Z)
- Scaling up sign spotting through sign language dictionaries [99.50956498009094]
The focus of this work is sign spotting - given a video of an isolated sign, our task is to identify whether and where it has been signed in a continuous, co-articulated sign language video (a minimal sketch of this whether-and-where setup appears after this list).
We train a model using multiple types of available supervision by: (1) watching existing footage which is sparsely labelled using mouthing cues; (2) reading associated subtitles which provide additional translations of the signed content.
We validate the effectiveness of our approach on low-shot sign spotting benchmarks.
arXiv Detail & Related papers (2022-05-09T10:00:03Z)
- Read and Attend: Temporal Localisation in Sign Language Videos [84.30262812057994]
We train a Transformer model to ingest a continuous signing stream and output a sequence of written tokens.
We show that it acquires the ability to attend to a large vocabulary of sign instances in the input sequence, enabling their localisation.
arXiv Detail & Related papers (2021-03-30T16:39:53Z)
- Watch, read and lookup: learning to spot signs from multiple supervisors [99.50956498009094]
Given a video of an isolated sign, our task is to identify whether and where it has been signed in a continuous, co-articulated sign language video.
We train a model using multiple types of available supervision by: (1) watching existing sparsely labelled footage; (2) reading associated subtitles which provide additional weak-supervision; and (3) looking up words in visual sign language dictionaries.
These three tasks are integrated into a unified learning framework using the principles of Noise Contrastive Estimation and Multiple Instance Learning.
arXiv Detail & Related papers (2020-10-08T14:12:56Z)
- BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues [106.21067543021887]
We show how to use mouthing cues from signers to obtain high-quality annotations from video data.
The BSL-1K dataset is a collection of British Sign Language (BSL) signs of unprecedented scale.
arXiv Detail & Related papers (2020-07-23T16:59:01Z)
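
Several of the related papers above perform sign spotting, i.e. deciding whether and where an isolated sign occurs in continuous signing. As referenced in the corresponding entry, the sketch below reduces that idea to sliding-window embedding similarity; the inputs, window granularity and threshold are hypothetical and do not reproduce any of the cited methods.

# Illustrative sign spotting by sliding-window embedding similarity.
# Assumes embeddings have already been extracted; all parameters are hypothetical.
import numpy as np

def spot_sign(exemplar, window_embeds, threshold=0.7):
    # exemplar:      (D,) L2-normalised embedding of the isolated sign
    # window_embeds: (num_windows, D) L2-normalised embeddings of sliding
    #                windows over the continuous, co-articulated video
    scores = window_embeds @ exemplar        # cosine similarity per window
    best = int(np.argmax(scores))
    found = bool(scores[best] >= threshold)  # "whether" the sign is present
    return found, best                       # "where": index of the best window

In the cited works the embeddings themselves are learned, for example with Noise Contrastive Estimation and Multiple Instance Learning from mouthing cues, subtitles and dictionaries, rather than fixed as assumed here.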
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.