Looking for the Signs: Identifying Isolated Sign Instances in Continuous
Video Footage
- URL: http://arxiv.org/abs/2108.04229v1
- Date: Wed, 21 Jul 2021 12:49:44 GMT
- Title: Looking for the Signs: Identifying Isolated Sign Instances in Continuous
Video Footage
- Authors: Tao Jiang, Necati Cihan Camgoz, Richard Bowden
- Abstract summary: We propose a transformer-based network, called SignLookup, which uses 3D CNNs to extract spatio-temporal representations from video clips.
Our model achieves state-of-the-art performance on the sign spotting task with accuracy as high as 96% on challenging benchmark datasets.
- Score: 45.29710323525548
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we focus on the task of one-shot sign spotting, i.e. given an
example of an isolated sign (query), we want to identify whether/where this
sign appears in a continuous, co-articulated sign language video (target). To
achieve this goal, we propose a transformer-based network, called SignLookup.
We employ 3D Convolutional Neural Networks (CNNs) to extract spatio-temporal
representations from video clips. To solve the temporal scale discrepancies
between the query and the target videos, we construct multiple queries from a
single video clip using different frame-level strides. Self-attention is
applied across these query clips to simulate a continuous scale space. We also
utilize another self-attention module on the target video to learn the
contextual information within the sequence. Finally, a mutual-attention module is used
to match the temporal scales and localize the query within the target sequence. Extensive
experiments demonstrate that the proposed approach can not only reliably
identify isolated signs in continuous videos, regardless of the signers'
appearance, but can also generalize to different sign languages. By taking
advantage of the attention mechanism and the adaptive features, our model
achieves state-of-the-art performance on the sign spotting task with accuracy
as high as 96% on challenging benchmark datasets, significantly
outperforming other approaches.
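As a rough illustration of the pipeline described above (multi-stride query clips, self-attention over the resulting scale space, self-attention over the target, and mutual attention for frame-level localization), here is a minimal PyTorch sketch. The stand-in Conv3d encoder, the mean-pooling of each query clip, and all module sizes and hyperparameters are assumptions for illustration, not the authors' implementation.
```python
import torch
import torch.nn as nn

class SignSpotterSketch(nn.Module):
    """Illustrative query/target matching flow (not the SignLookup implementation)."""

    def __init__(self, dim=256, heads=4, strides=(1, 2, 3), clip_len=16):
        super().__init__()
        self.strides = strides
        self.clip_len = clip_len
        # Stand-in spatio-temporal encoder; the paper uses a 3D CNN backbone.
        self.encoder = nn.Sequential(
            nn.Conv3d(3, dim, kernel_size=(3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep the temporal axis, pool space
        )
        # Self-attention across the multi-stride query clips ("scale space").
        self.query_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Self-attention over the target sequence (context within the sequence).
        self.target_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Mutual (cross-) attention: target positions attend to the query scales.
        self.mutual_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def encode(self, video):                       # video: (B, 3, T, H, W)
        feats = self.encoder(video)                # (B, dim, T, 1, 1)
        return feats.flatten(2).transpose(1, 2)    # (B, T, dim)

    def multi_stride_queries(self, query):         # query: (B, 3, T, H, W)
        clips = []
        for s in self.strides:
            sub = query[:, :, ::s]                                      # temporal stride s
            idx = torch.linspace(0, sub.shape[2] - 1, self.clip_len).long()
            clips.append(sub[:, :, idx])                                # fixed-length clip
        return clips

    def forward(self, query, target):
        # Encode each strided query clip and mean-pool it to one token per scale.
        q_tokens = torch.stack(
            [self.encode(c).mean(dim=1) for c in self.multi_stride_queries(query)], dim=1
        )                                          # (B, num_scales, dim)
        q_tokens, _ = self.query_attn(q_tokens, q_tokens, q_tokens)

        t_feats = self.encode(target)              # (B, T_target, dim)
        t_feats, _ = self.target_attn(t_feats, t_feats, t_feats)

        # Each target position attends to the query scales; score frame-wise presence.
        matched, _ = self.mutual_attn(t_feats, q_tokens, q_tokens)
        return self.score(matched).squeeze(-1)     # (B, T_target) localization scores


query = torch.randn(1, 3, 24, 64, 64)        # isolated sign example (query)
target = torch.randn(1, 3, 80, 64, 64)       # continuous signing video (target)
scores = SignSpotterSketch()(query, target)  # per-frame matching scores over the target
```
A score sequence like this could then be thresholded or peak-picked to answer the whether/where question posed in the abstract.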
Related papers
- Continuous Sign Language Recognition Using Intra-inter Gloss Attention [0.0]
In this study, we introduce a novel module for sign language recognition, called the intra-inter gloss attention module.
In the intra-gloss attention module, the video is divided into equally sized chunks and a self-attention mechanism is applied within each chunk.
Experimental results on the PHOENIX-2014 benchmark dataset demonstrate that our method can effectively extract sign language features in an end-to-end manner.
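To make the chunk-wise (intra-gloss) self-attention concrete, a minimal sketch follows; the feature dimension, chunk size, and the assumption of pre-extracted frame features are illustrative only, and the inter-gloss part of the module is not shown.
```python
import torch
import torch.nn as nn

class IntraChunkAttention(nn.Module):
    """Self-attention restricted to equally sized temporal chunks (illustrative)."""

    def __init__(self, dim=512, heads=8, chunk_size=8):
        super().__init__()
        self.chunk_size = chunk_size
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frames):                      # frames: (B, T, dim), T divisible by chunk_size
        b, t, d = frames.shape
        chunks = frames.reshape(b * t // self.chunk_size, self.chunk_size, d)
        out, _ = self.attn(chunks, chunks, chunks)  # attention only inside each chunk
        return out.reshape(b, t, d)

feats = torch.randn(2, 64, 512)                     # e.g. 64 frame features per video
out = IntraChunkAttention()(feats)                  # same shape, chunk-local context
```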
arXiv Detail & Related papers (2024-06-26T13:21:08Z)
- Feature Disentanglement Learning with Switching and Aggregation for Video-based Person Re-Identification [9.068045610800667]
In video person re-identification (Re-ID), the network must consistently extract features of the target person from successive frames.
Existing methods tend to focus only on how to use temporal information, which often leads to networks being fooled by similar appearances and same backgrounds.
We propose a Disentanglement and Switching and Aggregation Network (DSANet), which segregates the features representing identity and features based on camera characteristics, and pays more attention to ID information.
arXiv Detail & Related papers (2022-12-16T04:27:56Z)
- Tag-Based Attention Guided Bottom-Up Approach for Video Instance Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple end-to-end trainable bottom-up approach that produces instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach.
Our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets, with the lowest run-time among contemporary state-of-the-art methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z)
- Less than Few: Self-Shot Video Instance Segmentation [50.637278655763616]
We propose to automatically learn to find appropriate support videos given a query.
We tackle, for the first time, video instance segmentation in a self-shot (and few-shot) setting.
We provide strong baselines that utilize a novel transformer-based model.
arXiv Detail & Related papers (2022-04-19T13:14:43Z)
- TSPNet: Hierarchical Feature Learning via Temporal Semantic Pyramid for Sign Language Translation [101.6042317204022]
Sign language translation (SLT) aims to interpret sign video sequences into text-based natural language sentences.
Existing SLT models usually represent sign visual features in a frame-wise manner.
We develop a novel hierarchical sign video feature learning method via a temporal semantic pyramid network, called TSPNet.
arXiv Detail & Related papers (2020-10-12T05:58:09Z)
- Watch, read and lookup: learning to spot signs from multiple supervisors [99.50956498009094]
Given a video of an isolated sign, our task is to identify whether and where it has been signed in a continuous, co-articulated sign language video.
We train a model using multiple types of available supervision by: (1) watching existing sparsely labelled footage; (2) reading associated subtitles, which provide additional weak supervision; and (3) looking up words in visual sign language dictionaries.
These three tasks are integrated into a unified learning framework using the principles of Noise Contrastive Estimation and Multiple Instance Learning.
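For illustration only, a generic MIL-NCE-style objective in this spirit might look as follows: each query sign embedding is scored against a bag of candidate windows from its paired continuous video, with windows from other videos in the batch acting as negatives. The shapes, temperature, and bag construction here are assumptions, not the paper's actual formulation.
```python
import torch
import torch.nn.functional as F

def mil_nce_loss(query_emb, candidate_emb, temperature=0.07):
    # query_emb:     (B, D)     one embedding per isolated query sign
    # candidate_emb: (B, K, D)  K candidate windows per paired continuous video
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(candidate_emb, dim=-1)
    # Similarity of every query to every candidate window in the batch: (B, B, K)
    sim = torch.einsum("bd,nkd->bnk", q, c) / temperature
    sim = sim.flatten(1)                                   # (B, B*K)
    # Positives are the K windows of the paired video; everything else is negative.
    k = candidate_emb.shape[1]
    pos_mask = torch.zeros_like(sim, dtype=torch.bool)
    for i in range(q.shape[0]):
        pos_mask[i, i * k:(i + 1) * k] = True
    pos = torch.logsumexp(sim.masked_fill(~pos_mask, float("-inf")), dim=1)
    total = torch.logsumexp(sim, dim=1)
    return (total - pos).mean()    # -log( sum_pos / (sum_pos + sum_neg) )

loss = mil_nce_loss(torch.randn(4, 256), torch.randn(4, 8, 256))
```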
arXiv Detail & Related papers (2020-10-08T14:12:56Z)
- Unsupervised Learning of Video Representations via Dense Trajectory Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
We first propose to adapt two top-performing objectives in this class: instance recognition and local aggregation.
We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z)