Boosting Continuous Sign Language Recognition via Cross Modality Augmentation
- URL: http://arxiv.org/abs/2010.05264v1
- Date: Sun, 11 Oct 2020 15:07:50 GMT
- Title: Boosting Continuous Sign Language Recognition via Cross Modality Augmentation
- Authors: Junfu Pu, Wengang Zhou, Hezhen Hu, Houqiang Li
- Abstract summary: Continuous sign language recognition deals with unaligned video-text pairs.
We propose a novel architecture with cross modality augmentation.
The proposed framework can be easily extended to other existing CTC-based continuous SLR architectures.
- Score: 135.30357113518127
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Continuous sign language recognition (SLR) deals with unaligned video-text
pairs and uses the word error rate (WER), i.e., edit distance, as the main
evaluation metric. Since WER is not differentiable, we usually optimize the
learning model instead with the connectionist temporal classification (CTC)
objective, which maximizes the posterior probability over the sequential
alignment. Due to the optimization gap, the predicted sentence with the highest
decoding probability may not be the best choice under the WER metric. To tackle
this issue, we propose a novel architecture with cross modality augmentation.
Specifically, we first augment cross-modal data by simulating the calculation
procedure of WER, i.e., applying substitution, deletion, and insertion to both
the text label and its corresponding video. With these real and generated
pseudo video-text
pairs, we propose multiple loss terms to minimize the cross modality distance
between the video and ground truth label, and make the network distinguish the
difference between real and pseudo modalities. The proposed framework can be
easily extended to other existing CTC-based continuous SLR architectures.
Extensive experiments on two continuous SLR benchmarks, i.e.,
RWTH-PHOENIX-Weather and CSL, validate the effectiveness of our proposed
method.
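To make the augmentation concrete: below is a rough, label-side-only sketch of one WER-style edit on a gloss sequence. It is not the authors' code; `VOCAB` and `augment_label` are hypothetical names, and the matching edit on the aligned video segment (the other half of the pseudo pair) is omitted.

```python
import random

# Hypothetical gloss vocabulary; the real one comes from the SLR corpus.
VOCAB = ["HELLO", "TODAY", "WEATHER", "RAIN", "SUN"]

def augment_label(glosses, vocab=VOCAB):
    """Apply one WER-style edit (substitution, deletion, or insertion)
    to a gloss sequence, producing a pseudo label. In the paper the
    corresponding video segment is edited in the same way, yielding a
    pseudo video-text pair; only the text side is sketched here."""
    glosses = list(glosses)
    op = random.choice(["substitute", "delete", "insert"])
    i = random.randrange(len(glosses))
    if op == "substitute":
        glosses[i] = random.choice(vocab)        # swap one gloss
    elif op == "delete" and len(glosses) > 1:
        del glosses[i]                           # drop one gloss
    else:
        glosses.insert(i, random.choice(vocab))  # add one gloss
    return glosses

print(augment_label(["HELLO", "TODAY", "WEATHER"]))
```

The paper's loss terms then pull real video-text pairs together while teaching the network to separate real pairs from such pseudo pairs.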
Related papers
- Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter [57.64003871384959]
This work presents a new approach to fast context-biasing with a CTC-based Word Spotter.
The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates.
The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER.
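The compact context graph itself is beyond a short example, but the frame-level quantity such a spotter builds on is the standard CTC forward probability. A minimal NumPy sketch (illustrative names, not the paper's implementation), which a spotter could evaluate over sliding windows of an utterance:

```python
import numpy as np

def ctc_logprob(log_probs, labels, blank=0):
    """Full CTC forward score: log P(labels | log_probs), where log_probs
    is a (T, V) matrix of per-frame log-probabilities over V classes
    (index `blank` reserved for the CTC blank)."""
    ext = [blank]
    for l in labels:                 # interleave labels with blanks
        ext += [l, blank]
    S, T = len(ext), log_probs.shape[0]
    alpha = np.full(S, -np.inf)
    alpha[0] = log_probs[0, blank]
    alpha[1] = log_probs[0, ext[1]]
    for t in range(1, T):
        new = np.full(S, -np.inf)
        for s in range(S):
            cands = [alpha[s]]
            if s > 0:
                cands.append(alpha[s - 1])
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])
            new[s] = np.logaddexp.reduce(cands) + log_probs[t, ext[s]]
        alpha = new
    return np.logaddexp(alpha[-1], alpha[-2])

rng = np.random.default_rng(0)
lp = np.log(rng.dirichlet(np.ones(20), size=40))   # 40 frames, 20 classes
print(ctc_logprob(lp, [3, 7, 7, 5]))               # score for one "word"
```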
arXiv Detail & Related papers (2024-06-11T09:37:52Z)
- A Flexible Recursive Network for Video Stereo Matching Based on Residual Estimation [0.9362376508480733]
RecSM is a network based on residual estimation for video stereo matching.
With a stack count of 3, RecSM achieves a 4x speedup over ACVNet, running at 0.054 seconds on a single NVIDIA 2080Ti GPU.
arXiv Detail & Related papers (2024-06-05T14:49:14Z)
- AdaBrowse: Adaptive Video Browser for Efficient Continuous Sign Language Recognition [39.778958624066185]
We propose a novel model (AdaBrowse) to dynamically select the most informative subsequence from input video sequences.
AdaBrowse achieves accuracy comparable to state-of-the-art methods with 1.44× higher throughput and 2.12× fewer FLOPs.
arXiv Detail & Related papers (2023-08-16T12:40:47Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Correspondence Matters for Video Referring Expression Comprehension [64.60046797561455]
Video Referring Expression Comprehension (REC) aims to localize the referent objects described in the sentence to visual regions in the video frames.
Existing methods suffer from two problems: 1) inconsistent localization results across video frames; 2) confusion between the referent and contextual objects.
We propose a novel Dual Correspondence Network (dubbed DCNet) which explicitly enhances the dense associations in both the inter-frame and cross-modal manners.
arXiv Detail & Related papers (2022-07-21T10:31:39Z)
- Disentangled Representation Learning for Text-Video Retrieval [51.861423831566626]
Cross-modality interaction is a critical component in Text-Video Retrieval (TVR).
We study the interaction paradigm in depth and find that its computation can be split into two terms.
We propose a disentangled framework to capture a sequential and hierarchical representation.
arXiv Detail & Related papers (2022-03-14T13:55:33Z)
- Rescoring Sequence-to-Sequence Models for Text Line Recognition with CTC-Prefixes [0.0]
We propose to use the CTC-Prefix-Score during S2S decoding.
During beam search, paths that are invalid according to the CTC confidence matrix are penalised.
We evaluate this setup on three HTR data sets: IAM, Rimes, and StAZH.
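The paper applies the CTC-Prefix-Score at each step inside beam search; as a simplified stand-in (n-best rescoring with a full-sequence CTC score, not the per-prefix score from the paper), interpolating the two model scores might look like the sketch below, where `combined_rescore` and `lam` are hypothetical names.

```python
def combined_rescore(nbest, ctc_score, lam=0.3):
    """Pick the hypothesis maximizing an interpolation of the S2S and CTC
    log-probabilities. nbest is a list of (tokens, s2s_logprob) pairs and
    ctc_score maps a token sequence to its CTC log-probability."""
    return max(nbest, key=lambda h: (1 - lam) * h[1] + lam * ctc_score(h[0]))

# Toy usage with a fake CTC scorer that prefers shorter hypotheses.
nbest = [([3, 7, 7], -2.1), ([3, 7], -2.4)]
best_tokens, best_s2s = combined_rescore(nbest, lambda t: -0.5 * len(t))
print(best_tokens)
```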
arXiv Detail & Related papers (2021-10-12T11:40:05Z)
- Few-Shot Action Recognition with Compromised Metric via Optimal Transport [31.834843714684343]
Few-shot action recognition is still not mature despite extensive research on few-shot image classification.
One main obstacle to applying these algorithms in action recognition is the complex structure of videos.
We propose Compromised Metric via Optimal Transport (CMOT) to combine the advantages of these two solutions.
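The summary does not give CMOT's exact formulation; purely as an illustration of the optimal-transport ingredient, here is a minimal entropy-regularized Sinkhorn sketch with uniform marginals over video segments (all names hypothetical).

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.1, n_iter=200):
    """Entropy-regularized optimal transport between uniform distributions
    over the rows/columns of `cost` (e.g., segments of two videos).
    Returns the transport plan; sum(plan * cost) is the OT distance."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)  # uniform marginals
    K = np.exp(-cost / eps)                          # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):                          # Sinkhorn iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

cost = np.random.rand(4, 6)          # toy segment-to-segment cost matrix
plan = sinkhorn_plan(cost)
print((plan * cost).sum())           # OT distance between the two videos
```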
arXiv Detail & Related papers (2021-04-08T12:42:05Z)
- Inter-class Discrepancy Alignment for Face Recognition [55.578063356210144]
We propose a unified framework called Inter-class Discrepancy Alignment (IDA).
IDA-DAO is used to align the similarity scores considering the discrepancy between an image and its neighbors.
IDA-SSE can provide convincing inter-class neighbors by introducing virtual candidate images generated with GAN.
arXiv Detail & Related papers (2021-03-02T08:20:08Z)
- Neural Non-Rigid Tracking [26.41847163649205]
We introduce a novel, end-to-end learnable, differentiable non-rigid tracker.
We employ a convolutional neural network to predict dense correspondences and their confidences.
Compared to state-of-the-art approaches, our algorithm shows improved reconstruction performance.
arXiv Detail & Related papers (2020-06-23T18:00:39Z)