Denoising-Contrastive Alignment for Continuous Sign Language Recognition
- URL: http://arxiv.org/abs/2305.03614v5
- Date: Sun, 01 Dec 2024 12:06:46 GMT
- Title: Denoising-Contrastive Alignment for Continuous Sign Language Recognition
- Authors: Leming Guo, Wanli Xue, Shengyong Chen
- Abstract summary: Continuous sign language recognition aims to recognize the signs in untrimmed sign language videos and transcribe them into textual glosses.
Current cross-modality alignment paradigms often neglect the role of textual grammar in guiding the video representation.
We propose a Denoising-Contrastive Alignment paradigm to enhance video representations.
- Score: 22.800767994061175
- License:
- Abstract: Continuous sign language recognition (CSLR) aims to recognize the signs in untrimmed sign language videos and transcribe them into textual glosses. A key challenge of CSLR is achieving effective cross-modality alignment between the video and gloss sequences to enhance the video representation. However, current cross-modality alignment paradigms often neglect the role of textual grammar in guiding the video representation to learn global temporal context, which adversely affects recognition performance. To tackle this limitation, we propose a Denoising-Contrastive Alignment (DCA) paradigm. DCA leverages textual grammar to enhance video representations through two complementary approaches: modeling the instance correspondence between signs and glosses from a discriminative perspective, and aligning their global context from a generative perspective. Specifically, DCA establishes flexible instance-level correspondence between signs and glosses using a contrastive loss. Building on this, DCA models global context alignment between the video and gloss sequences by denoising the gloss representation from noise, guided by the video representation. Additionally, DCA introduces gradient modulation to optimize the alignment and recognition gradients, ensuring a more effective learning process. By integrating gloss-wise and global context knowledge, DCA significantly enhances video representations for CSLR. Experimental results on public benchmarks validate the effectiveness of DCA and confirm the feasibility of its video representation enhancement.
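As a rough illustration of the two alignment objectives described in the abstract, the sketch below shows an instance-level contrastive loss between sign-segment features and gloss embeddings, plus a simplified denoising term that reconstructs gloss representations from a noised copy conditioned on a video-level context vector. The shapes, the Gaussian noise model, and the `denoiser` module are assumptions for illustration, not the authors' implementation; the paper's gradient modulation between the alignment and recognition losses is not shown.

```python
import torch
import torch.nn.functional as F


def instance_contrastive_loss(sign_feats, gloss_embeds, temperature=0.07):
    """InfoNCE-style loss pairing each sign-segment feature with its gloss embedding.

    sign_feats:   (N, D) features pooled from video segments (illustrative shape)
    gloss_embeds: (N, D) embeddings of the corresponding glosses
    """
    sign = F.normalize(sign_feats, dim=-1)
    gloss = F.normalize(gloss_embeds, dim=-1)
    logits = sign @ gloss.t() / temperature            # (N, N) similarity matrix
    targets = torch.arange(sign.size(0), device=sign.device)
    # Symmetric cross-entropy: match signs to glosses and glosses to signs.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def denoising_alignment_loss(denoiser, gloss_embeds, video_context, noise_std=0.1):
    """Simplified generative view of global-context alignment: recover clean gloss
    embeddings from a noised copy, conditioned on a video-level context vector."""
    noisy = gloss_embeds + noise_std * torch.randn_like(gloss_embeds)
    cond = video_context.expand_as(gloss_embeds)        # broadcast video context
    recon = denoiser(torch.cat([noisy, cond], dim=-1))  # denoiser maps 2*D -> D
    return F.mse_loss(recon, gloss_embeds)
```

In this sketch any small MLP mapping dimension 2*D back to D (e.g. two `nn.Linear` layers with a ReLU) would serve as the `denoiser`.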
Related papers
- Label-anticipated Event Disentanglement for Audio-Visual Video Parsing [61.08434062821899]
We introduce a new decoding paradigm, label semantic-based projection (LEAP).
LEAP works by iteratively projecting encoded latent features of audio/visual segments onto semantically independent label embeddings.
To facilitate the LEAP paradigm, we propose a semantic-aware optimization strategy, which includes a novel audio-visual semantic similarity loss function.
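As a loose sketch of the projection idea described above (not the LEAP authors' code), the snippet below projects segment features onto label embeddings and adds a toy audio-visual semantic similarity term; the KL-based form of that loss is an assumption.

```python
import torch
import torch.nn.functional as F


def project_onto_labels(segment_feats, label_embeds):
    """Project encoded audio/visual segment features onto label embeddings.

    segment_feats: (T, D) latent features of audio or visual segments
    label_embeds:  (C, D) one embedding per event label
    Returns a (T, C) score matrix read as per-segment label evidence.
    """
    seg = F.normalize(segment_feats, dim=-1)
    lab = F.normalize(label_embeds, dim=-1)
    return seg @ lab.t()


def semantic_similarity_loss(audio_scores, visual_scores):
    """Toy cross-modal term: encourage the video-level label distributions
    implied by the audio and visual streams to agree."""
    p_audio = F.log_softmax(audio_scores.mean(dim=0), dim=-1)
    p_visual = F.softmax(visual_scores.mean(dim=0), dim=-1)
    return F.kl_div(p_audio, p_visual, reduction="sum")
```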
arXiv Detail & Related papers (2024-07-11T01:57:08Z) - Unsupervised Modality-Transferable Video Highlight Detection with Representation Activation Sequence Learning [7.908887001497406]
We propose a novel model with cross-modal perception for unsupervised highlight detection.
The proposed model learns representations with visual-audio level semantics from image-audio pair data via a self-reconstruction task.
The experimental results show that the proposed framework achieves superior performance compared to other state-of-the-art approaches.
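A minimal sketch of a self-reconstruction objective over image-audio pairs, assuming precomputed per-modality features; the encoder/decoder sizes and the shared-latent design are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalSelfReconstruction(nn.Module):
    """Encode each modality, then reconstruct both from the joint representation,
    so the latent space carries shared visual-audio semantics."""
    def __init__(self, img_dim=2048, aud_dim=128, hidden=256):
        super().__init__()
        self.img_enc = nn.Linear(img_dim, hidden)
        self.aud_enc = nn.Linear(aud_dim, hidden)
        self.img_dec = nn.Linear(2 * hidden, img_dim)
        self.aud_dec = nn.Linear(2 * hidden, aud_dim)

    def forward(self, img_feat, aud_feat):
        # img_feat: (B, img_dim), aud_feat: (B, aud_dim) precomputed features
        z = torch.cat([self.img_enc(img_feat), self.aud_enc(aud_feat)], dim=-1)
        return (F.mse_loss(self.img_dec(z), img_feat) +
                F.mse_loss(self.aud_dec(z), aud_feat))
```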
arXiv Detail & Related papers (2024-03-14T13:52:03Z) - SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by Visual-Textual Contrastive Learning [51.800031281177105]
SignVTCL is a continuous sign language recognition framework enhanced by visual-textual contrastive learning.
It integrates multi-modal data (video, keypoints, and optical flow) simultaneously to train a unified visual backbone.
It achieves state-of-the-art results compared with previous methods.
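A rough sketch of fusing the three visual streams into one backbone, assuming per-frame features for each modality; concatenation plus projection is a placeholder for whatever fusion SignVTCL actually uses, and the visual-textual contrastive part would resemble the contrastive sketch after the abstract above.

```python
import torch
import torch.nn as nn


class MultiStreamFusion(nn.Module):
    """Toy unified visual backbone: separate encoders for RGB video, keypoints,
    and optical flow, fused by concatenation and a projection layer."""
    def __init__(self, rgb_dim=1024, kp_dim=256, flow_dim=512, out_dim=512):
        super().__init__()
        self.rgb = nn.Linear(rgb_dim, out_dim)
        self.kp = nn.Linear(kp_dim, out_dim)
        self.flow = nn.Linear(flow_dim, out_dim)
        self.fuse = nn.Linear(3 * out_dim, out_dim)

    def forward(self, rgb_feat, kp_feat, flow_feat):
        # Each input: (T, dim) per-frame features for one modality.
        h = torch.cat([self.rgb(rgb_feat), self.kp(kp_feat), self.flow(flow_feat)],
                      dim=-1)
        return self.fuse(torch.relu(h))   # (T, out_dim) fused per-frame features
```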
arXiv Detail & Related papers (2024-01-22T11:04:55Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
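An illustrative cross-modal attention layer in which audio features query visual (lip) features, in the spirit of the audio-guided fusion encoder; the head count, dimension, and residual design are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn


class AudioGuidedCrossAttention(nn.Module):
    """Audio features act as queries and attend over visual (lip) features,
    fusing the two streams with a residual connection and layer norm."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feat, visual_feat):
        # audio_feat: (B, T_a, dim), visual_feat: (B, T_v, dim)
        fused, _ = self.attn(query=audio_feat, key=visual_feat, value=visual_feat)
        return self.norm(audio_feat + fused)
```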
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition with Variational Alignment [42.10603331311837]
Sign language recognition (SLR) is a weakly supervised task that annotates sign videos as textual glosses.
Recent studies show that insufficient training caused by the lack of large-scale available sign datasets becomes the main bottleneck for SLR.
We propose a novel Contrastive Visual-Textual Transformation for SLR, CVT-SLR, to fully explore the pretrained knowledge of both the visual and language modalities.
arXiv Detail & Related papers (2023-03-10T06:12:36Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms state-of-the-art methods without any post-processing.
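A hedged sketch of a text-to-pixel contrastive term: per-pixel features are compared against a sentence embedding and supervised by the referring mask. The BCE-over-similarity formulation below is a simplification for illustration, not CRIS's exact objective.

```python
import torch
import torch.nn.functional as F


def text_to_pixel_alignment(pixel_feats, text_embed, mask, temperature=0.07):
    """Pull pixel features inside the referred mask toward the sentence embedding
    and push background pixels away.

    pixel_feats: (H*W, D) per-pixel features from a visual decoder
    text_embed:  (D,) sentence-level embedding of the referring expression
    mask:        (H*W,) binary ground-truth mask (1 = referred object)
    """
    pix = F.normalize(pixel_feats, dim=-1)
    txt = F.normalize(text_embed, dim=-1)
    logits = (pix @ txt) / temperature          # (H*W,) similarity per pixel
    return F.binary_cross_entropy_with_logits(logits, mask.float())
```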
arXiv Detail & Related papers (2021-11-30T07:29:08Z) - DVCFlow: Modeling Information Flow Towards Human-like Video Captioning [163.71539565491113]
Existing methods mainly generate captions from individual video segments, lacking adaptation to the global visual context.
We introduce the concept of information flow to model the progressive change of information across the video sequence and its captions.
Our method significantly outperforms competitive baselines, and generates more human-like text according to subjective and objective tests.
arXiv Detail & Related papers (2021-11-19T10:46:45Z) - Visual-aware Attention Dual-stream Decoder for Video Captioning [12.139806877591212]
The attention mechanism in current video captioning methods learns to assign a weight to each frame, prompting the decoder dynamically.
However, this does not explicitly model the correlation and temporal coherence of the visual features extracted from successive frames.
We propose a new Visual-aware Attention (VA) model, which unifies the changes across temporal sequence frames with the words generated at the previous moment.
The effectiveness of the proposed Visual-aware Attention Dual-stream Decoder (VADD) is demonstrated.
arXiv Detail & Related papers (2021-10-16T14:08:20Z) - Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
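As a loose illustration of combining intra- and inter-video associations (not the paper's actual objective), the snippet below treats the corresponding patch in another frame of the same video as the positive and patches sampled from other videos as negatives.

```python
import torch
import torch.nn.functional as F


def intra_inter_video_contrastive(anchor, positive, other_videos, temperature=0.07):
    """Correspondence loss sketch: an anchor patch feature should match its
    corresponding patch within the same video (intra-video positive) and not
    match patches drawn from different videos (inter-video negatives).

    anchor, positive: (D,) features of corresponding patches within one video
    other_videos:     (K, D) patch features drawn from different videos
    """
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(other_videos, dim=-1)
    # First logit is the positive pair; the rest are inter-video negatives.
    logits = torch.cat([(a * p).sum(-1, keepdim=True), n @ a]) / temperature
    target = torch.zeros(1, dtype=torch.long)
    return F.cross_entropy(logits.unsqueeze(0), target)
```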
arXiv Detail & Related papers (2020-12-09T14:05:06Z)