Denoising-Diffusion Alignment for Continuous Sign Language Recognition
- URL: http://arxiv.org/abs/2305.03614v4
- Date: Fri, 3 May 2024 04:11:55 GMT
- Title: Denoising-Diffusion Alignment for Continuous Sign Language Recognition
- Authors: Leming Guo, Wanli Xue, Yuxi Zhou, Ze Kang, Tiantian Yuan, Zan Gao, Shengyong Chen
- Abstract summary: The key challenge of continuous sign language recognition (CSLR) is achieving cross-modality alignment between videos and gloss sequences.
We propose a novel Denoising-Diffusion global Alignment (DDA).
DDA uses diffusion-based global alignment techniques to align video with the gloss sequence, facilitating global temporal context alignment.
- Score: 24.376213903941746
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Continuous sign language recognition (CSLR) aims to promote active and accessible communication for the hearing impaired by sequentially recognizing signs in untrimmed sign language videos as textual glosses. The key challenge of CSLR is achieving cross-modality alignment between videos and gloss sequences. However, current cross-modality paradigms for CSLR overlook using the gloss context to guide the video clips toward global temporal context alignment, which degrades the visual-to-gloss mapping and is detrimental to recognition performance. To tackle this problem, we propose a novel Denoising-Diffusion global Alignment (DDA), which consists of a denoising-diffusion autoencoder and a DDA loss function. DDA leverages diffusion-based global alignment techniques to align video with the gloss sequence, facilitating global temporal context alignment. Specifically, DDA first proposes an auxiliary condition diffusion that produces gloss-part noised bimodal representations for the video and gloss sequence. To address the problem that the recognition-oriented alignment knowledge represented in the diffusion denoising process cannot be fed back, DDA further proposes the denoising-diffusion autoencoder, which adds a decoder to the auxiliary condition diffusion to denoise the partially noised bimodal representations via the designed DDA loss in a self-supervised manner. In the denoising process, each video clip representation can be reliably guided to re-establish the global temporal context by denoising the gloss sequence representation. Experiments on three public benchmarks demonstrate that our DDA achieves state-of-the-art performance and confirm the feasibility of DDA for video representation enhancement.
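The "gloss-part noised" bimodal representation described in the abstract can be illustrated with a standard DDPM-style forward process: the video features stay clean while the gloss features are diffused and must be denoised under video guidance. This is a minimal sketch, not the paper's implementation; all dimensions, the noise schedule, and the `dda_like_loss` helper are hypothetical, and the denoising decoder itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: T video clips, d-dim features per clip / gloss.
T_clips, d = 6, 16
video = rng.normal(size=(T_clips, d))   # stand-in for clip representations
gloss = rng.normal(size=(T_clips, d))   # stand-in for aligned gloss representations

# Linear noise schedule, as in standard DDPM-style diffusion.
steps = 100
betas = np.linspace(1e-4, 2e-2, steps)
alpha_bar = np.cumprod(1.0 - betas)     # cumulative signal-retention factor

def noise_gloss(x0, t):
    """Forward diffusion: noise only the gloss part of the bimodal pair."""
    eps = rng.normal(size=x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

# Gloss-part noised bimodal representation: clean video concatenated with
# gloss features diffused to step t.
t = 50
noisy_gloss, eps = noise_gloss(gloss, t)
bimodal = np.concatenate([video, noisy_gloss], axis=-1)

# A denoising objective in the spirit of the DDA loss: the (omitted) decoder
# would predict eps; a perfect prediction gives zero loss.
def dda_like_loss(eps_pred, eps_true):
    return float(np.mean((eps_pred - eps_true) ** 2))

print(bimodal.shape)               # (6, 32)
print(dda_like_loss(eps, eps))     # 0.0
```

The noise schedule is monotonically decreasing in `alpha_bar`, so larger `t` means the gloss part carries less signal and the denoiser must rely more on the clean video half of the pair.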
Related papers
- Label-anticipated Event Disentanglement for Audio-Visual Video Parsing [61.08434062821899]
We introduce a new decoding paradigm, label semantic-based projection (LEAP).
LEAP works by iteratively projecting encoded latent features of audio/visual segments onto semantically independent label embeddings.
To facilitate the LEAP paradigm, we propose a semantic-aware optimization strategy, which includes a novel audio-visual semantic similarity loss function.
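The projection step LEAP describes, mapping encoded segment features onto label embeddings, can be sketched as a similarity-weighted projection. This is an illustrative sketch under assumed shapes; `project_onto_labels` and all dimensions are hypothetical, not the paper's API.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: S audio/visual segments, C event labels, d-dim embeddings.
S, C, d = 4, 5, 8
segments = rng.normal(size=(S, d))     # encoded segment latent features
label_emb = rng.normal(size=(C, d))    # semantically independent label embeddings

def project_onto_labels(feats, labels):
    """Project each segment feature onto the label-embedding basis:
    similarities -> softmax over labels -> label-space representation."""
    sims = feats @ labels.T                           # (S, C) similarity scores
    weights = np.exp(sims - sims.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)     # softmax over labels
    return weights @ labels                           # (S, d) projected features

projected = project_onto_labels(segments, label_emb)
print(projected.shape)  # (4, 8)
```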
arXiv Detail & Related papers (2024-07-11T01:57:08Z) - Unsupervised Modality-Transferable Video Highlight Detection with Representation Activation Sequence Learning [7.908887001497406]
We propose a novel model with cross-modal perception for unsupervised highlight detection.
The proposed model learns representations with visual-audio level semantics from image-audio pair data via a self-reconstruction task.
The experimental results show that the proposed framework achieves superior performance compared to other state-of-the-art approaches.
arXiv Detail & Related papers (2024-03-14T13:52:03Z) - SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by Visual-Textual Contrastive Learning [51.800031281177105]
SignVTCL is a continuous sign language recognition framework enhanced by visual-textual contrastive learning.
It integrates multi-modal data (video, keypoints, and optical flow) simultaneously to train a unified visual backbone.
It achieves state-of-the-art results compared with previous methods.
arXiv Detail & Related papers (2024-01-22T11:04:55Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
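The audio-guided fusion idea can be sketched as a single cross-modal attention layer in which audio features query visual features. This is a minimal sketch under assumed shapes; the dimensions and the `cross_modal_attention` helper are hypothetical and omit the multi-layer structure of the actual CMFE.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical sizes: Ta audio frames, Tv visual frames, d-dim features.
Ta, Tv, d = 5, 7, 8
audio = rng.normal(size=(Ta, d))
visual = rng.normal(size=(Tv, d))

def cross_modal_attention(q_feats, kv_feats):
    """One audio-guided cross-modal attention layer (sketch): audio features
    query visual features; the attended result is fused via a residual add."""
    scores = q_feats @ kv_feats.T / np.sqrt(d)        # (Ta, Tv) scaled dot products
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                 # softmax over visual frames
    return q_feats + w @ kv_feats                     # residual fusion

fused = cross_modal_attention(audio, visual)
print(fused.shape)  # (5, 8)
```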
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition with Variational Alignment [42.10603331311837]
Sign language recognition (SLR) is a weakly supervised task that annotates sign videos as textual glosses.
Recent studies show that insufficient training caused by the lack of large-scale available sign datasets becomes the main bottleneck for SLR.
We propose a novel contrastive visual-textual transformation for SLR, CVT-SLR, to fully explore the pretrained knowledge of both the visual and language modalities.
arXiv Detail & Related papers (2023-03-10T06:12:36Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state of the art without any post-processing.
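The text-to-pixel alignment CRIS aims for can be sketched as a per-pixel contrastive objective: pull the sentence feature toward pixels inside the referred region and push it away from the rest. This is an illustrative sketch, not the paper's loss; all shapes and the `text_to_pixel_contrastive` helper are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical sizes: H*W pixel features and one sentence feature, both d-dim.
HW, d = 12, 8
pixel_feats = rng.normal(size=(HW, d))
text_feat = rng.normal(size=(d,))
mask = rng.integers(0, 2, size=HW)  # 1 = pixel belongs to the referred object

def text_to_pixel_contrastive(pix, txt, mask):
    """Contrastive alignment: per-pixel sigmoid cross-entropy on text-pixel
    similarities, with the segmentation mask as positive/negative labels."""
    logits = pix @ txt                                 # (HW,) similarity per pixel
    probs = 1.0 / (1.0 + np.exp(-logits))              # sigmoid
    tiny = 1e-8                                        # numerical floor for the logs
    loss = -(mask * np.log(probs + tiny) + (1 - mask) * np.log(1 - probs + tiny))
    return float(loss.mean())

print(text_to_pixel_contrastive(pixel_feats, text_feat, mask) > 0)  # True
```

Minimizing this loss drives positive-pixel similarities up and negative-pixel similarities down, which is the sense in which the text feature becomes aligned to pixels rather than to a pooled image feature.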
arXiv Detail & Related papers (2021-11-30T07:29:08Z) - DVCFlow: Modeling Information Flow Towards Human-like Video Captioning [163.71539565491113]
Existing methods mainly generate captions from individual video segments, lacking adaptation to the global visual context.
We introduce the concept of information flow to model the progressive change of information across the video sequence and its captions.
Our method significantly outperforms competitive baselines and generates more human-like text according to subjective and objective tests.
arXiv Detail & Related papers (2021-11-19T10:46:45Z) - Visual-aware Attention Dual-stream Decoder for Video Captioning [12.139806877591212]
The attention mechanism in current video captioning methods learns to assign a weight to each frame, dynamically guiding the decoder.
This may not explicitly model the correlation and temporal coherence of the visual features extracted from consecutive frames.
We propose a new Visual-aware Attention (VA) model, which unifies changes of temporal sequence frames with the words at the previous moment.
The effectiveness of the proposed Visual-aware Attention Dual-stream Decoder (VADD) is demonstrated.
arXiv Detail & Related papers (2021-10-16T14:08:20Z) - Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.