DESign: Dynamic Context-Aware Convolution and Efficient Subnet Regularization for Continuous Sign Language Recognition
- URL: http://arxiv.org/abs/2507.03339v1
- Date: Fri, 04 Jul 2025 06:56:28 GMT
- Title: DESign: Dynamic Context-Aware Convolution and Efficient Subnet Regularization for Continuous Sign Language Recognition
- Authors: Sheng Liu, Yiheng Yu, Yuan Feng, Min Xu, Zhelun Jin, Yining Jiang, Tiantian Yuan
- Abstract summary: We propose DESign, a novel framework that incorporates Dynamic Context-Aware Convolution (DCAC) and Subnet Regularization Connectionist Temporal Classification (SR-CTC). DCAC dynamically captures the inter-frame motion cues that constitute signs and adapts convolutional weights based on contextual information. SR-CTC regularizes training by applying supervision to subnetworks, encouraging the model to explore diverse CTC alignment paths and effectively preventing overfitting.
- Score: 11.879737436401175
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current continuous sign language recognition (CSLR) methods struggle with handling diverse samples. Although dynamic convolutions are ideal for this task, they mainly focus on spatial modeling and fail to capture the temporal dynamics and contextual dependencies. To address this, we propose DESign, a novel framework that incorporates Dynamic Context-Aware Convolution (DCAC) and Subnet Regularization Connectionist Temporal Classification (SR-CTC). DCAC dynamically captures the inter-frame motion cues that constitute signs and uniquely adapts convolutional weights in a fine-grained manner based on contextual information, enabling the model to better generalize across diverse signing behaviors and boost recognition accuracy. Furthermore, we observe that existing methods still rely on only a limited number of frames for parameter updates during training, indicating that CTC learning overfits to a dominant path. To address this, SR-CTC regularizes training by applying supervision to subnetworks, encouraging the model to explore diverse CTC alignment paths and effectively preventing overfitting. A classifier-sharing strategy in SR-CTC further strengthens multi-scale consistency. Notably, SR-CTC introduces no inference overhead and can be seamlessly integrated into existing CSLR models to boost performance. Extensive ablations and visualizations further validate the effectiveness of the proposed methods. Results on mainstream CSLR datasets (i.e., PHOENIX14, PHOENIX14-T, CSL-Daily) demonstrate that DESign achieves state-of-the-art performance.
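The abstract's core idea in DCAC is a temporal convolution whose kernel weights are adapted per frame from contextual information. The following is a minimal NumPy sketch of that general idea only; the function name, the softmax gating scheme, and all shapes are illustrative assumptions, not the paper's actual DCAC formulation.

```python
import numpy as np

def dynamic_context_conv1d(x, base_kernel, ctx_proj):
    """Toy dynamic context-aware temporal convolution (illustrative only).

    x           : (T, C) sequence of frame features
    base_kernel : (K, C) shared temporal kernel
    ctx_proj    : (C, K) projection mapping a frame's features to
                  per-tap kernel gates (a stand-in for "context")
    Returns a (T, C) output; borders use zero padding.
    """
    T, C = x.shape
    K = base_kernel.shape[0]
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(T):
        # Context-dependent gates for this frame: softmax over the K taps,
        # so each frame re-weights the shared kernel differently.
        logits = x[t] @ ctx_proj
        gates = np.exp(logits - logits.max())
        gates /= gates.sum()
        # Modulate the shared kernel and convolve the local window.
        window = xp[t:t + K]                     # (K, C)
        kernel = base_kernel * gates[:, None]    # (K, C)
        out[t] = (window * kernel).sum(axis=0)
    return out
```

Because the gates depend on each frame's own features, two frames with different content apply different effective kernels, which is the property the abstract attributes to DCAC (adapting weights to diverse signing behaviors); a static convolution would apply the same kernel everywhere.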
Related papers
- Semi-supervised Semantic Segmentation with Multi-Constraint Consistency Learning [81.02648336552421]
We propose a Multi-Constraint Consistency Learning approach to facilitate the staged enhancement of the encoder and decoder. Self-adaptive feature masking and noise injection are designed in an instance-specific manner to perturb the features for robust learning of the decoder. Experimental results on the Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed MCCL achieves new state-of-the-art performance.
arXiv Detail & Related papers (2025-03-23T03:21:33Z) - Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter [57.64003871384959]
This work presents a new approach to fast context-biasing with CTC-based Word Spotter.
The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates.
The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER.
arXiv Detail & Related papers (2024-06-11T09:37:52Z) - Continuous Sign Language Recognition with Adapted Conformer via Unsupervised Pretraining [0.6144680854063939]
State-of-the-art Conformer model for Speech Recognition is adapted for continuous sign language recognition.
This marks the first instance of employing Conformer for a vision-based task.
Unsupervised pretraining is conducted on a curated sign language dataset.
arXiv Detail & Related papers (2024-05-20T13:40:52Z) - Unleashing Network Potentials for Semantic Scene Completion [50.95486458217653]
This paper proposes a novel SSC framework, the Adversarial Modality Modulation Network (AMMNet).
AMMNet introduces two core modules: a cross-modal modulation enabling the interdependence of gradient flows between modalities, and a customized adversarial training scheme leveraging dynamic gradient competition.
Extensive experimental results demonstrate that AMMNet outperforms state-of-the-art SSC methods by a large margin.
arXiv Detail & Related papers (2024-03-12T11:48:49Z) - SSLCL: An Efficient Model-Agnostic Supervised Contrastive Learning
Framework for Emotion Recognition in Conversations [20.856739541819056]
Emotion recognition in conversations (ERC) is a rapidly evolving task within the natural language processing community.
We propose an efficient and model-agnostic SCL framework named Supervised Sample-Label Contrastive Learning with Soft-HGR Maximal Correlation (SSLCL).
We introduce a novel perspective on utilizing label representations by projecting discrete labels into dense embeddings through a shallow multilayer perceptron.
arXiv Detail & Related papers (2023-10-25T14:41:14Z) - SCD-Net: Spatiotemporal Clues Disentanglement Network for
Self-supervised Skeleton-based Action Recognition [39.99711066167837]
This paper introduces a contrastive learning framework, namely the Spatiotemporal Clues Disentanglement Network (SCD-Net).
Specifically, we integrate the sequences with a feature extractor to derive explicit clues from spatial and temporal domains respectively.
We conduct evaluations on the NTU-RGB+D (60 & 120) and PKU-MMD (I & II) datasets, covering various downstream tasks such as action recognition, action retrieval, and transfer learning.
arXiv Detail & Related papers (2023-09-11T21:32:13Z) - Continual Vision-Language Representation Learning with Off-Diagonal
Information [112.39419069447902]
Multi-modal contrastive learning frameworks like CLIP typically require a large amount of image-text samples for training.
This paper discusses the feasibility of continual CLIP training using streaming data.
arXiv Detail & Related papers (2023-05-11T08:04:46Z) - DELTA: Dynamic Embedding Learning with Truncated Conscious Attention for
CTR Prediction [61.68415731896613]
Click-Through Rate (CTR) prediction is a pivotal task in product and content recommendation.
We propose a model that enables Dynamic Embedding Learning with Truncated Conscious Attention for CTR prediction.
arXiv Detail & Related papers (2023-05-03T12:34:45Z) - Adaptive Discrete Communication Bottlenecks with Dynamic Vector
Quantization [76.68866368409216]
We propose learning to dynamically select discretization tightness conditioned on inputs.
We show that dynamically varying tightness in communication bottlenecks can improve model performance on visual reasoning and reinforcement learning tasks.
arXiv Detail & Related papers (2022-02-02T23:54:26Z) - Alignment Knowledge Distillation for Online Streaming Attention-based
Speech Recognition [46.69852287267763]
This article describes an efficient training method for online streaming attention-based encoder-decoder (AED) automatic speech recognition (ASR) systems.
The proposed method significantly reduces recognition errors and emission latency simultaneously.
The best MoChA system shows performance comparable to that of the RNN-transducer (RNN-T).
arXiv Detail & Related papers (2021-02-28T08:17:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.