Cross-Modal Consistency Learning for Sign Language Recognition
- URL: http://arxiv.org/abs/2503.12485v2
- Date: Fri, 21 Mar 2025 09:36:55 GMT
- Title: Cross-Modal Consistency Learning for Sign Language Recognition
- Authors: Kepeng Wu, Zecheng Li, Hezhen Hu, Wengang Zhou, Houqiang Li
- Abstract summary: Existing pre-training methods solely focus on the compact pose data. We propose a Cross-modal Consistency Learning framework (CCL-SLR). CCL-SLR learns from both RGB and pose modalities based on self-supervised pre-training.
- Score: 92.44927164283641
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-training has been proven to be effective in boosting the performance of Isolated Sign Language Recognition (ISLR). Existing pre-training methods solely focus on the compact pose data, which eliminates background perturbation but inevitably suffers from insufficient semantic cues compared to raw RGB videos. Nevertheless, learning representations directly from RGB videos remains challenging due to the presence of sign-independent visual features. To address this dilemma, we propose a Cross-modal Consistency Learning framework (CCL-SLR), which leverages cross-modal consistency between the RGB and pose modalities based on self-supervised pre-training. First, CCL-SLR employs contrastive learning for instance discrimination within and across modalities. Through single-modal and cross-modal contrastive learning, CCL-SLR gradually aligns the feature spaces of the RGB and pose modalities, thereby extracting consistent sign representations. Second, we further introduce Motion-Preserving Masking (MPM) and Semantic Positive Mining (SPM) techniques to improve cross-modal consistency from the perspectives of data augmentation and sample similarity, respectively. Extensive experiments on four ISLR benchmarks show that CCL-SLR achieves impressive performance, demonstrating its effectiveness. The code will be released to the public.
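The abstract describes the pre-training objective only at a high level. As a rough illustration (not the authors' released code), the sketch below shows one way single-modal and cross-modal contrastive terms could be combined with a standard InfoNCE loss. The encoder interfaces, the symmetric loss formulation, the equal weighting of the terms, and the temperature value are all assumptions; MPM and SPM are omitted.

```python
import torch
import torch.nn.functional as F

def info_nce(query, key, temperature=0.07):
    """Standard InfoNCE loss: matched (query_i, key_i) pairs are positives,
    all other keys in the batch serve as negatives."""
    query = F.normalize(query, dim=1)
    key = F.normalize(key, dim=1)
    logits = query @ key.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(query.size(0), device=query.device)  # positive on the diagonal
    return F.cross_entropy(logits, targets)

def ccl_step(rgb_encoder, pose_encoder, rgb_clip, rgb_clip_aug, pose_clip, pose_clip_aug):
    """One hypothetical pre-training step combining single-modal and cross-modal
    contrastive terms. Each pair of inputs holds two augmented views of the same
    sign clip in the RGB and pose modalities, respectively."""
    z_rgb, z_rgb2 = rgb_encoder(rgb_clip), rgb_encoder(rgb_clip_aug)
    z_pose, z_pose2 = pose_encoder(pose_clip), pose_encoder(pose_clip_aug)

    # Single-modal instance discrimination within each modality.
    loss_single = info_nce(z_rgb, z_rgb2) + info_nce(z_pose, z_pose2)

    # Cross-modal terms pull the RGB and pose embeddings of the same clip together,
    # gradually aligning the two feature spaces.
    loss_cross = info_nce(z_rgb, z_pose) + info_nce(z_pose, z_rgb)

    return loss_single + loss_cross
```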
Related papers
- Semantic-Aligned Learning with Collaborative Refinement for Unsupervised VI-ReID [82.12123628480371]
Unsupervised person re-identification (USL-VI-ReID) seeks to match pedestrian images of the same individual across different modalities without human annotations for model learning.
Previous methods unify pseudo-labels of cross-modality images through label association algorithms and then design a contrastive learning framework for global feature learning.
We propose a Semantic-Aligned Learning with Collaborative Refinement (SALCR) framework, which builds up objectives for the specific fine-grained patterns emphasized by each modality.
arXiv Detail & Related papers (2025-04-27T13:58:12Z)
- Extended Cross-Modality United Learning for Unsupervised Visible-Infrared Person Re-identification [34.93081601924748]
Unsupervised learning aims to learn modality-invariant features from unlabeled cross-modality datasets.
Existing methods lack cross-modality clustering or excessively pursue cluster-level association.
We propose Extended Cross-Modality United Learning (ECUL) framework, incorporating Extended Modality-Camera Clustering (EMCC) and Two-Step Memory Updating Strategy (TSMem) modules.
arXiv Detail & Related papers (2024-12-26T09:30:26Z)
- AmCLR: Unified Augmented Learning for Cross-Modal Representations [0.0]
We introduce AmCLR and xAmCLR objective functions tailored for bimodal vision-language models.
These advancements yield a more resilient and generalizable contrastive learning process.
arXiv Detail & Related papers (2024-12-10T23:32:36Z)
- Contrastive Learning with Synthetic Positives [11.932323457691945]
Contrastive learning with the nearest neighbor has proved to be one of the most efficient self-supervised learning (SSL) techniques (a minimal nearest-neighbor positive-mining sketch appears after this list).
In this paper, we introduce a novel approach called Contrastive Learning with Synthetic Positives (NCLP).
NCLP utilizes synthetic images, generated by an unconditional diffusion model, as the additional positives to help the model learn from diverse positives.
arXiv Detail & Related papers (2024-08-30T01:47:43Z)
- Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment [2.389598109913754]
We focus on Contrastive Language-Image Pre-training (CLIP), an open-vocabulary foundation model, which achieves high accuracy across many image classification tasks.
There are still domains where zero-shot CLIP performance is far from optimal, such as Remote Sensing (RS) and medical imagery.
We propose a methodology for aligning distinct RS imagery modalities with the visual and textual modalities of CLIP.
arXiv Detail & Related papers (2024-02-15T09:31:07Z)
- SSLCL: An Efficient Model-Agnostic Supervised Contrastive Learning Framework for Emotion Recognition in Conversations [20.856739541819056]
Emotion recognition in conversations (ERC) is a rapidly evolving task within the natural language processing community.
We propose an efficient and model-agnostic SCL framework named Supervised Sample-Label Contrastive Learning with Soft-HGR Maximal Correlation (SSLCL).
We introduce a novel perspective on utilizing label representations by projecting discrete labels into dense embeddings through a shallow multilayer perceptron.
arXiv Detail & Related papers (2023-10-25T14:41:14Z)
- ProCC: Progressive Cross-primitive Compatibility for Open-World Compositional Zero-Shot Learning [29.591615811894265]
Open-World Compositional Zero-shot Learning (OW-CZSL) aims to recognize novel compositions of state and object primitives in images with no priors on the compositional space.
We propose a novel method, termed Progressive Cross-primitive Compatibility (ProCC), to mimic the human learning process for OW-CZSL tasks.
arXiv Detail & Related papers (2022-11-19T10:09:46Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
- 3D Human Action Representation Learning via Cross-View Consistency Pursuit [52.19199260960558]
We propose a Cross-view Contrastive Learning framework for unsupervised 3D skeleton-based action Representation (CrosSCLR).
CrosSCLR consists of both single-view contrastive learning (SkeletonCLR) and cross-view consistent knowledge mining (CVC-KM) modules, integrated in a collaborative learning manner.
arXiv Detail & Related papers (2021-04-29T16:29:41Z)
- Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification [208.1227090864602]
Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem.
Existing VI-ReID methods tend to learn global representations, which have limited discriminability and weak robustness to noisy images.
We propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID.
arXiv Detail & Related papers (2020-07-18T03:08:13Z)
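Several of the papers above, like the SPM component of CCL-SLR, build on the idea of mining extra positives beyond simple augmentations. The sketch below is a minimal, hypothetical illustration of nearest-neighbor positive mining in the spirit of NNCLR, referenced in the Contrastive Learning with Synthetic Positives entry above; the support-queue mechanism, temperature, and loss form are assumptions rather than any specific paper's implementation.

```python
import torch
import torch.nn.functional as F

def nn_positive_contrastive(z_anchor, z_view, support_queue, temperature=0.07):
    """Replace each anchor embedding with its nearest neighbor from a queue of
    past embeddings, then contrast that neighbor against the second augmented
    view of the same image.

    z_anchor, z_view: (B, D) embeddings of two augmented views of a batch.
    support_queue:    (Q, D) embeddings accumulated from previous batches.
    """
    z_anchor = F.normalize(z_anchor, dim=1)
    z_view = F.normalize(z_view, dim=1)
    queue = F.normalize(support_queue, dim=1)

    # Mine the nearest neighbor of each anchor in the support queue; it serves
    # as a semantically similar positive that is not just another augmentation.
    nn_idx = (z_anchor @ queue.t()).argmax(dim=1)       # (B,)
    nn_pos = queue[nn_idx]                              # (B, D)

    # InfoNCE: the mined neighbor of sample i is the positive for view i;
    # the neighbors of the other samples in the batch act as negatives.
    logits = nn_pos @ z_view.t() / temperature          # (B, B)
    targets = torch.arange(z_anchor.size(0), device=z_anchor.device)
    return F.cross_entropy(logits, targets)
```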