Vision-Language Meets the Skeleton: Progressively Distillation with Cross-Modal Knowledge for 3D Action Representation Learning
- URL: http://arxiv.org/abs/2405.20606v1
- Date: Fri, 31 May 2024 03:40:15 GMT
- Title: Vision-Language Meets the Skeleton: Progressively Distillation with Cross-Modal Knowledge for 3D Action Representation Learning
- Authors: Yang Chen, Tian He, Junfeng Fu, Ling Wang, Jingcai Guo, Hong Cheng
- Abstract summary: We introduce a novel skeleton-based training framework based on Cross-modal Contrastive learning.
We establish the vision-language action concept space through vision-language knowledge prompts.
We propose the intra-modal self-similarity and inter-modal cross-consistency softened targets in the cross-modal contrastive process.
- Score: 20.325924702966084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Supervised and self-supervised learning are the two main training paradigms for skeleton-based human action recognition. However, the former's one-hot classification requires labor-intensive annotation of predefined action categories, while the latter involves skeleton transformations (e.g., cropping) in the pretext tasks that may impair the skeleton structure. To address these challenges, we introduce a novel skeleton-based training framework (C$^2$VL) based on Cross-modal Contrastive learning that uses progressive distillation to learn task-agnostic human skeleton action representations from Vision-Language knowledge prompts. Specifically, we establish the vision-language action concept space through vision-language knowledge prompts generated by pre-trained large multimodal models (LMMs), which enrich the fine-grained details that the skeleton action space lacks. Moreover, we propose intra-modal self-similarity and inter-modal cross-consistency softened targets in the cross-modal contrastive process to progressively control and guide the degree to which vision-language knowledge prompts and the corresponding skeletons are pulled closer. These soft instance discrimination and self-knowledge distillation strategies contribute to learning better skeleton-based action representations from the noisy skeleton-vision-language pairs. During the inference phase, our method requires only the skeleton data as input for action recognition and no longer needs the vision-language prompts. Extensive experiments show that our method achieves state-of-the-art results on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets. The code will be made available in the future.
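The abstract names the ingredients of the objective (cross-modal instance discrimination with softened targets) but not its exact form. As a rough, hypothetical PyTorch sketch only: `skel` and `vl` below stand for batch embeddings of skeletons and their vision-language prompts, and the mixing weight `alpha` is a placeholder for whatever progressive schedule the paper actually uses.

```python
import torch
import torch.nn.functional as F

def soft_cross_modal_loss(skel, vl, tau=0.07, alpha=0.5):
    """Illustrative cross-modal contrastive loss with softened targets.

    skel, vl: (N, D) embeddings of skeletons and vision-language prompts.
    alpha:    weight of the hard one-hot target vs. the softened target.
    """
    skel = F.normalize(skel, dim=-1)
    vl = F.normalize(vl, dim=-1)

    # Cross-modal logits: skeleton i against every vision-language prompt.
    logits = skel @ vl.t() / tau                               # (N, N)

    # Intra-modal self-similarity softens the one-hot targets, so skeletons
    # of similar motions are not pushed apart as pure negatives.
    with torch.no_grad():
        self_sim = F.softmax(skel @ skel.t() / tau, dim=-1)    # (N, N)
        hard = torch.eye(skel.size(0), device=skel.device)
        target = alpha * hard + (1.0 - alpha) * self_sim

    log_prob = F.log_softmax(logits, dim=-1)
    return -(target * log_prob).sum(dim=-1).mean()
```

Annealing `alpha` (or the temperature) over training would be one way to realize the progressive control the abstract describes; a fixed value is used here only to keep the sketch short.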
Related papers
- Self-Supervised Skeleton Action Representation Learning: A Benchmark and Beyond [19.074841631219233]
Self-supervised learning (SSL) has been proven effective for label-efficient skeleton-based action understanding.
Recent endeavors in skeleton-based SSL have achieved remarkable progress.
In this paper, we conduct a comprehensive survey on self-supervised skeleton-based action representation learning.
arXiv Detail & Related papers (2024-06-05T06:21:54Z)
- An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition [49.45660055499103]
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training.
Previous research has focused on aligning sequences' visual and semantic spatial distributions.
We introduce a new loss function sampling method to obtain a tight and robust representation.
arXiv Detail & Related papers (2024-06-02T06:53:01Z)
- Fine-Grained Side Information Guided Dual-Prompts for Zero-Shot Skeleton Action Recognition [18.012159340628557]
We propose a novel method (STAR) that uses side information and dual-prompts learning for fine-grained skeleton-based zero-shot action recognition.
Our method achieves state-of-the-art performance in the ZSL and GZSL settings on the benchmark datasets.
arXiv Detail & Related papers (2024-04-11T05:51:06Z)
- Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence [56.092059713922744]
We show that using high-level contextualized features as prediction targets can achieve superior performance.
Specifically, we propose Skeleton2vec, a simple and efficient self-supervised 3D action representation learning framework.
Our proposed Skeleton2vec outperforms previous methods and achieves state-of-the-art results (a rough sketch of the masked-target idea follows this entry).
arXiv Detail & Related papers (2024-01-01T12:08:35Z)
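The Skeleton2vec summary names contextualized prediction targets but not the training recipe. In data2vec-style frameworks this typically means regressing a teacher's contextualized features at masked positions; the sketch below assumes that reading, and `student`/`teacher` are hypothetical encoder callables (the teacher would usually be an EMA copy of the student).

```python
import torch
import torch.nn.functional as F

def masked_target_loss(student, teacher, seq, mask_ratio=0.5):
    """Regress the teacher's contextualized features at masked frames.

    seq: (N, T, D) skeleton sequences; student/teacher map (N, T, D) -> (N, T, H).
    """
    N, T, _ = seq.shape
    mask = torch.rand(N, T, device=seq.device) < mask_ratio   # True = masked

    with torch.no_grad():
        target = teacher(seq)                 # contextualized targets, full view

    corrupted = seq.masked_fill(mask.unsqueeze(-1), 0.0)      # zero out masked frames
    pred = student(corrupted)

    # Only masked positions contribute to the loss.
    return F.mse_loss(pred[mask], target[mask])
```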
- DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks (see the sketch after this entry).
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
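The one mechanism the DiMBERT summary gives is separated attention spaces for vision and language. A minimal, hypothetical rendering with modality-specific attention modules (module names and dimensions are mine, not the paper's):

```python
import torch
import torch.nn as nn

class DisentangledAttention(nn.Module):
    """Separate self-attention spaces for vision and language tokens."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.vis_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lang_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, lang):
        # Each modality attends within its own space, so attention
        # parameters are not shared across modalities.
        vis, _ = self.vis_attn(vis, vis, vis)
        lang, _ = self.lang_attn(lang, lang, lang)
        return vis, lang

# Example: 49 visual region tokens and 16 word tokens per sample.
vis, lang = DisentangledAttention()(torch.randn(2, 49, 768), torch.randn(2, 16, 768))
```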
- Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are available only for the source dataset, not for the target dataset, during the training stage.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks (one is sketched after this entry).
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
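The Cubism-style pretext task just described is concrete enough to sketch: cut a sequence into temporal segments, permute them, and train a small classifier to recover which permutation was applied. The segment count and tensor layout below are illustrative assumptions:

```python
import itertools
import torch

PERMS = list(itertools.permutations(range(3)))  # the 6 orders of 3 segments

def temporal_cubism(seq):
    """Permute the temporal segments of one skeleton sequence.

    seq: (T, J, C) sequence with T divisible by 3. Returns the shuffled
    sequence and the permutation label in [0, 5] that a classifier is
    then trained to predict as the self-supervised task.
    """
    label = torch.randint(len(PERMS), ()).item()
    segments = torch.chunk(seq, 3, dim=0)
    shuffled = torch.cat([segments[i] for i in PERMS[label]], dim=0)
    return shuffled, label
```

The body-part variant would permute along the joint axis instead of the temporal one.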
- SimMC: Simple Masked Contrastive Learning of Skeleton Representations for Unsupervised Person Re-Identification [63.903237777588316]
We present a generic Simple Masked Contrastive learning (SimMC) framework to learn effective representations from unlabeled 3D skeletons for person re-ID.
Specifically, to fully exploit skeleton features within each skeleton sequence, we first devise a masked prototype contrastive learning (MPC) scheme.
Then, we propose masked intra-sequence contrastive learning (MIC) to capture intra-sequence pattern consistency between subsequences (the prototype-contrastive idea is sketched after this entry).
arXiv Detail & Related papers (2022-04-21T00:19:38Z)
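One plausible reading of SimMC's masked prototype contrastive (MPC) scheme, sketched under the assumption that cluster centroids of masked sequence features act as positives; the clustering step itself (e.g., k-means over the features) is left outside the sketch:

```python
import torch
import torch.nn.functional as F

def prototype_contrastive_loss(feats, assignments, prototypes, tau=0.07):
    """Pull each masked-sequence feature toward its cluster prototype.

    feats:       (N, D) features of randomly masked skeleton sequences.
    assignments: (N,) long tensor of cluster indices (e.g., from k-means).
    prototypes:  (K, D) cluster centroids acting as the positives.
    """
    feats = F.normalize(feats, dim=-1)
    prototypes = F.normalize(prototypes, dim=-1)
    logits = feats @ prototypes.t() / tau   # (N, K)
    # InfoNCE with the assigned prototype as the positive class.
    return F.cross_entropy(logits, assignments)
```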
- Joint-bone Fusion Graph Convolutional Network for Semi-supervised Skeleton Action Recognition [65.78703941973183]
We propose a novel correlation-driven joint-bone fusion graph convolutional network (CD-JBF-GCN) as an encoder and use a pose prediction head as a decoder.
Specifically, the CD-JBF-GCN can explore the motion transmission between the joint stream and the bone stream.
The pose prediction based auto-encoder in the self-supervised training stage allows the network to learn motion representation from unlabeled data.
arXiv Detail & Related papers (2022-02-08T16:03:15Z)
- Skeleton-Contrastive 3D Action Representation Learning [35.06361753065124]
This paper strives for self-supervised learning of a feature space suitable for skeleton-based action recognition.
Our approach achieves state-of-the-art performance for self-supervised learning from skeleton data on the challenging PKU and NTU datasets.
arXiv Detail & Related papers (2021-08-08T14:44:59Z)
- MS$^2$L: Multi-Task Self-Supervised Learning for Skeleton Based Action Recognition [36.74293548921099]
We integrate motion prediction, jigsaw puzzle recognition, and contrastive learning to learn skeleton features from different aspects.
Our experiments on the NW-UCLA, NTU RGB+D, and PKUMMD datasets show remarkable performance for action recognition (the multi-task setup is sketched below).
arXiv Detail & Related papers (2020-10-12T11:09:44Z)
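To close the list, the three pretext tasks named in the MS$^2$L summary can share a single encoder with three light heads. All dimensions and module choices below are assumptions made for the sketch, not details from the paper:

```python
import torch
import torch.nn as nn

class MultiTaskSSL(nn.Module):
    """Shared skeleton encoder with three pretext heads (illustrative)."""

    def __init__(self, feat_dim=256, num_joints=25, num_perms=6):
        super().__init__()
        self.encoder = nn.GRU(num_joints * 3, feat_dim, batch_first=True)
        self.motion_head = nn.Linear(feat_dim, num_joints * 3)  # motion prediction
        self.jigsaw_head = nn.Linear(feat_dim, num_perms)       # permutation class
        self.proj_head = nn.Linear(feat_dim, 128)               # contrastive space

    def forward(self, seq):
        # seq: (N, T, num_joints * 3) flattened skeleton sequences.
        hidden, _ = self.encoder(seq)
        last = hidden[:, -1]                       # (N, feat_dim)
        return (self.motion_head(hidden),          # per-step motion prediction
                self.jigsaw_head(last),            # jigsaw permutation logits
                self.proj_head(last))              # features for contrastive loss

# Training would minimize a weighted sum of the three task losses.
```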