SMC-NCA: Semantic-guided Multi-level Contrast for Semi-supervised Temporal Action Segmentation
- URL: http://arxiv.org/abs/2312.12347v3
- Date: Fri, 19 Jul 2024 11:50:33 GMT
- Title: SMC-NCA: Semantic-guided Multi-level Contrast for Semi-supervised Temporal Action Segmentation
- Authors: Feixiang Zhou, Zheheng Jiang, Huiyu Zhou, Xuelong Li,
- Abstract summary: Semi-supervised temporal action segmentation (SS-TA) aims to perform frame-wise classification in long untrimmed videos.
Recent studies have shown the potential of contrastive learning in unsupervised representation learning using unlabelled data.
We propose a novel Semantic-guided Multi-level Contrast scheme with a Neighbourhood-Consistency-Aware unit (SMC-NCA) to extract strong frame-wise representations.
- Score: 53.010417880335424
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semi-supervised temporal action segmentation (SS-TA) aims to perform frame-wise classification in long untrimmed videos, where only a fraction of videos in the training set have labels. Recent studies have shown the potential of contrastive learning in unsupervised representation learning using unlabelled data. However, learning the representation of each frame by unsupervised contrastive learning for action segmentation remains an open and challenging problem. In this paper, we propose a novel Semantic-guided Multi-level Contrast scheme with a Neighbourhood-Consistency-Aware unit (SMC-NCA) to extract strong frame-wise representations for SS-TAS. Specifically, for representation learning, SMC is first used to explore intra- and inter-information variations in a unified and contrastive way, based on action-specific semantic information and temporal information highlighting relations between actions. Then, the NCA module, which is responsible for enforcing spatial consistency between neighbourhoods centered at different frames to alleviate over-segmentation issues, works alongside SMC for semi-supervised learning (SSL). Our SMC outperforms the other state-of-the-art methods on three benchmarks, offering improvements of up to 17.8% and 12.6% in terms of Edit distance and accuracy, respectively. Additionally, the NCA unit results in significantly better segmentation performance in the presence of only 5% labelled videos. We also demonstrate the generalizability and effectiveness of the proposed method on our Parkinson Disease's Mouse Behaviour (PDMB) dataset. Code is available at https://github.com/FeixiangZhou/SMC-NCA.
Related papers
- An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition [49.45660055499103]
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training.
Previous research has focused on aligning sequences' visual and semantic spatial distributions.
We introduce a new loss function sampling method to obtain a tight and robust representation.
arXiv Detail & Related papers (2024-06-02T06:53:01Z) - Multi-Scale Cross Contrastive Learning for Semi-Supervised Medical Image
Segmentation [14.536384387956527]
We develop a novel Multi-Scale Cross Supervised Contrastive Learning framework to segment structures in medical images.
Our approach contrasts multi-scale features based on ground-truth and cross-predicted labels, in order to extract robust feature representations.
It outperforms state-of-the-art semi-supervised methods by more than 3.0% in Dice.
arXiv Detail & Related papers (2023-06-25T16:55:32Z) - Proposal-Based Multiple Instance Learning for Weakly-Supervised Temporal
Action Localization [98.66318678030491]
Weakly-supervised temporal action localization aims to localize and recognize actions in untrimmed videos with only video-level category labels during training.
We propose a novel Proposal-based Multiple Instance Learning (P-MIL) framework that directly classifies the candidate proposals in both the training and testing stages.
arXiv Detail & Related papers (2023-05-29T02:48:04Z) - Fine-grained Temporal Contrastive Learning for Weakly-supervised
Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
arXiv Detail & Related papers (2022-03-31T05:13:50Z) - MuSCLe: A Multi-Strategy Contrastive Learning Framework for Weakly
Supervised Semantic Segmentation [39.858844102571176]
Weakly supervised semantic segmentation (WSSS) relies on weak labels such as image level annotations rather than pixel level annotations required by supervised semantic segmentation (SSS) methods.
We propose a novel Multi-Strategy Contrastive Learning (MuSCLe) framework to obtain enhanced feature representations and improve WSSS performance.
arXiv Detail & Related papers (2022-01-18T14:38:50Z) - Iterative Frame-Level Representation Learning And Classification For
Semi-Supervised Temporal Action Segmentation [25.08516972520265]
Temporal action segmentation classifies the action of each frame in (long) video sequences.
We propose the first semi-supervised method for temporal action segmentation.
arXiv Detail & Related papers (2021-12-02T16:47:24Z) - 3D Human Action Representation Learning via Cross-View Consistency
Pursuit [52.19199260960558]
We propose a Cross-view Contrastive Learning framework for unsupervised 3D skeleton-based action Representation (CrosSCLR)
CrosSCLR consists of both single-view contrastive learning (SkeletonCLR) and cross-view consistent knowledge mining (CVC-KM) modules, integrated in a collaborative learning manner.
arXiv Detail & Related papers (2021-04-29T16:29:41Z) - Learning by Aligning Videos in Time [10.075645944474287]
We present a self-supervised approach for learning video representations using temporal video alignment as a pretext task.
We leverage a novel combination of temporal alignment loss and temporal regularization terms, which can be used as supervision signals for training an encoder network.
arXiv Detail & Related papers (2021-03-31T17:55:52Z) - Spatial-Temporal Multi-Cue Network for Continuous Sign Language
Recognition [141.24314054768922]
We propose a spatial-temporal multi-cue (STMC) network to solve the vision-based sequence learning problem.
To validate the effectiveness, we perform experiments on three large-scale CSLR benchmarks.
arXiv Detail & Related papers (2020-02-08T15:38:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.