C2F-TCN: A Framework for Semi and Fully Supervised Temporal Action
Segmentation
- URL: http://arxiv.org/abs/2212.11078v1
- Date: Tue, 20 Dec 2022 14:53:46 GMT
- Title: C2F-TCN: A Framework for Semi and Fully Supervised Temporal Action
Segmentation
- Authors: Dipika Singhania, Rahul Rahaman, Angela Yao
- Abstract summary: Temporal action segmentation tags action labels for every frame in an input untrimmed video containing multiple actions in a sequence.
We propose an encoder-decoder-style architecture named C2F-TCN featuring a "coarse-to-fine" ensemble of decoder outputs.
We show that the architecture is flexible for both supervised and representation learning.
- Score: 20.182928938110923
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal action segmentation tags action labels for every frame in an input
untrimmed video containing multiple actions in a sequence. For the task of
temporal action segmentation, we propose an encoder-decoder-style architecture
named C2F-TCN featuring a "coarse-to-fine" ensemble of decoder outputs. The
C2F-TCN framework is enhanced with a novel model-agnostic temporal feature
augmentation strategy based on the computationally inexpensive stochastic
max-pooling of segments. The augmented framework produces more accurate and
well-calibrated supervised results on three benchmark action segmentation
datasets. We show that the architecture is flexible for both supervised and
representation learning. In line with this, we present a novel unsupervised way
to learn frame-wise representation from C2F-TCN. Our unsupervised learning
approach hinges on the clustering capabilities of the input features and the
formation of multi-resolution features from the decoder's implicit structure.
Further, we provide the first semi-supervised temporal action segmentation
results by merging representation learning with conventional supervised
learning. Our semi-supervised learning scheme, called
"Iterative-Contrastive-Classify" (ICC), progressively improves in performance
as more labeled data becomes available. With only 40% of videos labeled, ICC
applied to C2F-TCN performs comparably to its fully supervised counterpart.
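To make the "coarse-to-fine" ensemble concrete, the PyTorch-style sketch below upsamples frame-wise logits from decoder stages at several temporal resolutions and averages their predictions. This is a minimal illustration under assumptions: the function name, tensor shapes, and uniform weighting are ours, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def coarse_to_fine_ensemble(decoder_logits, target_len, weights=None):
    """Upsample per-stage frame logits to the input length and average them.

    decoder_logits: list of (batch, num_classes, T_i) tensors, one per
                    decoder stage, ordered coarse (small T_i) to fine.
    target_len:     number of frames in the untrimmed input video.
    """
    if weights is None:
        weights = [1.0 / len(decoder_logits)] * len(decoder_logits)
    ensemble = 0.0
    for w, logits in zip(weights, decoder_logits):
        # Linear interpolation along time brings each resolution to target_len.
        up = F.interpolate(logits, size=target_len, mode="linear",
                           align_corners=False)
        ensemble = ensemble + w * up.softmax(dim=1)
    return ensemble  # (batch, num_classes, target_len) averaged probabilities

# Example: three decoder stages at 1/4, 1/2, and full temporal resolution.
T, C = 1000, 19
stages = [torch.randn(1, C, T // 4), torch.randn(1, C, T // 2),
          torch.randn(1, C, T)]
frame_labels = coarse_to_fine_ensemble(stages, target_len=T).argmax(dim=1)
```

Averaging after the softmax, as done here, also tends to improve calibration, which matches the abstract's claim of well-calibrated supervised results.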
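The abstract describes the augmentation only as stochastic max-pooling of segments, so the sketch below is one plausible reading: sample random contiguous segment boundaries over the frame-feature sequence and max-pool within each segment. The boundary-sampling scheme and segment count are assumptions for illustration.

```python
import torch

def stochastic_segment_maxpool(features, num_segments):
    """Max-pool a (T, D) feature sequence over random contiguous segments.

    Returns a shorter (num_segments, D) sequence; resampling the boundaries
    each epoch yields a different temporal view of the same video.
    """
    T, _ = features.shape
    # Random interior cut points define num_segments contiguous segments.
    cuts = torch.randperm(T - 1)[: num_segments - 1] + 1
    bounds = torch.cat([torch.tensor([0]), cuts.sort().values,
                        torch.tensor([T])])
    pooled = [features[s:e].max(dim=0).values
              for s, e in zip(bounds[:-1].tolist(), bounds[1:].tolist())]
    return torch.stack(pooled)

# Example: augment 2048-d frame features of a 1200-frame video.
feats = torch.randn(1200, 2048)
aug = stochastic_segment_maxpool(feats, num_segments=300)  # (300, 2048)
```

Because the pooling operates on input features rather than on any network internals, the augmentation is model-agnostic and adds negligible compute, consistent with the abstract.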
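The control flow of ICC, as described in the abstract, alternates unsupervised contrastive representation learning on all videos with supervised training on the labeled fraction. The skeleton below illustrates only that loop; the two phase functions are placeholder stubs, not the authors' API.

```python
def contrastive_phase(model, unlabeled_videos):
    # Placeholder: refine frame-wise representations by contrasting
    # cluster-consistent frames across the decoder's multiple resolutions.
    pass

def classify_phase(model, labeled_videos):
    # Placeholder: frame-wise cross-entropy on the labeled subset.
    pass

def icc_train(model, unlabeled_videos, labeled_videos, iterations=4):
    """Alternate representation learning and supervised fine-tuning.

    Each round bootstraps better features for the next contrastive phase,
    which is why performance improves progressively with more labeled data.
    """
    for _ in range(iterations):
        contrastive_phase(model, unlabeled_videos)
        classify_phase(model, labeled_videos)
    return model
```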
Related papers
- SMC-NCA: Semantic-guided Multi-level Contrast for Semi-supervised Temporal Action Segmentation [53.010417880335424]
Semi-supervised temporal action segmentation (SS-TAS) aims to perform frame-wise classification in long untrimmed videos.
Recent studies have shown the potential of contrastive learning in unsupervised representation learning using unlabelled data.
We propose a novel Semantic-guided Multi-level Contrast scheme with a Neighbourhood-Consistency-Aware unit (SMC-NCA) to extract strong frame-wise representations.
arXiv Detail & Related papers (2023-12-19T17:26:44Z)
- PointCMP: Contrastive Mask Prediction for Self-supervised Learning on Point Cloud Videos [58.18707835387484]
We propose a contrastive mask prediction framework for self-supervised learning on point cloud videos.
PointCMP employs a two-branch structure to achieve simultaneous learning of both local and global temporal information.
Our framework achieves state-of-the-art performance on benchmark datasets and outperforms existing fully supervised counterparts.
arXiv Detail & Related papers (2023-05-06T15:47:48Z)
- Solve the Puzzle of Instance Segmentation in Videos: A Weakly Supervised Framework with Spatio-Temporal Collaboration [13.284951215948052]
We present a novel weakly supervised framework with Spatio-Temporal Collaboration for instance Segmentation in videos.
Our method achieves strong performance and even outperforms fully supervised TrackR-CNN and MaskTrack R-CNN.
arXiv Detail & Related papers (2022-12-15T02:44:13Z)
- CenterCLIP: Token Clustering for Efficient Text-Video Retrieval [67.21528544724546]
In CLIP, the essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens due to the redundant nature of consecutive frames in videos.
This significantly increases computation costs and hinders the deployment of video retrieval models in web applications.
In this paper, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones.
arXiv Detail & Related papers (2022-05-02T12:02:09Z)
- Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning [44.412145665354736]
We introduce a novel contrastive action representation learning framework to learn frame-wise action representations.
Inspired by the recent progress of self-supervised learning, we present a novel sequence contrastive loss (SCL) applied on two correlated views.
Our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks.
arXiv Detail & Related papers (2022-03-28T17:59:54Z)
- Iterative Frame-Level Representation Learning And Classification For Semi-Supervised Temporal Action Segmentation [25.08516972520265]
Temporal action segmentation classifies the action of each frame in (long) video sequences.
We propose the first semi-supervised method for temporal action segmentation.
arXiv Detail & Related papers (2021-12-02T16:47:24Z)
- Unsupervised Action Segmentation with Self-supervised Feature Learning and Co-occurrence Parsing [32.66011849112014]
Temporal action segmentation is the task of classifying each frame in a video with an action label.
In this work we explore a self-supervised method that operates on a corpus of unlabeled videos and predicts a likely set of temporal segments across the videos.
We develop CAP, a novel co-occurrence action parsing algorithm that can not only capture the correlation among sub-actions underlying the structure of activities, but also estimate the temporal trajectory of the sub-actions in an accurate and general way.
arXiv Detail & Related papers (2021-05-29T00:29:40Z)
- BiCnet-TKS: Learning Efficient Spatial-Temporal Representation for Video Person Re-Identification [86.73532136686438]
We present an efficient spatial-temporal representation for video person re-identification (reID).
We propose a Bilateral Complementary Network (BiCnet) for spatial complementarity modeling.
BiCnet-TKS outperforms state-of-the-art methods with about 50% less computation.
arXiv Detail & Related papers (2021-04-30T06:44:34Z)
- Temporally-Weighted Hierarchical Clustering for Unsupervised Action Segmentation [96.67525775629444]
Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos.
We present a fully automatic and unsupervised approach for segmenting actions in a video that does not require any training.
Our proposal is an effective temporally-weighted hierarchical clustering algorithm that can group semantically consistent frames of the video.
arXiv Detail & Related papers (2021-03-20T23:30:01Z)
- SeCo: Exploring Sequence Supervision for Unsupervised Representation Learning [114.58986229852489]
In this paper, we explore the basic and generic supervision in the sequence from spatial, sequential and temporal perspectives.
We derive a particular form of contrastive learning, named SeCo.
SeCo shows superior results under the linear protocol on action recognition, untrimmed activity recognition and object tracking.
arXiv Detail & Related papers (2020-08-03T15:51:35Z)