Iterative Frame-Level Representation Learning And Classification For
Semi-Supervised Temporal Action Segmentation
- URL: http://arxiv.org/abs/2112.01402v1
- Date: Thu, 2 Dec 2021 16:47:24 GMT
- Title: Iterative Frame-Level Representation Learning And Classification For
Semi-Supervised Temporal Action Segmentation
- Authors: Dipika Singhania, Rahul Rahaman, Angela Yao
- Abstract summary: Temporal action segmentation classifies the action of each frame in (long) video sequences.
We propose the first semi-supervised method for temporal action segmentation.
- Score: 25.08516972520265
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal action segmentation classifies the action of each frame in (long)
video sequences. Due to the high cost of frame-wise labeling, we propose the
first semi-supervised method for temporal action segmentation. Our method
hinges on unsupervised representation learning, which, for temporal action
segmentation, poses unique challenges. Actions in untrimmed videos vary in
length and have unknown labels and start/end times. Ordering of actions across
videos may also vary. We propose a novel way to learn frame-wise
representations from temporal convolutional networks (TCNs) by clustering input
features with added time-proximity condition and multi-resolution similarity.
By merging representation learning with conventional supervised learning, we
develop an "Iterative-Contrast-Classify (ICC)" semi-supervised learning scheme.
With more labelled data, ICC progressively improves in performance; ICC
semi-supervised learning, with 40% labelled videos, performs similarly to
fully-supervised counterparts. Our ICC improves MoF by {+1.8, +5.6, +2.5}% on
Breakfast, 50Salads and GTEA respectively for 100% labelled videos.
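The clustering step described in the abstract, frame features grouped under an added time-proximity condition, lends itself to a compact illustration. Below is a minimal sketch assuming pre-extracted TCN frame features; the `time_aware_kmeans` function, its distance weighting, and all parameter values are hypothetical simplifications, not the authors' implementation.

```python
import numpy as np

def time_aware_kmeans(feats, n_clusters=8, time_weight=0.5, n_iters=20, seed=0):
    """Cluster frame features with an added time-proximity term.

    feats: (T, D) frame-wise features from a TCN (hypothetical input).
    The distance to a cluster mixes feature distance with the distance
    to the cluster's mean timestamp, so clusters stay temporally compact.
    """
    rng = np.random.default_rng(seed)
    T, _ = feats.shape
    t = np.arange(T, dtype=np.float64) / T          # normalised frame times
    centers = feats[rng.choice(T, n_clusters, replace=False)]
    center_t = rng.random(n_clusters)
    for _ in range(n_iters):
        feat_d = ((feats[:, None, :] - centers[None]) ** 2).sum(-1)  # (T, K)
        time_d = (t[:, None] - center_t[None]) ** 2                  # (T, K)
        assign = (feat_d + time_weight * T * time_d).argmin(1)
        for k in range(n_clusters):
            mask = assign == k
            if mask.any():
                centers[k] = feats[mask].mean(0)
                center_t[k] = t[mask].mean()
    return assign

feats = np.random.randn(500, 64)     # stand-in for TCN frame features
labels = time_aware_kmeans(feats)
print(labels.shape)                  # (500,)
```

In the ICC scheme, pseudo-groupings of this kind would drive a frame-wise contrastive representation loss that alternates with supervised classification on the labelled subset.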
Related papers
- SMC-NCA: Semantic-guided Multi-level Contrast for Semi-supervised Temporal Action Segmentation [53.010417880335424]
Semi-supervised temporal action segmentation (SS-TAS) aims to perform frame-wise classification in long untrimmed videos.
Recent studies have shown the potential of contrastive learning in unsupervised representation learning using unlabelled data.
We propose a novel Semantic-guided Multi-level Contrast scheme with a Neighbourhood-Consistency-Aware unit (SMC-NCA) to extract strong frame-wise representations.
arXiv Detail & Related papers (2023-12-19T17:26:44Z)
- Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation [76.40565872257709]
We develop a unified framework that models cross-frame dense correspondence for locally discriminative feature learning.
It directly learns to perform mask-guided sequential segmentation from unlabeled videos.
Our algorithm sets a new state of the art on two standard benchmarks (i.e., DAVIS17 and YouTube-VOS).
arXiv Detail & Related papers (2023-03-17T16:23:36Z)
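Cross-frame dense correspondence is commonly realised as feature-similarity label transport. The sketch below shows that generic mechanism (an assumption for illustration, not this paper's architecture): a softmax affinity between per-pixel features of consecutive frames carries the previous mask forward.

```python
import numpy as np

def propagate_mask(prev_feats, cur_feats, prev_mask, temperature=0.1):
    """Propagate a segmentation mask via dense feature correspondence.

    prev_feats, cur_feats: (H*W, D) L2-normalised per-pixel features.
    prev_mask: (H*W, C) one-hot (or soft) mask of the previous frame.
    Returns a soft mask for the current frame. A generic sketch of
    correspondence-based label transport, not this paper's exact model.
    """
    affinity = cur_feats @ prev_feats.T / temperature   # (HW, HW)
    affinity -= affinity.max(1, keepdims=True)          # numerical stability
    weights = np.exp(affinity)
    weights /= weights.sum(1, keepdims=True)            # row-wise softmax
    return weights @ prev_mask                          # (HW, C)

hw, d, c = 64, 32, 3
f0 = np.random.randn(hw, d); f0 /= np.linalg.norm(f0, axis=1, keepdims=True)
f1 = np.random.randn(hw, d); f1 /= np.linalg.norm(f1, axis=1, keepdims=True)
m0 = np.eye(c)[np.random.randint(0, c, hw)]            # one-hot mask
print(propagate_mask(f0, f1, m0).shape)                # (64, 3)
```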
- TAEC: Unsupervised Action Segmentation with Temporal-Aware Embedding and Clustering [27.52568444236988]
We propose an unsupervised approach for learning action classes from untrimmed video sequences.
In particular, we propose a temporal embedding network that combines relative time prediction, feature reconstruction, and sequence-to-sequence learning.
Based on the identified clusters, we decode the video into coherent temporal segments that correspond to semantically meaningful action classes.
arXiv Detail & Related papers (2023-03-09T10:46:23Z)
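The final step, decoding cluster assignments into coherent temporal segments, can be illustrated with a toy decoder; the run-merging rule and `min_len` threshold below are assumptions, since the abstract does not detail TAEC's decoding.

```python
import numpy as np

def decode_segments(cluster_ids, min_len=10):
    """Turn noisy per-frame cluster assignments into coherent segments.

    A simple illustrative decoder: absorb runs shorter than `min_len`
    into the preceding segment, re-merge equal neighbours, and emit
    (start, end, cluster) tuples. A stand-in for TAEC's decoding.
    """
    segments, start = [], 0
    for i in range(1, len(cluster_ids) + 1):
        if i == len(cluster_ids) or cluster_ids[i] != cluster_ids[start]:
            segments.append([start, i, int(cluster_ids[start])])
            start = i
    merged = [segments[0]]                      # absorb too-short runs
    for seg in segments[1:]:
        if seg[1] - seg[0] < min_len:
            merged[-1][1] = seg[1]
        else:
            merged.append(seg)
    out = [merged[0]]                           # re-merge equal neighbours
    for seg in merged[1:]:
        if seg[2] == out[-1][2]:
            out[-1][1] = seg[1]
        else:
            out.append(seg)
    return [tuple(s) for s in out]

ids = np.array([0]*50 + [1]*3 + [0]*20 + [2]*40)  # noisy blip of cluster 1
print(decode_segments(ids))                        # [(0, 73, 0), (73, 113, 2)]
```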
- C2F-TCN: A Framework for Semi and Fully Supervised Temporal Action Segmentation [20.182928938110923]
Temporal action segmentation assigns an action label to every frame of an untrimmed input video containing a sequence of multiple actions.
We propose an encoder-decoder-style architecture named C2F-TCN featuring a "coarse-to-fine" ensemble of decoder outputs.
We show that the architecture is flexible for both supervised and representation learning.
arXiv Detail & Related papers (2022-12-20T14:53:46Z)
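One plausible reading of the "coarse-to-fine" ensemble is to upsample each decoder stage's frame-wise outputs to full temporal resolution and average them. The sketch below follows that reading; it is not C2F-TCN's exact formulation.

```python
import numpy as np

def coarse_to_fine_ensemble(stage_logits, out_len):
    """Average decoder outputs produced at different temporal resolutions.

    stage_logits: list of (T_i, C) arrays, one per decoder stage, where
    coarser stages have smaller T_i. Each is nearest-neighbour upsampled
    to `out_len` frames and the results are averaged.
    """
    acc = np.zeros((out_len, stage_logits[0].shape[1]))
    for logits in stage_logits:
        # map each output frame to the nearest (floor) input frame
        idx = np.floor(np.linspace(0, len(logits) - 1e-9, out_len)).astype(int)
        acc += logits[idx]
    return acc / len(stage_logits)

stages = [np.random.randn(t, 5) for t in (32, 64, 128, 256)]
print(coarse_to_fine_ensemble(stages, 256).shape)   # (256, 5)
```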
- CenterCLIP: Token Clustering for Efficient Text-Video Retrieval [67.21528544724546]
In CLIP, the essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens due to the redundant nature of consecutive frames in videos.
This significantly increases computation costs and hinders the deployment of video retrieval models in web applications.
In this paper, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones.
arXiv Detail & Related papers (2022-05-02T12:02:09Z)
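The medoid idea, keeping the real token nearest each cluster centre and dropping the rest, can be sketched with a plain k-means pass; the single-segment simplification below ignores CenterCLIP's multi-segment structure and is purely illustrative.

```python
import numpy as np

def cluster_tokens(tokens, n_keep, n_iters=10, seed=0):
    """Keep the most representative tokens by clustering and taking medoids.

    tokens: (N, D) visual tokens from consecutive frames (stand-in input).
    Runs a small k-means, then returns, for each cluster, the real token
    closest to the centroid, discarding the redundant remainder.
    """
    rng = np.random.default_rng(seed)
    centers = tokens[rng.choice(len(tokens), n_keep, replace=False)]
    for _ in range(n_iters):
        d = ((tokens[:, None] - centers[None]) ** 2).sum(-1)   # (N, K)
        assign = d.argmin(1)
        for k in range(n_keep):
            if (assign == k).any():
                centers[k] = tokens[assign == k].mean(0)
    d = ((tokens[:, None] - centers[None]) ** 2).sum(-1)
    keep = np.unique(d.argmin(0))           # medoid index per cluster
    return tokens[keep]

toks = np.random.randn(196 * 8, 64)         # e.g. 8 frames of 196 patch tokens
print(cluster_tokens(toks, 64).shape)       # (<=64, 64) kept tokens
```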
- Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning [44.412145665354736]
We introduce a novel contrastive framework for learning frame-wise action representations.
Inspired by the recent progress of self-supervised learning, we present a novel sequence contrastive loss (SCL) applied on two correlated views.
Our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks.
arXiv Detail & Related papers (2022-03-28T17:59:54Z)
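A sequence contrastive loss over two correlated views can be sketched by treating temporally close frames as soft positives; the Gaussian soft-target construction and all hyper-parameters below are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def sequence_contrastive_loss(z1, z2, t1, t2, sigma=1.0, temperature=0.1):
    """Contrast frame embeddings of two correlated views of one video.

    z1, z2: (T1, D), (T2, D) frame embeddings of the two views.
    t1, t2: (T1,), (T2,) frame timestamps in the source video.
    Each frame's similarity distribution over the other view is pulled
    toward a Gaussian centred on its own timestamp, so temporally close
    frames act as soft positives.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature                        # (T1, T2)
    target = torch.exp(-(t1[:, None] - t2[None]) ** 2 / (2 * sigma ** 2))
    target = target / target.sum(1, keepdim=True)           # soft labels
    return F.kl_div(F.log_softmax(logits, dim=1), target, reduction="batchmean")

z1, z2 = torch.randn(50, 128), torch.randn(60, 128)
t1, t2 = torch.arange(50.0), torch.arange(5.0, 65.0)        # overlapping crops
print(sequence_contrastive_loss(z1, z2, t1, t2))
```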
- Unsupervised Pre-training for Temporal Action Localization Tasks [76.01985780118422]
We propose a self-supervised pretext task, coined Pseudo Action Localization (PAL), to pre-train feature encoders for temporal action localization tasks (UP-TAL) in an unsupervised manner.
Specifically, we first randomly select temporal regions, each of which contains multiple clips, from one video as pseudo actions and then paste them onto different temporal positions of the other two videos.
The pretext task is to align the features of pasted pseudo action regions from two synthetic videos and maximize the agreement between them.
arXiv Detail & Related papers (2022-03-25T12:13:43Z)
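The cut-and-paste pretext can be sketched as follows; operating on feature arrays rather than raw video clips, and the helper `paste_pseudo_action`, are simplifications for illustration.

```python
import numpy as np

def paste_pseudo_action(action_clip, background, rng):
    """Paste a pseudo action (clip features) into a background video.

    action_clip: (L, D) features of a region cut from a source video.
    background: (T, D) features of another video; the clip overwrites a
    random temporal position, and (start, end) is returned so the pretext
    task can later align the two pasted copies.
    """
    T, L = len(background), len(action_clip)
    start = rng.integers(0, T - L + 1)
    out = background.copy()
    out[start:start + L] = action_clip
    return out, (start, start + L)

rng = np.random.default_rng(0)
action = np.random.randn(20, 64)                 # pseudo action from video A
v1, span1 = paste_pseudo_action(action, np.random.randn(100, 64), rng)
v2, span2 = paste_pseudo_action(action, np.random.randn(100, 64), rng)
# Pretext objective (conceptually): encode v1 and v2, pool the features
# inside span1 and span2, and maximise their agreement.
print(span1, span2)
```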
- Learning to Associate Every Segment for Video Panoptic Segmentation [123.03617367709303]
We learn coarse segment-level matching and fine pixel-level matching together.
We show that our per-frame computation model can achieve new state-of-the-art results on Cityscapes-VPS and VIPER datasets.
arXiv Detail & Related papers (2021-06-17T13:06:24Z)
- Temporally-Weighted Hierarchical Clustering for Unsupervised Action Segmentation [96.67525775629444]
Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos.
We present a fully automatic and unsupervised approach for segmenting actions in a video that does not require any training.
Our proposal is an effective temporally-weighted hierarchical clustering algorithm that can group semantically consistent frames of the video.
arXiv Detail & Related papers (2021-03-20T23:30:01Z)
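One simple way to obtain a temporal bias in hierarchical clustering is to append a scaled frame-time coordinate to each feature before agglomerative (Ward) clustering. This is a hedged approximation of "temporally-weighted", not the paper's algorithm, though it likewise groups semantically consistent frames without any training.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def temporally_weighted_clusters(feats, n_clusters, time_weight=1.0):
    """Group semantically consistent frames with a temporal bias.

    Appends a scaled frame-time coordinate to each feature vector, so
    agglomerative (Ward) clustering prefers merging temporally close
    frames. `time_weight` trades feature similarity against time.
    """
    T = len(feats)
    t = (np.arange(T, dtype=np.float64) / T * time_weight)[:, None]
    aug = np.hstack([feats, t])
    Z = linkage(aug, method="ward")
    return fcluster(Z, t=n_clusters, criterion="maxclust")

feats = np.random.randn(200, 32)
labels = temporally_weighted_clusters(feats, n_clusters=5)
print(labels[:10], labels.max())     # cluster ids in 1..5
```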
- SCT: Set Constrained Temporal Transformer for Set Supervised Action Segmentation [22.887397951846353]
Weakly supervised approaches aim at learning temporal action segmentation from videos that are only weakly labeled.
We propose an approach that can be trained end-to-end on such data.
We evaluate our approach on three datasets where the approach achieves state-of-the-art results.
arXiv Detail & Related papers (2020-03-31T14:51:41Z)
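Set supervision means only the set of actions occurring in a video is known, not their order or boundaries. A generic baseline loss for this signal (illustrative only, not SCT's transformer-based method) max-pools per-frame class scores over time and trains against the binary action set.

```python
import torch
import torch.nn.functional as F

def set_supervision_loss(frame_logits, action_set, n_classes):
    """A generic set-supervised loss (illustrative baseline, not SCT itself).

    frame_logits: (T, C) per-frame class logits from a segmentation model.
    action_set: iterable of class indices known to occur in the video.
    Max-pooling over time yields video-level scores, trained against the
    binary 'which actions occur' labels that set supervision provides.
    """
    video_logits = frame_logits.max(dim=0).values           # (C,)
    target = torch.zeros(n_classes)
    target[list(action_set)] = 1.0
    return F.binary_cross_entropy_with_logits(video_logits, target)

logits = torch.randn(300, 10, requires_grad=True)
loss = set_supervision_loss(logits, action_set=[2, 5, 7], n_classes=10)
loss.backward()
print(float(loss))
```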
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.