Related papers: Temporally-Weighted Hierarchical Clustering for Unsupervised Action Segmentation

Temporally-Weighted Hierarchical Clustering for Unsupervised Action Segmentation

URL: http://arxiv.org/abs/2103.11264v2
Date: Tue, 23 Mar 2021 08:16:53 GMT
Title: Temporally-Weighted Hierarchical Clustering for Unsupervised Action Segmentation
Authors: M. Saquib Sarfraz, Naila Murray, Vivek Sharma, Ali Diba, Luc Van Gool, Rainer Stiefelhagen
Abstract summary: Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos. We present a fully automatic and unsupervised approach for segmenting actions in a video that does not require any training. Our proposal is an effective temporally-weighted hierarchical clustering algorithm that can group semantically consistent frames of the video.
Score: 96.67525775629444
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos and is an important requirement for many video understanding tasks. For this and other video understanding tasks, supervised approaches have achieved encouraging performance but require a high volume of detailed frame-level annotations. We present a fully automatic and unsupervised approach for segmenting actions in a video that does not require any training. Our proposal is an effective temporally-weighted hierarchical clustering algorithm that can group semantically consistent frames of the video. Our main finding is that representing a video with a 1-nearest neighbor graph by taking into account the time progression is sufficient to form semantically and temporally consistent clusters of frames where each cluster may represent some action in the video. Additionally, we establish strong unsupervised baselines for action segmentation and show significant performance improvements over published unsupervised methods on five challenging action segmentation datasets. Our approach also outperforms weakly-supervised methods by large margins on 4 of these datasets. Interestingly, we also achieve better results than many fully-supervised methods that have reported results on these datasets. Our code is available at https://github.com/ssarfraz/FINCH-Clustering/tree/master/TW-FINCH

Related papers

MoSiC: Optimal-Transport Motion Trajectory for Dense Self-Supervised Learning [66.53533434848369]
We propose a motion-guided self-learning framework that learns densely consistent representations.<n>We improve state-of-the-art by 1% to 6% on six image and video datasets and four evaluation benchmarks.
arXiv Detail & Related papers (2025-06-10T11:20:32Z)
Rethinking Video Segmentation with Masked Video Consistency: Did the Model Learn as Intended? [22.191260650245443]
Video segmentation aims at partitioning video sequences into meaningful segments based on objects or regions of interest within frames. Current video segmentation models are often derived from image segmentation techniques, which struggle to cope with small-scale or class-imbalanced video datasets. We propose a training strategy Masked Video Consistency, which enhances spatial and temporal feature aggregation.
arXiv Detail & Related papers (2024-08-20T08:08:32Z)
Self-supervised Object-Centric Learning for Videos [39.02148880719576]
We propose the first fully unsupervised method for segmenting multiple objects in real-world sequences. Our object-centric learning framework spatially binds objects to slots on each frame and then relates these slots across frames. Our method can successfully segment multiple instances of complex and high-variety classes in YouTube videos.
arXiv Detail & Related papers (2023-10-10T18:03:41Z)
Tracking Anything with Decoupled Video Segmentation [87.07258378407289]
We develop a decoupled video segmentation approach (DEVA) It is composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks.
arXiv Detail & Related papers (2023-09-07T17:59:41Z)
Learning Referring Video Object Segmentation from Weak Annotation [78.45828085350936]
Referring video object segmentation (RVOS) is a task that aims to segment the target object in all video frames based on a sentence describing the object. We propose a new annotation scheme that reduces the annotation effort by 8 times, while providing sufficient supervision for RVOS. Our scheme only requires a mask for the frame where the object first appears and bounding boxes for the rest of the frames.
arXiv Detail & Related papers (2023-08-04T06:50:52Z)
TAEC: Unsupervised Action Segmentation with Temporal-Aware Embedding and Clustering [27.52568444236988]
We propose an unsupervised approach for learning action classes from untrimmed video sequences. In particular, we propose a temporal embedding network that combines relative time prediction, feature reconstruction, and sequence-to-sequence learning. Based on the identified clusters, we decode the video into coherent temporal segments that correspond to semantically meaningful action classes.
arXiv Detail & Related papers (2023-03-09T10:46:23Z)
Unsupervised Action Segmentation with Self-supervised Feature Learning and Co-occurrence Parsing [32.66011849112014]
temporal action segmentation is a task to classify each frame in the video with an action label. In this work we explore a self-supervised method that operates on a corpus of unlabeled videos and predicts a likely set of temporal segments across the videos. We develop CAP, a novel co-occurrence action parsing algorithm that can not only capture the correlation among sub-actions underlying the structure of activities, but also estimate the temporal trajectory of the sub-actions in an accurate and general way.
arXiv Detail & Related papers (2021-05-29T00:29:40Z)
Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in Videos [159.02703673838639]
We introduce a method for generating segmentation masks from per-frame bounding box annotations in videos. We use our resulting accurate masks for weakly supervised training of video object segmentation (VOS) networks. The additional data provides substantially better generalization performance leading to state-of-the-art results in both the VOS and more challenging tracking domain.
arXiv Detail & Related papers (2021-01-06T18:56:24Z)
Unsupervised Learning of Video Representations via Dense Trajectory Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos. We first propose to adapt two top performing objectives in this class - instance recognition and local aggregation. We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z)
Hierarchical Attention Network for Action Segmentation [45.19890687786009]
The temporal segmentation of events is an essential task and a precursor for the automatic recognition of human actions in the video. We propose a complete end-to-end supervised learning approach that can better learn relationships between actions over time. We evaluate our system on challenging public benchmark datasets, including MERL Shopping, 50 salads, and Georgia Tech Egocentric datasets.
arXiv Detail & Related papers (2020-05-07T02:39:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.