Online Unsupervised Video Object Segmentation via Contrastive Motion
Clustering
- URL: http://arxiv.org/abs/2306.12048v3
- Date: Wed, 17 Jan 2024 07:47:13 GMT
- Title: Online Unsupervised Video Object Segmentation via Contrastive Motion
Clustering
- Authors: Lin Xi, Weihai Chen, Xingming Wu, Zhong Liu, Zhengguo Li
- Abstract summary: Online unsupervised video object segmentation (UVOS) uses the previous frames as its input to automatically separate the primary object(s) from a streaming video without using any further manual annotation.
A major challenge is that the model has no access to the future and must rely solely on the history, i.e., the segmentation mask is predicted from the current frame as soon as it is captured.
In this work, a novel contrastive motion clustering algorithm with optical flow as its input is proposed for online UVOS by exploiting the common fate principle that visual elements tend to be perceived as a group if they possess the same motion pattern.
- Score: 27.265597448266988
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Online unsupervised video object segmentation (UVOS) uses the previous frames
as its input to automatically separate the primary object(s) from a streaming
video without using any further manual annotation. A major challenge is that
the model has no access to the future and must rely solely on the history,
i.e., the segmentation mask is predicted from the current frame as soon as it
is captured. In this work, a novel contrastive motion clustering algorithm with
optical flow as its input is proposed for online UVOS by exploiting the
common fate principle that visual elements tend to be perceived as a group if
they possess the same motion pattern. We build a simple and effective
auto-encoder to iteratively summarize non-learnable prototypical bases for the
motion pattern, while the bases in turn help learn the representation of the
embedding network. Further, a contrastive learning strategy based on a boundary
prior is developed to improve foreground and background feature discrimination
in the representation learning stage. The proposed algorithm can be optimized
on arbitrary-scale data (i.e., frame, clip, dataset) and performed in an
online fashion. Experiments on $\textit{DAVIS}_{\textit{16}}$, $\textit{FBMS}$,
and $\textit{SegTrackV2}$ datasets show that the accuracy of our method
surpasses the previous state-of-the-art (SoTA) online UVOS method by a margin
of 0.8%, 2.9%, and 1.1%, respectively. Furthermore, by using an online deep
subspace clustering to tackle the motion grouping, our method is able to
achieve higher accuracy at a $3\times$ faster inference time than the SoTA
online UVOS method, making a good trade-off between effectiveness and
efficiency. Our code is available at https://github.com/xilin1991/ClusterNet.
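The grouping step described above (iteratively summarizing non-learnable prototypical bases from motion embeddings) can be sketched with a simple k-means-style procedure over per-pixel flow vectors. This is a minimal illustrative sketch, not the paper's implementation: the identity embedding, farthest-point initialization, and toy flow field are assumptions.

```python
import numpy as np

def summarize_prototypes(embeddings, num_prototypes=2, iters=10):
    """Iteratively summarize non-learnable prototypical bases from
    per-pixel motion embeddings via hard assignment and mean updates."""
    # farthest-point initialization so the prototypes start distinct
    protos = [embeddings[0]]
    for _ in range(num_prototypes - 1):
        d = np.min([np.linalg.norm(embeddings - p, axis=1) for p in protos],
                   axis=0)
        protos.append(embeddings[d.argmax()])
    protos = np.stack(protos)
    for _ in range(iters):
        # assign each embedding to its nearest prototype
        dists = np.linalg.norm(embeddings[:, None] - protos[None], axis=-1)
        labels = dists.argmin(axis=1)
        # re-summarize each prototype as the mean of its group
        for k in range(num_prototypes):
            if (labels == k).any():
                protos[k] = embeddings[labels == k].mean(axis=0)
    return protos, labels

# toy "optical flow" field: background moves right, an object patch moves up
flow = np.zeros((8, 8, 2))
flow[..., 0] = 1.0              # background: horizontal motion
flow[2:5, 2:5] = [0.0, 1.0]     # object patch: vertical motion
emb = flow.reshape(-1, 2)       # identity embedding, for the sketch only

protos, labels = summarize_prototypes(emb)
mask = labels.reshape(8, 8)     # common-fate grouping of the two motions
```

Pixels sharing a motion pattern collapse onto the same prototype, which is the common-fate grouping the abstract refers to; the real method learns the embedding network jointly with these bases.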
Related papers
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
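Distributing features evenly across a limited number of clusters, as this summary describes, is commonly done with Sinkhorn-Knopp normalization. The sketch below is a generic illustration, not SIGMA's actual implementation; the score matrix, temperature `eps`, and iteration count are assumptions.

```python
import numpy as np

def sinkhorn(scores, iters=100, eps=0.1):
    """Turn a (samples x clusters) score matrix into a soft assignment
    where each row sums to 1 and every cluster column carries roughly
    equal total mass (n / k), i.e. a balanced assignment."""
    q = np.exp(scores / eps)
    for _ in range(iters):
        q /= q.sum(axis=0, keepdims=True)  # equalize mass across clusters
        q /= q.sum(axis=1, keepdims=True)  # one unit of mass per sample
    return q

rng = np.random.default_rng(0)
assign = sinkhorn(rng.normal(size=(8, 2)))  # 8 samples, 2 clusters
```

The alternating normalizations converge to the balanced transport plan, which prevents the degenerate solution where every sample lands in one cluster.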
arXiv Detail & Related papers (2024-07-22T08:04:09Z) - Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian attribute recognition (PAR) algorithms are mainly developed based on static images.
We propose to understand human attributes using video frames that can fully use temporal information.
arXiv Detail & Related papers (2024-04-27T14:43:32Z) - Tsanet: Temporal and Scale Alignment for Unsupervised Video Object
Segmentation [21.19216164433897]
Unsupervised Video Object Segmentation (UVOS) refers to the challenging task of segmenting the prominent object in videos without manual guidance.
We propose a novel framework for UVOS that can address the aforementioned limitations of the two approaches.
We present experimental results on public benchmark datasets, DAVIS 2016 and FBMS, which demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2023-03-08T04:59:43Z) - GOCA: Guided Online Cluster Assignment for Self-Supervised Video
Representation Learning [49.69279760597111]
Clustering is a ubiquitous tool in unsupervised learning.
Most of the existing self-supervised representation learning methods typically cluster samples based on visually dominant features.
We propose a principled way to combine two views. Specifically, we propose a novel clustering strategy where we use the initial cluster assignment of each view as prior to guide the final cluster assignment of the other view.
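The prior-guided step described above can be realized generically by fusing one view's initial soft assignment, in log space, into the other view's final assignment. This is an illustrative sketch under assumptions: the function names, toy score matrices, and fusion weight are not from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def guided_assignments(scores_rgb, scores_flow, weight=1.0):
    """Each view's initial soft cluster assignment serves as a log-space
    prior guiding the other view's final assignment."""
    init_rgb = softmax(scores_rgb)
    init_flow = softmax(scores_flow)
    final_rgb = softmax(scores_rgb + weight * np.log(init_flow + 1e-8))
    final_flow = softmax(scores_flow + weight * np.log(init_rgb + 1e-8))
    return final_rgb, final_flow

# toy cluster scores for two samples in each view (RGB and optical flow)
rgb = np.array([[2.0, 0.0], [0.0, 2.0]])
flow = np.array([[1.5, 0.0], [0.0, 1.5]])
p_rgb, p_flow = guided_assignments(rgb, flow)
```

When the two views agree, the cross-view prior sharpens each assignment; when they disagree, it pulls the final assignment away from visually dominant but motion-inconsistent clusters.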
arXiv Detail & Related papers (2022-07-20T19:26:55Z) - Box Supervised Video Segmentation Proposal Network [3.384080569028146]
We propose a box-supervised video object segmentation proposal network, which takes advantage of intrinsic video properties.
The proposed method outperforms the state-of-the-art self-supervised benchmark by 16.4% and 6.9%.
We provide extensive tests and ablations on the datasets, demonstrating the robustness of our method.
arXiv Detail & Related papers (2022-02-14T20:38:28Z) - Self-supervised Video Representation Learning with Cross-Stream
Prototypical Contrasting [2.2530496464901106]
"Video Cross-Stream Prototypical Contrasting" is a novel method which predicts consistent prototype assignments from both RGB and optical flow views.
We obtain state-of-the-art results on nearest neighbour video retrieval and action recognition.
arXiv Detail & Related papers (2021-06-18T13:57:51Z) - Spatiotemporal Graph Neural Network based Mask Reconstruction for Video
Object Segmentation [70.97625552643493]
This paper addresses the task of segmenting class-agnostic objects in semi-supervised setting.
We propose a novel graph neural network (TG-Net) which captures the local contexts by utilizing all proposals.
arXiv Detail & Related papers (2020-12-10T07:57:44Z) - Unsupervised Learning of Video Representations via Dense Trajectory
Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
We first propose to adapt two top performing objectives in this class - instance recognition and local aggregation.
We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z) - Learning Fast and Robust Target Models for Video Object Segmentation [83.3382606349118]
Video object segmentation (VOS) is a highly challenging problem since the initial mask, defining the target object, is only given at test-time.
Most previous approaches fine-tune segmentation networks on the first frame, resulting in impractical frame-rates and risk of overfitting.
We propose a novel VOS architecture consisting of two network components.
arXiv Detail & Related papers (2020-02-27T21:58:06Z) - Directional Deep Embedding and Appearance Learning for Fast Video Object
Segmentation [11.10636117512819]
We propose a directional deep embedding and appearance learning (DEmbed) method, which is free of the online fine-tuning process.
Our method achieves a state-of-the-art VOS performance without using online fine-tuning.
arXiv Detail & Related papers (2020-02-17T01:51:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.