Improving Weakly-supervised Video Instance Segmentation by Leveraging Spatio-temporal Consistency
- URL: http://arxiv.org/abs/2408.16661v2
- Date: Sat, 23 Nov 2024 20:05:05 GMT
- Title: Improving Weakly-supervised Video Instance Segmentation by Leveraging Spatio-temporal Consistency
- Authors: Farnoosh Arefi, Amir M. Mansourian, Shohreh Kasaei
- Abstract summary: We introduce a weakly-supervised method called Eigen-Cluster VIS that achieves competitive accuracy compared to other VIS approaches.
This method is based on two key innovations: a Temporal Eigenvalue Loss (TEL) and a clip-level Quality Cluster Coefficient (QCC).
The code is available at https://github.com/farnooshar/EigenClusterVIS.
- Score: 9.115508086522887
- Abstract: The performance of Video Instance Segmentation (VIS) methods has improved significantly with the advent of transformer networks. However, these networks often face challenges in training due to the high annotation cost. To address this, unsupervised and weakly-supervised methods have been developed to reduce the dependency on annotations. This work introduces a novel weakly-supervised method called Eigen-Cluster VIS that, without requiring any mask annotations, achieves competitive accuracy compared to other VIS approaches. This method is based on two key innovations: a Temporal Eigenvalue Loss (TEL) and a clip-level Quality Cluster Coefficient (QCC). The TEL ensures temporal coherence by leveraging the eigenvalues of the Laplacian matrix derived from graph adjacency matrices. By minimizing the mean absolute error between the eigenvalues of adjacent frames, this loss function promotes smooth transitions and stable segmentation boundaries over time, reducing temporal discontinuities and improving overall segmentation quality. The QCC employs the K-means method to ensure the quality of spatio-temporal clusters without relying on ground-truth masks. Using the Davies-Bouldin score, the QCC provides an unsupervised measure of feature discrimination, allowing the model to self-evaluate and adapt to varying object distributions, enhancing robustness during the testing phase. These enhancements are computationally efficient and straightforward, offering significant performance gains without additional annotated data. The proposed Eigen-Cluster VIS method is evaluated on the YouTube-Video Instance Segmentation (YouTube-VIS) 2019/2021 and Occluded Video Instance Segmentation (OVIS) datasets, demonstrating that it effectively narrows the performance gap between fully-supervised and weakly-supervised VIS approaches. The code is available at https://github.com/farnooshar/EigenClusterVIS.
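As a concrete illustration of the abstract above, a minimal sketch of the two quantities follows. This is not the authors' released code: the affinity construction, the number of eigenvalues kept, the cluster count, and the inversion of the Davies-Bouldin score are all illustrative assumptions.

```python
# Minimal sketch of TEL and QCC as described in the abstract; not the
# authors' implementation. Affinity construction, k, cluster count and
# the Davies-Bouldin inversion are assumptions for illustration only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def laplacian_eigenvalues(affinity, k=8):
    """Smallest k eigenvalues of the graph Laplacian L = D - A built
    from one frame's patch-affinity (adjacency) matrix A."""
    laplacian = np.diag(affinity.sum(axis=1)) - affinity
    return np.linalg.eigvalsh(laplacian)[:k]  # eigvalsh sorts ascending

def temporal_eigenvalue_loss(affinities, k=8):
    """Mean absolute error between the Laplacian spectra of adjacent
    frames; small values mean temporally stable segment structure."""
    spectra = [laplacian_eigenvalues(a, k) for a in affinities]
    return float(np.mean([np.abs(s2 - s1).mean()
                          for s1, s2 in zip(spectra, spectra[1:])]))

def quality_cluster_coefficient(clip_features, n_clusters=4):
    """Clip-level quality score without ground-truth masks: K-means on
    spatio-temporal features, rated by the Davies-Bouldin score (lower
    DB = better-separated clusters, hence the inversion here)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(clip_features)
    return 1.0 / (davies_bouldin_score(clip_features, labels) + 1e-6)

# Toy clip: 5 frames of random symmetric 36x36 patch affinities.
rng = np.random.default_rng(0)
frames = [(a + a.T) / 2 for a in rng.random((5, 36, 36))]
print("TEL:", temporal_eigenvalue_loss(frames))
print("QCC:", quality_cluster_coefficient(rng.random((180, 16))))
```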
Related papers
- Self-Supervised Graph Embedding Clustering [70.36328717683297]
The K-means one-step dimensionality reduction clustering method has made some progress in addressing the curse of dimensionality in clustering tasks.
We propose a unified framework that integrates manifold learning with K-means, resulting in a self-supervised graph embedding framework.
arXiv Detail & Related papers (2024-09-24T08:59:51Z)
- Temporally Consistent Referring Video Object Segmentation with Hybrid Memory [98.80249255577304]
We propose an end-to-end R-VOS paradigm that explicitly models temporal consistency alongside the referring segmentation.
Features of frames with automatically generated high-quality reference masks are propagated to segment remaining frames.
Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin.
arXiv Detail & Related papers (2024-03-28T13:32:49Z)
- Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation [76.68301884987348]
We propose a simple yet effective approach for self-supervised video object segmentation (VOS).
Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal segmentation correspondences in videos.
Our method demonstrates state-of-the-art performance across multiple unsupervised VOS benchmarks and excels in complex real-world multi-object video segmentation tasks.
arXiv Detail & Related papers (2023-11-29T18:47:17Z)
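As a rough illustration of the entry above, patch features from a frozen DINO-pretrained ViT can be matched across frames via cosine affinities. The feature dimensions and the hard nearest-neighbour rule below are stand-ins, not the paper's method.

```python
# Illustrative only: cross-frame patch matching via cosine affinities of
# DINO-style ViT features (random tensors stand in for real features).
import torch
import torch.nn.functional as F

def match_patches(feats_a, feats_b):
    """feats_*: (num_patches, dim) patch embeddings of two frames.
    Returns, for each frame-A patch, the index of its nearest
    neighbour in frame B under cosine similarity."""
    affinity = F.normalize(feats_a, dim=-1) @ F.normalize(feats_b, dim=-1).T
    return affinity.argmax(dim=-1)

fa, fb = torch.randn(196, 384), torch.randn(196, 384)
print(match_patches(fa, fb).shape)  # torch.Size([196])
```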
- Skip-Attention: Improving Vision Transformers by Paying Less Attention [55.47058516775423]
Vision transformers (ViTs) perform expensive self-attention computation in every layer.
We propose SkipAt, a method to reuse self-attention from preceding layers to approximate attention at one or more subsequent layers.
We show the effectiveness of our method in image classification and self-supervised learning on ImageNet-1K, semantic segmentation on ADE20K, image denoising on SIDD, and video denoising on DAVIS.
arXiv Detail & Related papers (2023-01-05T18:59:52Z)
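The reuse idea in the Skip-Attention entry above can be caricatured as follows. Note this is a toy module: the actual SkipAt transforms the cached attention with a learned parametric function, which is omitted here.

```python
# Illustrative sketch only: reuse the previous layer's attention map
# instead of recomputing it (SkipAt proper applies a learned transform
# to the cached map; this toy version reuses it as-is).
import torch
import torch.nn as nn

class ReusableAttention(nn.Module):
    def __init__(self, dim, heads=6):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, cached_attn=None):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.heads, C // self.heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each (B, H, N, d)
        if cached_attn is None:                        # pay attention...
            attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        else:                                          # ...or skip it
            attn = cached_attn
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out), attn

# Toy usage: layer 2 reuses layer 1's attention map.
x = torch.randn(2, 197, 384)
layer1, layer2 = ReusableAttention(384), ReusableAttention(384)
y1, attn = layer1(x)
y2, _ = layer2(y1, cached_attn=attn)  # no QK product or softmax here
```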
- Video Mask Transfiner for High-Quality Video Instance Segmentation [102.50936366583106]
Video Mask Transfiner (VMT) is capable of leveraging fine-grained high-resolution features thanks to a highly efficient video transformer structure.
Based on our VMT architecture, we design an automated annotation refinement approach by iterative training and self-correction.
We compare VMT with the most recent state-of-the-art methods on HQ-YTVIS, as well as the YouTube-VIS, OVIS, and BDD100K MOTS benchmarks.
arXiv Detail & Related papers (2022-07-28T11:13:37Z)
- On the Eigenvalues of Global Covariance Pooling for Fine-grained Visual Recognition [65.67315418971688]
We show that truncating the small eigenvalues of Global Covariance Pooling (GCP) can attain smoother gradients.
On fine-grained datasets, however, truncating the small eigenvalues makes the model fail to converge.
Inspired by this observation, we propose a network branch dedicated to magnifying the importance of small eigenvalues.
arXiv Detail & Related papers (2022-05-26T11:41:36Z)
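A minimal sketch of the eigenvalue truncation discussed in the entry above, assuming a plain covariance-pooling layer; the 90% spectrum-keep ratio is an arbitrary illustrative choice.

```python
# Sketch (illustrative, not the paper's code): global covariance pooling
# with the smallest eigenvalues truncated, as discussed above.
import numpy as np

def truncated_gcp(features, keep=0.9):
    """features: (num_locations, channels). Pool to a covariance matrix,
    then zero out the smallest eigenvalues, keeping the top `keep`
    fraction of the spectrum (the truncation the entry refers to)."""
    x = features - features.mean(axis=0, keepdims=True)
    cov = x.T @ x / (x.shape[0] - 1)
    vals, vecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    cut = int((1 - keep) * len(vals))
    vals[:cut] = 0.0                   # truncate the small eigenvalues
    return vecs @ np.diag(vals) @ vecs.T

print(truncated_gcp(np.random.randn(49, 64)).shape)  # (64, 64)
```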
- Temporal Transductive Inference for Few-Shot Video Object Segmentation [27.140141181513425]
Few-shot video object segmentation (FS-VOS) aims at segmenting video frames using a few labelled examples of classes not seen during initial training.
Key to our approach is the use of both global and local temporal constraints.
Empirically, our model outperforms state-of-the-art meta-learning approaches in terms of mean intersection over union on YouTube-VIS by 2.8%.
arXiv Detail & Related papers (2022-03-27T14:08:30Z)
- Bayesian Nonparametric Submodular Video Partition for Robust Anomaly Detection [9.145168943972067]
Multiple-instance learning (MIL) provides an effective way to tackle the video anomaly detection problem.
We propose a novel Bayesian non-parametric submodular video partition (BN-SVP) method to significantly improve MIL model training.
Our theoretical analysis ensures a strong performance guarantee of the proposed algorithm.
arXiv Detail & Related papers (2022-03-24T04:00:49Z)
- Improved Dual Correlation Reduction Network [40.792587861237166]
We propose a novel deep graph clustering algorithm termed Improved Dual Correlation Reduction Network (IDCRN).
By approximating the cross-view feature correlation matrix to an identity matrix, we reduce the redundancy between different dimensions of features.
We also avoid the collapsed representation caused by the over-smoothing issue in Graph Convolutional Networks (GCNs) through an introduced propagation regularization term.
arXiv Detail & Related papers (2022-02-25T07:48:32Z)
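The "approximate the cross-view feature correlation matrix to an identity matrix" idea in the IDCRN entry above reduces to a Barlow-Twins-style objective. The sketch below is that generic form, not IDCRN's exact loss.

```python
# Hedged sketch: push the cross-view correlation matrix toward identity,
# aligning views (diagonal -> 1) while reducing redundancy between
# feature dimensions (off-diagonal -> 0). Not IDCRN's actual code.
import torch

def cross_view_decorrelation_loss(z1, z2):
    """z1, z2: (batch, dim) embeddings of two views of the same graph."""
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)   # per-dimension whitening
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    corr = z1.T @ z2 / z1.shape[0]                # (dim, dim) correlation
    identity = torch.eye(corr.shape[0])
    return ((corr - identity) ** 2).sum()

print(cross_view_decorrelation_loss(torch.randn(32, 16), torch.randn(32, 16)))
```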
- Exploring the Semi-supervised Video Object Segmentation Problem from a Cyclic Perspective [36.4057004419079]
In this paper, we place the semi-supervised video object segmentation problem into a cyclic workflow.
We show that a cyclic mechanism incorporated into the standard sequential flow can produce more consistent representations for pixel-wise correspondence.
We also develop a cycle effective receptive field (cycle-ERF) based on a gradient correction process to provide a new perspective for analyzing object-specific regions of interest.
arXiv Detail & Related papers (2021-11-02T01:50:23Z)
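The cyclic idea in the last entry, propagating a mask forward through the clip and back before comparing it with the reference, can be illustrated with a placeholder propagator; the integer-shift "tracker" below is purely a stand-in for a real segmentation model.

```python
# Toy sketch of a cyclic consistency check: propagate a mask forward
# through the clip and back, then score how well it returns to the
# reference. The propagation step is a placeholder, not a real model.
import numpy as np

def propagate(mask, flow):
    """Stand-in for mask propagation: shift by an integer (dy, dx)."""
    return np.roll(mask, shift=flow, axis=(0, 1))

def cycle_consistency(ref_mask, flows):
    m = ref_mask
    for f in flows:                        # forward pass through the clip
        m = propagate(m, f)
    for f in reversed(flows):              # backward pass to frame one
        m = propagate(m, (-f[0], -f[1]))
    inter = np.logical_and(m, ref_mask).sum()
    union = np.logical_or(m, ref_mask).sum()
    return inter / max(union, 1)           # IoU of returned vs. reference

mask = np.zeros((64, 64), bool); mask[20:40, 20:40] = True
print(cycle_consistency(mask, [(2, 1), (1, -3), (0, 2)]))  # 1.0 here
```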
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.