Learning Temporal Distribution and Spatial Correlation Towards Universal
Moving Object Segmentation
- URL: http://arxiv.org/abs/2304.09949v4
- Date: Fri, 8 Mar 2024 00:00:10 GMT
- Title: Learning Temporal Distribution and Spatial Correlation Towards Universal
Moving Object Segmentation
- Authors: Guanfang Dong, Chenqiu Zhao, Xichen Pan, Anup Basu
- Abstract summary: We propose a method called Learning Temporal Distribution and Spatial Correlation (LTS) that has the potential to be a general solution for universal moving object segmentation.
In the proposed approach, the distribution from temporal pixels is first learned by our Defect Iterative Distribution Learning (DIDL) network for scene-independent segmentation.
The proposed approach performs well for almost all videos from diverse and complex natural scenes with fixed parameters.
- Score: 8.807766029291901
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The goal of moving object segmentation is to separate moving objects from
stationary backgrounds in videos. One major challenge in this problem is how to
develop a universal model for videos from various natural scenes since previous
methods are often effective only in specific scenes. In this paper, we propose
a method called Learning Temporal Distribution and Spatial Correlation (LTS)
that has the potential to be a general solution for universal moving object
segmentation. In the proposed approach, the distribution from temporal pixels
is first learned by our Defect Iterative Distribution Learning (DIDL) network
for a scene-independent segmentation. Notably, the DIDL network incorporates
the use of an improved product distribution layer that we have newly derived.
Then, the Stochastic Bayesian Refinement (SBR) Network, which learns the
spatial correlation, is proposed to improve the binary mask generated by the
DIDL network. Benefiting from the scene independence of the temporal
distribution and the accuracy improvement resulting from the spatial
correlation, the proposed approach performs well for almost all videos from
diverse and complex natural scenes with fixed parameters. Comprehensive
experiments on standard datasets including LASIESTA, CDNet2014, BMC, SBMI2015
and 128 real-world videos demonstrate the superiority of the proposed approach
compared to state-of-the-art methods with or without the use of deep learning
networks. To the best of our knowledge, this work has high potential to be a
general solution for moving object segmentation in real world environments. The
code and real-world videos can be found on GitHub
https://github.com/guanfangdong/LTS-UniverisalMOS.
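The abstract outlines a two-stage design: a per-pixel temporal network (DIDL) first produces a scene-independent coarse foreground mask from each pixel's intensity history, and a spatial network (SBR) then refines that mask using neighbourhood context. The PyTorch sketch below only illustrates this wiring under assumed shapes; the placeholder modules, history length, and layer sizes are illustrative and do not reproduce the paper's product distribution layer or Stochastic Bayesian Refinement.

```python
# Minimal two-stage sketch: temporal per-pixel classification, then spatial refinement.
# All module internals are placeholder assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class TemporalPixelNet(nn.Module):           # stand-in for the DIDL network
    def __init__(self, history_len=50):
        super().__init__()
        self.mlp = nn.Sequential(            # per-pixel MLP over the temporal history
            nn.Linear(history_len, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, history):              # history: (B, T, H, W) past pixel intensities
        b, t, h, w = history.shape
        x = history.permute(0, 2, 3, 1).reshape(-1, t)
        return self.mlp(x).view(b, 1, h, w)  # coarse foreground probability per pixel

class SpatialRefineNet(nn.Module):           # stand-in for the SBR network
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(           # current frame + coarse mask -> refined mask
            nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, frame, coarse_mask):   # frame: (B, 3, H, W) RGB
        return self.conv(torch.cat([frame, coarse_mask], dim=1))

# usage sketch with random data
temporal_net, spatial_net = TemporalPixelNet(), SpatialRefineNet()
history = torch.rand(1, 50, 64, 64)          # grayscale history of 50 frames
frame = torch.rand(1, 3, 64, 64)             # current RGB frame
mask = spatial_net(frame, temporal_net(history)) > 0.5   # binary moving-object mask
```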
Related papers
- ReferEverything: Towards Segmenting Everything We Can Speak of in Videos [42.88584315033116]
We present REM, a framework for segmenting concepts in video that can be described through natural language.
Our method capitalizes on visual representations learned by video diffusion models on Internet-scale datasets.
arXiv Detail & Related papers (2024-10-30T17:59:26Z)
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
- Global Motion Understanding in Large-Scale Video Object Segmentation [0.499320937849508]
We show that transferring knowledge from other domains of video understanding, combined with large-scale learning, can improve the robustness of Video Object Segmentation (VOS) under complex circumstances.
Namely, we focus on integrating scene global motion knowledge to improve large-scale semi-supervised Video Object Segmentation.
We present WarpFormer, an architecture for semi-supervised Video Object Segmentation that exploits existing knowledge in motion understanding to conduct smoother propagation and more accurate matching.
arXiv Detail & Related papers (2024-05-11T15:09:22Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Panoptic Out-of-Distribution Segmentation [11.388678390784195]
We propose Panoptic Out-of-Distribution Segmentation for joint pixel-level semantic in-distribution and out-of-distribution classification with instance prediction.
We make the dataset, code, and trained models publicly available at http://pods.cs.uni-freiburg.de.
arXiv Detail & Related papers (2023-10-18T08:38:31Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks [76.35271072704384]
Deep learning models perform poorly when applied to videos with rare scenes or objects.
We tackle this problem from two different angles: algorithm and dataset.
We show that the debiased representation can generalize better when transferred to other datasets and tasks.
arXiv Detail & Related papers (2022-09-20T00:30:35Z)
- Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene Segmentation [58.74791043631219]
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, including EndoVis18 Challenge and CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z)
- Group Contextualization for Video Recognition [80.3842253625557]
Group contextualization (GC) can boost the performance of 2D-CNN (e.g., TSN) and TSM.
GC embeds feature with four different kinds of contexts in parallel.
Group contextualization can boost the performance of 2D-CNN (e.g., TSN) to a level comparable to state-of-the-art video networks.
arXiv Detail & Related papers (2022-03-18T01:49:40Z)
- Unsupervised Learning Consensus Model for Dynamic Texture Videos Segmentation [12.462608802359936]
We present an effective unsupervised learning consensus model (ULCM) for the segmentation of dynamic texture videos.
In the proposed model, the values of the requantized local binary pattern (LBP) histogram around the pixel to be classified are used as features (a minimal sketch of such a feature appears after this list).
Experiments conducted on the challenging SynthDB dataset show that ULCM is significantly faster, easier to code, simple, and has a limited number of parameters.
arXiv Detail & Related papers (2020-06-29T16:40:59Z)
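As an illustration of the ULCM entry above, here is a minimal sketch of a requantized LBP histogram computed in a small neighbourhood of a pixel and used as that pixel's feature vector. The neighbourhood size, bin count, and requantization scheme are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def lbp_code(patch):
    """8-neighbour local binary pattern code for the centre pixel of a
    3x3 grayscale patch: each neighbour >= centre contributes one bit."""
    center = patch[1, 1]
    neighbours = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                  patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    return sum(int(n >= center) << i for i, n in enumerate(neighbours))

def lbp_histogram_feature(image, y, x, window=7, bins=16):
    """Requantized LBP histogram in a (window x window) neighbourhood of
    pixel (y, x); the 256 raw LBP codes are folded into `bins` coarse bins."""
    hist = np.zeros(bins)
    r = window // 2
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            yy, xx = y + dy, x + dx
            if 1 <= yy < image.shape[0] - 1 and 1 <= xx < image.shape[1] - 1:
                code = lbp_code(image[yy - 1:yy + 2, xx - 1:xx + 2])
                hist[code * bins // 256] += 1
    return hist / max(hist.sum(), 1)   # normalised histogram = per-pixel feature vector

# usage sketch on a random grayscale frame
frame = (np.random.rand(64, 64) * 255).astype(np.uint8)
feature = lbp_histogram_feature(frame, 32, 32)
```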