Improving Semantic Segmentation through Spatio-Temporal Consistency Learned from Videos
- URL: http://arxiv.org/abs/2004.05324v2
- Date: Wed, 20 May 2020 23:55:36 GMT
- Title: Improving Semantic Segmentation through Spatio-Temporal Consistency Learned from Videos
- Authors: Ankita Pasad, Ariel Gordon, Tsung-Yi Lin, Anelia Angelova
- Abstract summary: We leverage unsupervised learning of depth, egomotion, and camera intrinsics to improve single-image semantic segmentation.
The predicted depth, egomotion, and camera intrinsics are used to provide an additional supervision signal to the segmentation model.
- Score: 39.25927216187176
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We leverage unsupervised learning of depth, egomotion, and camera intrinsics
to improve the performance of single-image semantic segmentation, by enforcing
3D-geometric and temporal consistency of segmentation masks across video
frames. The predicted depth, egomotion, and camera intrinsics are used to
provide an additional supervision signal to the segmentation model,
significantly enhancing its quality, or, alternatively, reducing the number of
labels the segmentation model needs. Our experiments were performed on the
ScanNet dataset.
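The abstract describes the mechanism only at a high level. As a loose illustration, a minimal PyTorch sketch of the underlying idea follows: back-project frame t's pixels with the predicted depth, move them with the predicted egomotion, project them into a neighboring frame s with the predicted intrinsics, and penalize disagreement between the two frames' segmentation predictions. The function name, tensor layout, and the cross-entropy form of the penalty are our assumptions, not the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def warp_consistency_loss(seg_logits_t, seg_logits_s, depth_t, K, R, t_vec):
    """Penalize disagreement between frame t's segmentation and frame s's
    segmentation warped into frame t via predicted depth and egomotion.

    seg_logits_t, seg_logits_s: (B, C, H, W) segmentation logits
    depth_t: (B, 1, H, W) predicted depth for frame t
    K: (B, 3, 3) predicted camera intrinsics
    R, t_vec: (B, 3, 3) rotation and (B, 3) translation from frame t to s
    """
    B, C, H, W = seg_logits_t.shape
    device = seg_logits_t.device

    # Pixel grid in homogeneous coordinates: (3, H*W).
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)

    # Back-project to 3D with the predicted depth, apply rigid motion t -> s.
    cam = torch.linalg.inv(K) @ pix.unsqueeze(0)        # (B, 3, HW) rays
    cam = cam * depth_t.reshape(B, 1, -1)               # scale by depth
    cam_s = R @ cam + t_vec.unsqueeze(-1)               # (B, 3, HW)

    # Project into frame s and normalize to grid_sample's [-1, 1] range.
    proj = K @ cam_s
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).reshape(B, H, W, 2)

    # Sample frame s's class probabilities at the warped locations and use
    # them as soft targets for frame t's prediction (cross-entropy form).
    probs_s = F.grid_sample(seg_logits_s.softmax(dim=1), grid,
                            align_corners=True, padding_mode="border")
    log_probs_t = F.log_softmax(seg_logits_t, dim=1)
    return -(probs_s * log_probs_t).sum(dim=1).mean()
```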
Related papers
- Rethinking Video Segmentation with Masked Video Consistency: Did the Model Learn as Intended? [22.191260650245443]
Video segmentation aims at partitioning video sequences into meaningful segments based on objects or regions of interest within frames.
Current video segmentation models are often derived from image segmentation techniques, which struggle to cope with small-scale or class-imbalanced video datasets.
We propose a training strategy, Masked Video Consistency, which enhances spatial and temporal feature aggregation.
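The summary gives only the strategy's name, so the following is a hypothetical sketch of a masked-consistency objective of the general kind: the model's prediction on the full frame serves as a pseudo-label for its prediction on a patch-masked copy. The masking scheme, hyperparameters, and KL penalty are our assumptions, not the paper's method.

```python
import torch
import torch.nn.functional as F

def masked_consistency_loss(model, frame, patch=32, mask_ratio=0.5):
    """Ask the model to predict, from a patch-masked frame, the same
    segmentation it produces for the unmasked frame (hypothetical sketch).

    frame: (B, 3, H, W); model returns (B, C, H, W) logits.
    """
    B, _, H, W = frame.shape
    gh, gw = H // patch, W // patch

    # Bernoulli keep-mask over a coarse patch grid, upsampled to pixel size.
    keep = (torch.rand(B, 1, gh, gw, device=frame.device) > mask_ratio).float()
    keep = F.interpolate(keep, size=(H, W), mode="nearest")

    with torch.no_grad():
        target = model(frame).softmax(dim=1)   # pseudo-label from the full view
    logits_masked = model(frame * keep)        # prediction from the masked view

    # Consistency: the masked-view prediction should match the full-view one.
    return F.kl_div(F.log_softmax(logits_masked, dim=1), target,
                    reduction="batchmean")
```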
arXiv Detail & Related papers (2024-08-20T08:08:32Z)
- 3D-Aware Instance Segmentation and Tracking in Egocentric Videos [107.10661490652822]
Egocentric videos present unique challenges for 3D scene understanding.
This paper introduces a novel approach to instance segmentation and tracking in first-person video.
By incorporating spatial and temporal cues, we achieve superior performance compared to state-of-the-art 2D approaches.
arXiv Detail & Related papers (2024-08-19T10:08:25Z)
- MCDS-VSS: Moving Camera Dynamic Scene Video Semantic Segmentation by Filtering with Self-Supervised Geometry and Motion [17.50161162624179]
Self-driving cars rely on reliable semantic environment perception for decision making.
We propose MCDS-VSS, a structured filter model that learns in a self-supervised manner to estimate scene geometry and ego-motion of the camera.
Our model parses automotive scenes into multiple interpretable representations such as scene geometry, ego-motion, and object motion.
arXiv Detail & Related papers (2024-05-30T10:33:14Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- MonoDVPS: A Self-Supervised Monocular Depth Estimation Approach to Depth-aware Video Panoptic Segmentation [3.2489082010225494]
We propose a novel solution with a multi-task network that performs monocular depth estimation and video panoptic segmentation.
We introduce panoptic-guided depth losses and a novel panoptic masking scheme for moving objects to avoid corrupting the training signal.
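One plausible reading of the panoptic masking scheme, sketched below under our own assumptions: pixels whose panoptic class marks them as potentially moving are excluded from the self-supervised photometric loss, so their independent motion cannot corrupt the depth training signal. The class ids and the L1 loss form are illustrative, not the paper's exact design.

```python
import torch

MOVING_CLASSES = {11, 12, 13}   # hypothetical ids, e.g. person / rider / car

def masked_photometric_loss(warped, target, semantic_ids):
    """L1 photometric loss over static pixels only (illustrative sketch).

    warped, target: (B, 3, H, W) reprojected and observed frames
    semantic_ids:   (B, H, W) class ids from the panoptic head
    """
    static = torch.ones_like(semantic_ids, dtype=torch.bool)
    for c in MOVING_CLASSES:
        static &= semantic_ids != c
    static = static.unsqueeze(1).float()       # (B, 1, H, W) keep-mask

    err = (warped - target).abs() * static
    return err.sum() / (3.0 * static.sum()).clamp(min=1.0)
```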
arXiv Detail & Related papers (2022-10-14T07:00:42Z)
- Temporally stable video segmentation without video annotations [6.184270985214255]
We introduce a method to adapt still image segmentation models to video in an unsupervised manner.
We verify that the consistency measure is well correlated with human judgement via a user study.
We observe improvements in the generated segmented videos with minimal loss of accuracy.
arXiv Detail & Related papers (2021-10-17T18:59:11Z)
- Three Ways to Improve Semantic Segmentation with Self-Supervised Depth Estimation [90.87105131054419]
We present a framework for semi-supervised semantic segmentation, which is enhanced by self-supervised monocular depth estimation from unlabeled image sequences.
We validate the proposed model on the Cityscapes dataset, where all three modules demonstrate significant performance gains.
arXiv Detail & Related papers (2020-12-19T21:18:03Z)
- Beyond Single Stage Encoder-Decoder Networks: Deep Decoders for Semantic Image Segmentation [56.44853893149365]
Single-stage encoder-decoder methodologies for semantic segmentation are reaching their peak in segmentation quality and efficiency per layer.
We propose a new architecture based on a decoder that uses a set of shallow networks to capture more information.
In order to further improve the architecture, we introduce a weight function that re-balances classes, increasing the networks' attention to under-represented objects.
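A common concrete instance of such a re-balancing weight function, shown here as an illustrative sketch rather than the paper's exact formulation, is inverse-frequency weighting of the cross-entropy loss:

```python
import torch
import torch.nn.functional as F

def class_balance_weights(labels, num_classes, ignore_index=255):
    """Inverse-frequency class weights computed from a batch of label maps."""
    valid = labels[labels != ignore_index]
    freq = torch.bincount(valid, minlength=num_classes).float()
    return freq.sum() / (num_classes * freq.clamp(min=1.0))

def balanced_cross_entropy(logits, labels, num_classes, ignore_index=255):
    """Cross-entropy where rare classes contribute larger per-pixel weights."""
    w = class_balance_weights(labels, num_classes, ignore_index)
    return F.cross_entropy(logits, labels, weight=w, ignore_index=ignore_index)
```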
arXiv Detail & Related papers (2020-07-19T18:44:34Z)
- Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object Segmentation (VOS).
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance on the DAVIS benchmark in both speed and accuracy, without complicated bells and whistles, running at 0.14 seconds per frame with a J&F measure of 75.9%.
arXiv Detail & Related papers (2020-07-11T05:44:16Z)
- Unsupervised Learning Consensus Model for Dynamic Texture Videos Segmentation [12.462608802359936]
We present an effective unsupervised learning consensus model (ULCM) for the segmentation of dynamic texture videos.
In the proposed model, the values of the requantized local binary pattern (LBP) histogram around the pixel to be classified are used as features.
Experiments conducted on the challenging SynthDB dataset show that ULCM is significantly faster, simpler to implement, and has fewer parameters.
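As an illustration of the features described above, here is a minimal NumPy sketch of an 8-neighbour LBP code and a requantized local histogram; the window size, bin count, and requantization step are our assumptions, since the summary does not specify them.

```python
import numpy as np

def lbp_codes(gray):
    """8-neighbour local binary pattern code for each interior pixel."""
    c = gray[1:-1, 1:-1]
    neighbours = [gray[:-2, :-2], gray[:-2, 1:-1], gray[:-2, 2:],
                  gray[1:-1, 2:], gray[2:, 2:],    gray[2:, 1:-1],
                  gray[2:, :-2],  gray[1:-1, :-2]]
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, n in enumerate(neighbours):
        # Set the bit when the neighbour is at least as bright as the centre.
        code |= (n >= c).astype(np.uint8) << bit
    return code

def local_lbp_histogram(codes, y, x, win=7, n_bins=16):
    """Requantized LBP histogram in a win x win window around pixel (y, x)."""
    h = win // 2
    patch = codes[max(0, y - h):y + h + 1, max(0, x - h):x + h + 1]
    quantized = patch // (256 // n_bins)   # requantize 256 codes to n_bins
    hist = np.bincount(quantized.ravel(), minlength=n_bins).astype(float)
    return hist / hist.sum()
```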
arXiv Detail & Related papers (2020-06-29T16:40:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.