CRVOS: Clue Refining Network for Video Object Segmentation
- URL: http://arxiv.org/abs/2002.03651v4
- Date: Tue, 2 Jun 2020 08:15:08 GMT
- Title: CRVOS: Clue Refining Network for Video Object Segmentation
- Authors: Suhwan Cho, MyeongAh Cho, Tae-young Chung, Heansung Lee, Sangyoun Lee
- Abstract summary: We propose a real-time network, Clue Refining Network for Video Object Segmentation (CRVOS), which has no intermediate network and thus handles these scenarios efficiently.
Our proposed method achieves the fastest speed among existing methods with competitive accuracy.
On the DAVIS 2016 validation set, our method achieves 63.5 fps and a J&F score of 81.6%.
- Score: 5.947279761429668
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The encoder-decoder based methods for semi-supervised video object
segmentation (Semi-VOS) have received extensive attention due to their superior
performance. However, most of them rely on complex intermediate networks that
generate strong specifiers to stay robust in challenging scenarios, which is
quite inefficient when dealing with relatively simple scenarios. To solve this
problem, we propose a real-time network, Clue Refining Network for Video Object
Segmentation (CRVOS), which has no intermediate network and thus handles these
scenarios efficiently. In this work, we propose a simple specifier, referred to
as the Clue, which consists of the previous frame's coarse mask and coordinate
information. We also propose a novel refine module that performs better than
conventional ones by using a deconvolution layer instead of a bilinear
upsampling layer. Our proposed method achieves the fastest speed among existing
methods with competitive accuracy. On the DAVIS 2016 validation set, our method
achieves 63.5 fps and a J&F score of 81.6%.
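The abstract's key design choice, a deconvolution (transposed convolution) layer in place of fixed bilinear upsampling, can be illustrated in one dimension. This is an illustrative sketch, not the paper's implementation: the kernel weights below stand in for parameters a network would learn, whereas bilinear interpolation has no learnable weights at all.

```python
def bilinear_upsample_1d(x, scale=2):
    """Fixed (non-learnable) upsampling by linear interpolation."""
    n = len(x)
    out = []
    for i in range(n * scale):
        pos = i / scale
        lo = min(int(pos), n - 1)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(x[lo] * (1 - frac) + x[hi] * frac)
    return out

def transposed_conv_1d(x, kernel, stride=2):
    """Learnable upsampling: each input value scatters a weighted copy
    of the kernel into the output, so the weights can be trained."""
    out_len = (len(x) - 1) * stride + len(kernel)
    out = [0.0] * out_len
    for i, v in enumerate(x):
        for j, w in enumerate(kernel):
            out[i * stride + j] += v * w
    return out

signal = [1.0, 2.0, 3.0]
print(bilinear_upsample_1d(signal))                  # fixed interpolation
print(transposed_conv_1d(signal, [0.5, 1.0, 0.5]))   # weights could be learned
```

The fixed interpolation always blends neighbors the same way; the transposed convolution can learn kernels that sharpen mask boundaries, which is the motivation the abstract gives for the refine module.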
Related papers
- Space-time Reinforcement Network for Video Object Segmentation [16.67780344875854]
Video object segmentation (VOS) networks typically use memory-based methods.
These methods suffer from two issues: 1) Challenging data can destroy the space-time coherence between adjacent video frames, and 2) Pixel-level matching will lead to undesired mismatching.
In this paper, we propose to generate an auxiliary frame between adjacent frames, serving as an implicit short-temporal reference for the query one.
arXiv Detail & Related papers (2024-05-07T06:26:30Z) - 1st Place Solution for 5th LSVOS Challenge: Referring Video Object
Segmentation [65.45702890457046]
We integrate strengths of leading RVOS models to build up an effective paradigm.
To improve the consistency and quality of masks, we propose Two-Stage Multi-Model Fusion strategy.
Our method achieves 75.7% J&F on the Ref-Youtube-VOS validation set and 70% J&F on the test set, ranking 1st place on the 5th Large-scale Video Object Segmentation Challenge (ICCV 2023), track 3.
arXiv Detail & Related papers (2024-01-01T04:24:48Z) - Cross-CBAM: A Lightweight network for Scene Segmentation [2.064612766965483]
We present the Cross-CBAM network, a novel lightweight network for real-time semantic segmentation.
In experiments on the Cityscapes and CamVid datasets, we achieve 73.4% mIoU at 240.9 FPS and 77.2% mIoU at 88.6 FPS on an NVIDIA GTX 1080Ti.
arXiv Detail & Related papers (2023-06-04T09:03:05Z) - Deep Unsupervised Key Frame Extraction for Efficient Video
Classification [63.25852915237032]
This work presents an unsupervised method to retrieve key frames, combining a Convolutional Neural Network (CNN) with Temporal Segment Density Peaks Clustering (TSDPC).
The proposed TSDPC is a generic and powerful framework with two advantages over previous works; one is that it can determine the number of key frames automatically.
Furthermore, a Long Short-Term Memory network (LSTM) is added on the top of the CNN to further elevate the performance of classification.
arXiv Detail & Related papers (2022-11-12T20:45:35Z) - Stage-Aware Feature Alignment Network for Real-Time Semantic
Segmentation of Street Scenes [59.81228011432776]
We present a novel Stage-aware Feature Alignment Network (SFANet) for real-time semantic segmentation of street scenes.
By taking into account the unique role of each stage in the decoder, a novel stage-aware Feature Enhancement Block (FEB) is designed to enhance spatial details and contextual information of feature maps from the encoder.
Experimental results show that the proposed SFANet exhibits a good balance between accuracy and speed for real-time semantic segmentation of street scenes.
arXiv Detail & Related papers (2022-03-08T11:46:41Z) - Adaptive Focus for Efficient Video Recognition [29.615394426035074]
We propose a reinforcement learning based approach for efficient spatially adaptive video recognition (AdaFocus).
A lightweight ConvNet is first adopted to quickly process the full video sequence, whose features are used by a recurrent policy network to localize the most task-relevant regions.
During offline inference, once the informative patch sequence has been generated, the bulk of computation can be done in parallel, and is efficient on modern GPU devices.
arXiv Detail & Related papers (2021-05-07T13:24:47Z) - Rethinking BiSeNet For Real-time Semantic Segmentation [6.622485130017622]
BiSeNet has proven to be a popular two-stream network for real-time segmentation.
We propose a novel structure named Short-Term Dense Concatenate network (STDC) by removing structure redundancy.
arXiv Detail & Related papers (2021-04-27T13:49:47Z) - Learning Dynamic Network Using a Reuse Gate Function in Semi-supervised
Video Object Segmentation [27.559093073097483]
Current approaches for semi-supervised video object segmentation (Semi-VOS) propagate information from previous frames to generate a segmentation mask for the current frame.
We exploit this observation by using temporal information to quickly identify frames with minimal change.
We propose a novel dynamic network that estimates change across frames and decides which path -- computing a full network or reusing previous frame's feature -- to choose.
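The path decision described above can be sketched in a few lines. This is a toy illustration under assumed names, not the paper's network: `frame_change` stands in for the learned change estimator, and `full_net` is a hypothetical placeholder for the expensive segmentation model.

```python
def frame_change(prev, curr):
    """Cheap change estimate: mean absolute pixel difference."""
    return sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)

def segment(prev_frame, curr_frame, prev_feat, full_net, threshold=0.1):
    """Gate: reuse the previous frame's feature when the estimated change
    is small; otherwise run the full (expensive) network."""
    if frame_change(prev_frame, curr_frame) < threshold:
        return prev_feat, "reused"
    return full_net(curr_frame), "full"

# Toy stand-in for the full segmentation network.
full_net = lambda frame: [round(v) for v in frame]

feat, path = segment([0.0, 0.0], [0.01, 0.0], prev_feat=[0, 0], full_net=full_net)
print(path)  # small change -> "reused"
```

The point of the gate is that static stretches of video skip the full forward pass entirely, which is where the speedup comes from.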
arXiv Detail & Related papers (2020-12-21T19:40:17Z) - Real-Time High-Performance Semantic Image Segmentation of Urban Street
Scenes [98.65457534223539]
We propose a real-time high-performance DCNN-based method for robust semantic segmentation of urban street scenes.
The proposed method achieves 73.6% and 68.0% mean Intersection over Union (mIoU) at inference speeds of 51.0 fps and 39.3 fps, respectively.
arXiv Detail & Related papers (2020-03-11T08:45:53Z) - Learning Fast and Robust Target Models for Video Object Segmentation [83.3382606349118]
Video object segmentation (VOS) is a highly challenging problem since the initial mask, defining the target object, is only given at test-time.
Most previous approaches fine-tune segmentation networks on the first frame, resulting in impractical frame-rates and risk of overfitting.
We propose a novel VOS architecture consisting of two network components.
arXiv Detail & Related papers (2020-02-27T21:58:06Z) - Efficient Video Semantic Segmentation with Labels Propagation and
Refinement [138.55845680523908]
This paper tackles the problem of real-time semantic segmentation of high definition videos using a hybrid GPU / CPU approach.
We propose an Efficient Video Segmentation (EVS) pipeline that combines: (i) on the CPU, a very fast optical flow method that exploits the temporal aspect of the video and propagates semantic information from one frame to the next.
On the popular Cityscapes dataset with high resolution frames (2048 x 1024), the proposed operating points range from 80 to 1000 Hz on a single GPU and CPU.
arXiv Detail & Related papers (2019-12-26T11:45:15Z)
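The CPU-side propagation step in the EVS pipeline above can be sketched as label warping with per-pixel flow. This is a simplified illustration, not the paper's method: real optical flow is sub-pixel and the paper pairs propagation with a refinement stage, while this toy version uses integer displacements and marks unfilled pixels with -1.

```python
def propagate_labels(labels, flow):
    """Warp per-pixel semantic labels from frame t to frame t+1 using
    integer optical-flow displacements (dy, dx); -1 marks pixels that
    received no propagated label (disocclusions, out-of-frame motion)."""
    h, w = len(labels), len(labels[0])
    out = [[-1] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dy, dx = flow[y][x]
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = labels[y][x]
    return out

labels = [[1, 2],
          [3, 4]]
flow = [[(0, 1), (0, 1)],
        [(0, 1), (0, 1)]]          # everything moves one pixel right
print(propagate_labels(labels, flow))  # [[-1, 1], [-1, 3]]
```

Because warping reuses labels instead of re-running the segmentation network, it is cheap enough to run on the CPU, which is how the pipeline reaches its high-rate operating points.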
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.