Directional Deep Embedding and Appearance Learning for Fast Video Object
Segmentation
- URL: http://arxiv.org/abs/2002.06736v1
- Date: Mon, 17 Feb 2020 01:51:57 GMT
- Title: Directional Deep Embedding and Appearance Learning for Fast Video Object
Segmentation
- Authors: Yingjie Yin, De Xu, Xingang Wang and Lei Zhang
- Abstract summary: We propose a directional deep embedding and appearance learning (DDEAL) method, which is free of the online fine-tuning process.
Our method achieves a state-of-the-art VOS performance without using online fine-tuning.
- Score: 11.10636117512819
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most recent semi-supervised video object segmentation (VOS) methods rely on
fine-tuning deep convolutional neural networks online using the given mask of
the first frame or predicted masks of subsequent frames. However, the online
fine-tuning process is usually time-consuming, limiting the practical use of
such methods. We propose a directional deep embedding and appearance learning
(DDEAL) method, which is free of the online fine-tuning process, for fast VOS.
First, a global directional matching module, which can be efficiently
implemented by parallel convolutional operations, is proposed to learn a
semantic pixel-wise embedding as an internal guidance. Second, an effective
statistics-based directional appearance model is proposed to represent the
target and background on a spherical embedding space for VOS. Equipped with the
global directional matching module and the directional appearance model
learning module, DDEAL learns static cues from the labeled first frame and
dynamically updates cues of the subsequent frames for object segmentation. Our
method exhibits state-of-the-art VOS performance without using online
fine-tuning. Specifically, it achieves a J & F mean score of 74.8% on the DAVIS
2017 dataset and an overall score G of 71.3% on the large-scale YouTube-VOS
dataset, while retaining a speed of 25 fps with a single NVIDIA TITAN Xp GPU.
Furthermore, our faster version runs at 31 fps with only a slight loss in accuracy.
Our code and trained networks are available at
https://github.com/YingjieYin/Directional-Deep-Embedding-and-Appearance-Learning-for-Fast-Video-Object-Segmentation.
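To make the two modules more concrete, the following is a minimal, hedged sketch (in PyTorch) of the general ideas described in the abstract: global matching of per-pixel embeddings against first-frame target pixels via 1x1 convolutions (equivalent to cosine similarity on unit-normalized embeddings), and a directional appearance statistic kept as a running mean direction on the unit sphere. All names, shapes, and the momentum update are illustrative assumptions, not the authors' exact DDEAL formulation.

```python
import torch
import torch.nn.functional as F


def global_directional_matching(embed_cur, embed_ref, mask_ref, top_k=1):
    """Match current-frame pixels to first-frame target pixels.

    Cosine similarity between unit-normalized embeddings is computed with a
    single 1x1 convolution whose kernels are the reference target embeddings,
    so all matches are evaluated in parallel.

    embed_cur: (C, H, W) embedding of the current frame
    embed_ref: (C, H, W) embedding of the first (annotated) frame
    mask_ref:  (H, W) boolean target mask of the first frame
    """
    c = embed_cur.shape[0]
    cur = F.normalize(embed_cur, dim=0).unsqueeze(0)       # (1, C, H, W)
    ref = F.normalize(embed_ref, dim=0)                    # (C, H, W)
    target_vecs = ref[:, mask_ref].t()                     # (N, C): one row per target pixel
    kernels = target_vecs.reshape(-1, c, 1, 1)             # (N, C, 1, 1) 1x1 conv kernels
    sim = F.conv2d(cur, kernels)                           # (1, N, H, W) cosine similarities
    best, _ = sim.topk(top_k, dim=1)                       # strongest match(es) per pixel
    return best.mean(dim=1).squeeze(0)                     # (H, W) matching-score map


class DirectionalAppearanceModel:
    """Running mean direction of target embeddings on the unit sphere.

    The mean direction is a simple directional statistic; new pixels are
    scored by their cosine similarity to it, and it is re-estimated from the
    predicted mask of every new frame (the momentum update below is an
    assumption made for this sketch).
    """

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.mean_dir = None                               # (C,) unit vector

    def update(self, embed, mask):
        """embed: (C, H, W) frame embedding; mask: (H, W) boolean target mask."""
        if mask.sum() == 0:
            return
        vecs = F.normalize(embed, dim=0)[:, mask]          # (C, N) unit embeddings
        new_dir = F.normalize(vecs.mean(dim=1), dim=0)     # mean direction on the sphere
        if self.mean_dir is None:
            self.mean_dir = new_dir
        else:
            mixed = self.momentum * self.mean_dir + (1 - self.momentum) * new_dir
            self.mean_dir = F.normalize(mixed, dim=0)

    def score(self, embed):
        """Per-pixel cosine similarity to the stored mean direction, shape (H, W)."""
        return torch.einsum("c,chw->hw", self.mean_dir, F.normalize(embed, dim=0))
```

In this sketch, the first-frame matching map plays the role of the static cue, while the mean-direction statistic is what gets updated frame by frame from predicted masks; the actual DDEAL modules are more elaborate than this.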
Related papers
- Online Unsupervised Video Object Segmentation via Contrastive Motion
Clustering [27.265597448266988]
Online unsupervised video object segmentation (UVOS) uses the previous frames as its input to automatically separate the primary object(s) from a streaming video without using any further manual annotation.
A major challenge is that the model has no access to the future and must rely solely on the history, i.e., the segmentation mask is predicted from the current frame as soon as it is captured.
In this work, a novel contrastive motion clustering algorithm with an optical flow as its input is proposed for the online UVOS by exploiting the common fate principle that visual elements tend to be perceived as a group if they possess the same
arXiv Detail & Related papers (2023-06-21T06:40:31Z) - RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in
Autonomous Driving [80.14669385741202]
Vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks.
ViTs are notoriously hard to train and require a lot of training data to learn powerful representations.
We show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and Semantic KITTI.
arXiv Detail & Related papers (2023-01-24T18:50:48Z) - Optimization Planning for 3D ConvNets [123.43419144051703]
It is not trivial to optimally train 3D Convolutional Neural Networks (3D ConvNets) due to the high complexity and the various options of the training scheme.
We decompose the path into a series of training "states" and specify the hyper-parameters, e.g., learning rate and the length of input clips, in each state.
We perform dynamic programming over all the candidate states to plan the optimal permutation of states, i.e., optimization path.
arXiv Detail & Related papers (2022-01-11T16:13:31Z) - FAMINet: Learning Real-time Semi-supervised Video Object Segmentation
with Steepest Optimized Optical Flow [21.45623125216448]
Semi-supervised video object segmentation (VOS) aims to segment a few moving objects in a video sequence, where these objects are specified by the annotation of the first frame.
The optical flow has been considered in many existing semi-supervised VOS methods to improve the segmentation accuracy.
A FAMINet, which consists of a feature extraction network (F), an appearance network (A), a motion network (M), and an integration network (I), is proposed in this study to address the abovementioned problem.
arXiv Detail & Related papers (2021-11-20T07:24:33Z) - Learning Dynamic Network Using a Reuse Gate Function in Semi-supervised
Video Object Segmentation [27.559093073097483]
Current approaches for semi-supervised video object segmentation (Semi-VOS) propagate information from previous frames to generate a segmentation mask for the current frame.
We exploit the observation that consecutive frames often change little by using temporal information to quickly identify frames with minimal change.
We propose a novel dynamic network that estimates the change across frames and decides which path -- computing the full network or reusing the previous frame's features -- to choose.
arXiv Detail & Related papers (2020-12-21T19:40:17Z) - Make One-Shot Video Object Segmentation Efficient Again [7.7415390727490445]
Video object segmentation (VOS) describes the task of segmenting a set of objects in each frame of a video.
e-OSVOS decouples the object detection task and predicts only local segmentation masks by applying a modified version of Mask R-CNN.
e-OSVOS provides state-of-the-art results on DAVIS 2016, DAVIS 2017, and YouTube-VOS for one-shot fine-tuning methods.
arXiv Detail & Related papers (2020-12-03T12:21:23Z) - Fast Video Object Segmentation With Temporal Aggregation Network and
Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into video object segmentation (VOS).
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance in both speed and accuracy on the DAVIS benchmark without complicated bells and whistles, running at 0.14 seconds per frame with a J&F measure of 75.9%.
arXiv Detail & Related papers (2020-07-11T05:44:16Z) - Real-Time High-Performance Semantic Image Segmentation of Urban Street
Scenes [98.65457534223539]
We propose a real-time high-performance DCNN-based method for robust semantic segmentation of urban street scenes.
The proposed method achieves 73.6% and 68.0% mean Intersection over Union (mIoU) at inference speeds of 51.0 fps and 39.3 fps, respectively.
arXiv Detail & Related papers (2020-03-11T08:45:53Z) - Learning Fast and Robust Target Models for Video Object Segmentation [83.3382606349118]
Video object segmentation (VOS) is a highly challenging problem since the initial mask, defining the target object, is only given at test-time.
Most previous approaches fine-tune segmentation networks on the first frame, resulting in impractical frame-rates and a risk of overfitting.
We propose a novel VOS architecture consisting of two network components.
arXiv Detail & Related papers (2020-02-27T21:58:06Z) - Efficient Video Semantic Segmentation with Labels Propagation and
Refinement [138.55845680523908]
This paper tackles the problem of real-time semantic segmentation of high definition videos using a hybrid GPU / CPU approach.
We propose an Efficient Video Segmentation (EVS) pipeline that combines: (i) on the CPU, a very fast optical flow method that is used to exploit the temporal aspect of the video and propagate semantic information from one frame to the next (a minimal sketch of this kind of flow-based label propagation is given after this list).
On the popular Cityscapes dataset with high resolution frames (2048 x 1024), the proposed operating points range from 80 to 1000 Hz on a single GPU and CPU.
arXiv Detail & Related papers (2019-12-26T11:45:15Z)
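As referenced in the EVS entry above, here is a minimal, hedged sketch of flow-based label propagation in PyTorch: a per-pixel label (or probability) map from frame t is warped to frame t+1 with a backward optical flow field using bilinear sampling. The function name, the flow convention (x/y displacement channels), and the use of grid_sample are illustrative assumptions, not the EVS paper's exact implementation.

```python
import torch
import torch.nn.functional as F


def propagate_labels(labels, flow):
    """Warp a label/probability map from frame t to frame t+1.

    labels: (N, C, H, W) per-pixel class probabilities at frame t
    flow:   (N, 2, H, W) backward flow; for each pixel of frame t+1 it gives
            the (x, y) displacement of its source location in frame t
    """
    n, c, h, w = labels.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=labels.dtype),
        torch.arange(w, dtype=labels.dtype),
        indexing="ij",
    )
    src_x = xs + flow[:, 0]                        # (N, H, W) source x-coordinates
    src_y = ys + flow[:, 1]                        # (N, H, W) source y-coordinates
    # normalize coordinates to [-1, 1], as expected by grid_sample
    grid = torch.stack(
        [2.0 * src_x / (w - 1) - 1.0, 2.0 * src_y / (h - 1) - 1.0], dim=-1
    )                                              # (N, H, W, 2)
    return F.grid_sample(labels, grid, mode="bilinear", align_corners=True)


# Example usage with hypothetical shapes: zero flow yields an identity warp.
probs_t = torch.rand(1, 19, 512, 1024).softmax(dim=1)
flow = torch.zeros(1, 2, 512, 1024)
probs_t1 = propagate_labels(probs_t, flow)
```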