Video K-Net: A Simple, Strong, and Unified Baseline for Video
Segmentation
- URL: http://arxiv.org/abs/2204.04656v1
- Date: Sun, 10 Apr 2022 11:24:47 GMT
- Title: Video K-Net: A Simple, Strong, and Unified Baseline for Video
Segmentation
- Authors: Xiangtai Li, Wenwei Zhang, Jiangmiao Pang, Kai Chen, Guangliang Cheng,
Yunhai Tong, Chen Change Loy
- Abstract summary: Video K-Net is a framework for end-to-end video panoptic segmentation.
It unifies image segmentation via a group of learnable kernels.
Video K-Net learns to simultaneously segment and track "things" and "stuff" in a video.
- Score: 85.08156742410527
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents Video K-Net, a simple, strong, and unified framework for
fully end-to-end video panoptic segmentation. The method is built upon K-Net, a
method that unifies image segmentation via a group of learnable kernels. We
observe that these learnable kernels from K-Net, which encode object
appearances and contexts, can naturally associate identical instances across
video frames. Motivated by this observation, Video K-Net learns to
simultaneously segment and track "things" and "stuff" in a video with simple
kernel-based appearance modeling and cross-temporal kernel interaction. Despite
the simplicity, it achieves state-of-the-art video panoptic segmentation
results on Cityscapes-VPS and KITTI-STEP without bells and whistles. In
particular, on KITTI-STEP the simple method yields an almost 12% relative
improvement over previous methods. We also validate its generalization on
video semantic segmentation, where we boost various baselines by 2% on the
VSPW dataset. Moreover, we extend K-Net into a clip-level video framework for
video instance segmentation, obtaining 40.5% mAP with a ResNet50 backbone and
51.5% mAP with Swin-base on the YouTube-VIS 2019 validation set. We hope this simple
yet effective method can serve as a new flexible baseline in video
segmentation. Both code and models are released at
https://github.com/lxtGH/Video-K-Net
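The cross-frame association the abstract describes is straightforward to picture in code: each learnable kernel is a per-instance embedding, so instances in adjacent frames can be linked by kernel similarity. The sketch below is an illustrative reading of that idea, not the released implementation; `match_kernels` and all tensor shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def match_kernels(kernels_t, kernels_t1):
    """Associate instance kernels across two frames by cosine similarity.

    kernels_t, kernels_t1: (N, C) tensors of per-instance kernel embeddings
    from frames t and t+1. Returns, for each kernel at frame t, the index of
    its best match at frame t+1. A hypothetical simplification of the paper's
    kernel-based appearance modeling.
    """
    sim = F.normalize(kernels_t, dim=1) @ F.normalize(kernels_t1, dim=1).T  # (N, N)
    return sim.argmax(dim=1)

# Toy usage: 4 instances with 256-dim kernels, permuted and slightly perturbed.
k_t = torch.randn(4, 256)
k_t1 = k_t[[2, 0, 3, 1]] + 0.05 * torch.randn(4, 256)
print(match_kernels(k_t, k_t1))  # expected: tensor([1, 3, 0, 2])
```

In the full model, matched kernels additionally interact across frames to refine appearance (the "cross-temporal kernel interaction" of the abstract), which this toy version omits.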
Related papers
- DVIS++: Improved Decoupled Framework for Universal Video Segmentation [30.703276476607545]
By integrating CLIP with DVIS++, we present OV-DVIS++, the first open-vocabulary universal video segmentation framework.
arXiv Detail & Related papers (2023-12-20T03:01:33Z)
- Tracking Anything with Decoupled Video Segmentation [87.07258378407289]
We develop a decoupled video segmentation approach (DEVA)
It is composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation.
We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks.
arXiv Detail & Related papers (2023-09-07T17:59:41Z)
- You Only Segment Once: Towards Real-Time Panoptic Segmentation [68.91492389185744]
YOSO is a real-time panoptic segmentation framework.
YOSO predicts masks via dynamic convolutions between panoptic kernels and image feature maps.
YOSO achieves 46.4 PQ, 45.6 FPS on COCO; 52.5 PQ, 22.6 FPS on Cityscapes; 38.0 PQ, 35.4 FPS on ADE20K.
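The dynamic-convolution step in YOSO's summary amounts to applying each panoptic kernel as a per-segment 1x1 filter over a shared feature map. A minimal sketch of that operation, with shapes that are assumptions rather than YOSO's actual configuration:

```python
import torch

def dynamic_conv_masks(kernels, features):
    """Predict masks by convolving per-instance kernels with a feature map.

    kernels:  (N, C)      one learned kernel per predicted segment
    features: (C, H, W)   shared image feature map
    returns:  (N, H, W)   one mask logit map per kernel
    Equivalent to N parallel 1x1 convolutions; shapes are illustrative.
    """
    return torch.einsum('nc,chw->nhw', kernels, features)

masks = dynamic_conv_masks(torch.randn(100, 256), torch.randn(256, 64, 64))
print(masks.shape)  # torch.Size([100, 64, 64])
```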
arXiv Detail & Related papers (2023-03-26T07:55:35Z)
- Fine-tuned CLIP Models are Efficient Video Learners [54.96069171726668]
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model.
Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
arXiv Detail & Related papers (2022-12-06T18:59:58Z)
- End-to-end video instance segmentation via spatial-temporal graph neural networks [30.748756362692184]
Video instance segmentation is a challenging task that extends image instance segmentation to the video domain.
Existing methods either rely only on single-frame information for the detection and segmentation subproblems or handle tracking as a separate post-processing step.
We propose a novel graph-neural-network (GNN) based method to address the aforementioned limitations.
arXiv Detail & Related papers (2022-03-07T05:38:08Z)
- Efficient Video Instance Segmentation via Tracklet Query and Proposal [62.897552852894854]
Video Instance Segmentation aims to simultaneously classify, segment, and track multiple object instances in videos.
Most clip-level methods are neither end-to-end learnable nor real-time.
This paper proposes EfficientVIS, a fully end-to-end framework with efficient training and inference.
arXiv Detail & Related papers (2022-03-03T17:00:11Z)
- Efficient Video Object Segmentation with Compressed Video [36.192735485675286]
We propose an efficient framework for semi-supervised video object segmentation by exploiting the temporal redundancy of the video.
Our method performs inference on selected keyframes and makes predictions for other frames via propagation based on motion vectors and residuals from the compressed video bitstream.
With STM with top-k filtering as our base model, we achieve highly competitive results on DAVIS16 and YouTube-VOS, with speedups of up to 4.9X and little loss in accuracy.
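The propagation idea can be pictured as warping the keyframe prediction with the motion field recovered from the bitstream. The toy sketch below assumes dense per-pixel motion vectors and ignores residual correction, so it simplifies the actual method considerably:

```python
import torch
import torch.nn.functional as F

def propagate_mask(key_mask, motion_vectors):
    """Warp a keyframe mask to a later frame using codec motion vectors.

    key_mask:       (1, 1, H, W) soft mask predicted on the keyframe
    motion_vectors: (1, 2, H, W) per-pixel (dx, dy) offsets, in pixels,
                    pointing from the current frame back to the keyframe.
    A simplified sketch; real bitstreams store block-level vectors plus
    residuals, which this toy version ignores.
    """
    _, _, h, w = key_mask.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    # Source coordinates in the keyframe, normalized to [-1, 1] for grid_sample.
    src_x = (xs + motion_vectors[0, 0]) / (w - 1) * 2 - 1
    src_y = (ys + motion_vectors[0, 1]) / (h - 1) * 2 - 1
    grid = torch.stack([src_x, src_y], dim=-1).unsqueeze(0)  # (1, H, W, 2)
    return F.grid_sample(key_mask, grid, align_corners=True)

# Zero motion returns the keyframe mask unchanged.
mask = propagate_mask(torch.rand(1, 1, 32, 32), torch.zeros(1, 2, 32, 32))
print(mask.shape)  # torch.Size([1, 1, 32, 32])
```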
arXiv Detail & Related papers (2021-07-26T12:57:04Z)
- K-Net: Towards Unified Image Segmentation [78.32096542571257]
The framework, named K-Net, segments both instances and semantic categories consistently by a group of learnable kernels.
K-Net can be trained in an end-to-end manner with bipartite matching, and its training and inference are naturally NMS-free and box-free.
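The bipartite matching that makes K-Net's training NMS-free follows the usual set-prediction recipe: each prediction is assigned to at most one ground-truth segment by minimizing a matching cost. A minimal sketch of the assignment step, using the Hungarian algorithm with a placeholder cost matrix rather than K-Net's actual classification-plus-mask cost:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def bipartite_match(cost):
    """One-to-one assignment of predictions to ground-truth segments.

    cost: (num_preds, num_gt) matrix, lower = better match.
    Returns (pred_indices, gt_indices). Unmatched predictions are
    supervised as 'no object' in a full set-prediction loss.
    """
    return linear_sum_assignment(cost)

# Toy cost: 4 predictions vs 3 ground-truth segments.
cost = np.array([[0.1, 0.9, 0.8],
                 [0.7, 0.2, 0.9],
                 [0.8, 0.9, 0.1],
                 [0.5, 0.5, 0.5]])
pred_idx, gt_idx = bipartite_match(cost)
print(pred_idx, gt_idx)  # [0 1 2] [0 1 2]
```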
arXiv Detail & Related papers (2021-06-28T17:18:21Z)