You Only Segment Once: Towards Real-Time Panoptic Segmentation
- URL: http://arxiv.org/abs/2303.14651v1
- Date: Sun, 26 Mar 2023 07:55:35 GMT
- Title: You Only Segment Once: Towards Real-Time Panoptic Segmentation
- Authors: Jie Hu, Linyan Huang, Tianhe Ren, Shengchuan Zhang, Rongrong Ji, and
Liujuan Cao
- Abstract summary: YOSO is a real-time panoptic segmentation framework.
YOSO predicts masks via dynamic convolutions between panoptic kernels and image feature maps.
YOSO achieves 46.4 PQ, 45.6 FPS on COCO; 52.5 PQ, 22.6 FPS on Cityscapes; 38.0 PQ, 35.4 FPS on ADE20K.
- Score: 68.91492389185744
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose YOSO, a real-time panoptic segmentation framework.
YOSO predicts masks via dynamic convolutions between panoptic kernels and image
feature maps, in which you only need to segment once for both instance and
semantic segmentation tasks. To reduce the computational overhead, we design a
feature pyramid aggregator for the feature map extraction, and a separable
dynamic decoder for the panoptic kernel generation. The aggregator
re-parameterizes interpolation-first modules in a convolution-first way, which
significantly speeds up the pipeline without any additional costs. The decoder
performs multi-head cross-attention via separable dynamic convolution for
better efficiency and accuracy. To the best of our knowledge, YOSO is the first
real-time panoptic segmentation framework that delivers competitive performance
compared to state-of-the-art models. Specifically, YOSO achieves 46.4 PQ, 45.6
FPS on COCO; 52.5 PQ, 22.6 FPS on Cityscapes; 38.0 PQ, 35.4 FPS on ADE20K; and
34.1 PQ, 7.1 FPS on Mapillary Vistas. Code is available at
https://github.com/hujiecpp/YOSO.
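
As a rough illustration of the "segment once" idea, the dynamic-convolution step from the abstract can be sketched as below. This is a minimal sketch under assumed tensor names and shapes (PyTorch), not the code from the linked repository: each panoptic kernel acts as a 1x1 dynamic convolution over the shared feature map, so thing and stuff masks come out of a single pass.

```python
import torch

# Minimal sketch of dynamic-convolution mask prediction (assumed shapes, not the official YOSO code).
# feat:    shared image feature map                        -> (B, C, H, W)
# kernels: panoptic kernels, one per thing/stuff query     -> (B, N, C)
B, C, H, W, N = 2, 256, 128, 256, 100
feat = torch.randn(B, C, H, W)       # aggregated feature map (e.g., from a feature pyramid aggregator)
kernels = torch.randn(B, N, C)       # panoptic kernels produced by a decoder

# Each kernel is applied as a 1x1 dynamic convolution over the feature map,
# yielding one mask per kernel in a single pass.
mask_logits = torch.einsum("bnc,bchw->bnhw", kernels, feat)   # (B, N, H, W)
masks = mask_logits.sigmoid()                                  # per-kernel mask probabilities
print(masks.shape)  # torch.Size([2, 100, 128, 256])
```

In the paper's pipeline, the feature pyramid aggregator would produce the feature map and the separable dynamic decoder would produce the kernels; a classification head (omitted here) assigns each kernel a thing or stuff label before the masks are merged into the panoptic output.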
Related papers
- Mask Propagation for Efficient Video Semantic Segmentation [63.09523058489429]
Video semantic segmentation (VSS) involves assigning a semantic label to each pixel in a video sequence.
We propose an efficient mask propagation framework for VSS, called MPVSS.
Our framework reduces up to 4x FLOPs compared to the per-frame Mask2Former with only up to 2% mIoU degradation on the Cityscapes validation set.
arXiv Detail & Related papers (2023-10-29T09:55:28Z)
- Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP [28.103358632241104]
We propose to build everything into a single-stage framework using a shared Frozen Convolutional CLIP backbone.
FC-CLIP sets a new state-of-the-art performance across various open-vocabulary semantic segmentation datasets.
arXiv Detail & Related papers (2023-08-04T17:59:01Z)
- Panoptic SegFormer [82.6258003344804]
We present Panoptic-SegFormer, a framework for end-to-end panoptic segmentation with Transformers.
With a ResNet-50 backbone, our method achieves 50.0% PQ on the COCO test-dev split.
Using a more powerful PVTv2-B5 backbone, Panoptic-SegFormer achieves a new record of 54.1% PQ and 54.4% PQ on the COCO val and test-dev splits with single scale input.
arXiv Detail & Related papers (2021-09-08T17:59:12Z)
- Fully Convolutional Networks for Panoptic Segmentation [91.84686839549488]
We present a conceptually simple, strong, and efficient framework for panoptic segmentation, called Panoptic FCN.
Our approach aims to represent and predict foreground things and background stuff in a unified fully convolutional pipeline.
Panoptic FCN encodes each object instance or stuff category into a specific kernel weight with the proposed kernel generator.
arXiv Detail & Related papers (2020-12-01T18:31:41Z)
- EOLO: Embedded Object Segmentation only Look Once [0.0]
We introduce an anchor-free and single-shot instance segmentation method, which is conceptually simple, fully convolutional with 3 independent branches, and can easily be embedded into mobile and embedded devices.
Our method, referred to as EOLO, reformulates instance segmentation as jointly predicting semantic segmentation and distinguishing overlapping objects, through instance center classification and 4D distance regression at each pixel.
Without any bells and whistles, EOLO achieves 27.7% in mask mAP under IoU50 and reaches 30 FPS on a 1080Ti GPU, with a single-model and single-scale training/testing on
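
The center-classification plus 4D-distance formulation in the EOLO entry above can be illustrated with a small, FCOS-style decode of a box from the distances predicted at a candidate center pixel. The names, shapes, and the argmax-based center selection are assumptions for illustration, not EOLO's released code.

```python
import torch

# Hypothetical sketch: recover a box from 4D distance regression at the
# most confident center pixel (names and shapes are assumptions).
H, W = 64, 64
center_logits = torch.randn(H, W)        # instance center classification map
dist = torch.rand(4, H, W) * 32          # per-pixel (left, top, right, bottom) distances

idx = center_logits.flatten().argmax()
cy, cx = divmod(idx.item(), W)           # most confident center location
l, t, r, b = dist[:, cy, cx].tolist()

box = (cx - l, cy - t, cx + r, cy + b)   # (x1, y1, x2, y2) in feature-map coordinates
print(box)
```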
arXiv Detail & Related papers (2020-03-31T21:22:05Z)
- EPSNet: Efficient Panoptic Segmentation Network with Cross-layer Attention Fusion [5.815742965809424]
We propose an Efficient Panoptic Segmentation Network (EPSNet) to tackle panoptic segmentation tasks with fast inference speed.
Basically, EPSNet generates masks based on a simple linear combination of prototype masks and mask coefficients.
To enhance the quality of shared prototypes, we adopt a module called the "cross-layer attention fusion module".
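
The "simple linear combination of prototype masks and mask coefficients" in the EPSNet entry above is a YOLACT-style assembly step; a minimal sketch under assumed shapes (not EPSNet's implementation) is:

```python
import torch

# Rough sketch of mask assembly from shared prototypes and per-instance
# coefficients (shapes and names are assumptions).
P, H, W, N = 32, 136, 136, 50
prototypes = torch.randn(P, H, W)      # shared prototype masks (EPSNet refines these with cross-layer attention fusion)
coefficients = torch.randn(N, P)       # one coefficient vector per detected instance

masks = coefficients @ prototypes.view(P, H * W)   # linear combination -> (N, H*W)
masks = masks.view(N, H, W).sigmoid()              # per-instance soft masks
binary = masks > 0.5                               # threshold for the final instance masks
```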
arXiv Detail & Related papers (2020-03-23T09:11:44Z)
- Unifying Training and Inference for Panoptic Segmentation [111.44758195510838]
We present an end-to-end network to bridge the gap between training and inference for panoptic segmentation.
Our system sets new records on the popular street scene dataset, Cityscapes, achieving 61.4 PQ with a ResNet-50 backbone.
Our network flexibly works with and without object mask cues, performing competitively under both settings.
arXiv Detail & Related papers (2020-01-14T18:58:24Z)
- Efficient Video Semantic Segmentation with Labels Propagation and Refinement [138.55845680523908]
This paper tackles the problem of real-time semantic segmentation of high definition videos using a hybrid GPU / CPU approach.
We propose an Efficient Video Segmentation (EVS) pipeline that combines: (i) on the CPU, a very fast optical flow method that is used to exploit the temporal aspect of the video and propagate semantic information from one frame to the next.
On the popular Cityscapes dataset with high resolution frames (2048 x 1024), the proposed operating points range from 80 to 1000 Hz on a single GPU and CPU.
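
The propagation step described in the EVS entry above, warping semantic predictions from one frame to the next with optical flow, can be sketched generically with grid sampling. Tensor names, resolution, and the use of grid_sample are assumptions for illustration rather than the paper's exact method.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: warp semantic logits from a key frame to the next frame
# using a dense optical-flow field (flow maps pixels in the next frame back to the key frame).
B, C, H, W = 1, 19, 256, 512              # 19 Cityscapes classes; downscaled for the sketch
logits_t = torch.randn(B, C, H, W)        # logits computed on the key frame (GPU model)
flow = torch.zeros(B, 2, H, W)            # backward flow for the next frame (fast CPU flow method)

# Build a sampling grid in normalized [-1, 1] coordinates shifted by the flow.
ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
gx = (xs.float() + flow[:, 0]) / (W - 1) * 2 - 1
gy = (ys.float() + flow[:, 1]) / (H - 1) * 2 - 1
grid = torch.stack((gx, gy), dim=-1)      # (B, H, W, 2)

logits_next = F.grid_sample(logits_t, grid, align_corners=True)
labels_next = logits_next.argmax(dim=1)   # propagated per-pixel labels for the next frame
```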
arXiv Detail & Related papers (2019-12-26T11:45:15Z)