Video-kMaX: A Simple Unified Approach for Online and Near-Online Video Panoptic Segmentation
- URL: http://arxiv.org/abs/2304.04694v1
- Date: Mon, 10 Apr 2023 16:17:25 GMT
- Title: Video-kMaX: A Simple Unified Approach for Online and Near-Online Video Panoptic Segmentation
- Authors: Inkyu Shin, Dahun Kim, Qihang Yu, Jun Xie, Hong-Seok Kim, Bradley Green, In So Kweon, Kuk-Jin Yoon, Liang-Chieh Chen
- Abstract summary: Video Panoptic Segmentation (VPS) aims to achieve comprehensive pixel-level scene understanding by segmenting all pixels and associating objects in a video.
Current solutions can be categorized into online and near-online approaches.
We propose a unified approach for online and near-online VPS.
- Score: 104.27219170531059
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Video Panoptic Segmentation (VPS) aims to achieve comprehensive pixel-level
scene understanding by segmenting all pixels and associating objects in a
video. Current solutions can be categorized into online and near-online
approaches. Evolving over time, each category has its own specialized
designs, making it nontrivial to adapt models between different categories. To
alleviate the discrepancy, in this work, we propose a unified approach for
online and near-online VPS. The meta architecture of the proposed Video-kMaX
consists of two components: a within-clip segmenter (for clip-level
segmentation) and a cross-clip associater (for association beyond clips). We
propose clip-kMaX
(clip k-means mask transformer) and HiLA-MB (Hierarchical Location-Aware Memory
Buffer) to instantiate the segmenter and associater, respectively. Our general
formulation includes the online scenario as a special case by adopting a clip
length of one. Without bells and whistles, Video-kMaX sets a new
state-of-the-art on KITTI-STEP and VIPSeg for video panoptic segmentation, and
VSPW for video semantic segmentation. Code will be made publicly available.
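For intuition, the meta architecture can be read as a plain loop over clips: segment each clip, then associate its predictions with what has been seen so far. The sketch below illustrates that reading only; `segmenter`, `associater`, and every method name are hypothetical stand-ins for clip-kMaX and HiLA-MB, not the paper's actual code.

```python
from typing import Callable, List

def video_panoptic_inference(frames: List, segmenter: Callable,
                             associater, clip_len: int = 2) -> List:
    """Illustrative sketch of the Video-kMaX meta architecture (hypothetical
    interfaces): a within-clip segmenter yields clip-level panoptic masks,
    and a cross-clip associater links object IDs across clip boundaries."""
    results = []
    for start in range(0, len(frames), clip_len):
        clip = frames[start:start + clip_len]
        # Within-clip segmentation (clip-kMaX in the paper): object IDs are
        # consistent inside the clip only.
        clip_masks = segmenter(clip)
        # Cross-clip association (HiLA-MB in the paper): match clip-level IDs
        # against a memory of previously seen objects for video-level IDs.
        clip_masks = associater.associate(clip_masks)
        results.extend(clip_masks)
    return results

# With clip_len=1 the same loop degenerates to frame-by-frame (online) VPS.
```

Note how the online scenario falls out as the special case `clip_len=1`, exactly as the abstract states.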
Related papers
- Rethinking Video Segmentation with Masked Video Consistency: Did the Model Learn as Intended? [22.191260650245443]
Video segmentation aims at partitioning video sequences into meaningful segments based on objects or regions of interest within frames.
Current video segmentation models are often derived from image segmentation techniques, which struggle to cope with small-scale or class-imbalanced video datasets.
We propose a training strategy, Masked Video Consistency, which enhances spatial and temporal feature aggregation.
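The summary does not spell out the mechanism, but masked-consistency objectives are typically implemented by matching predictions on a patch-masked clip against predictions on the intact clip. The snippet below is a hypothetical sketch of that general idea only, not the paper's actual loss; the masking scheme, the KL matching, and all tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def masked_consistency_loss(model, frames, mask_ratio=0.5, patch=16):
    """Hypothetical masked-consistency sketch: predictions on a patch-masked
    clip are pushed toward predictions on the intact clip. frames is a
    (B, T, C, H, W) tensor with H and W divisible by `patch`; `model` is
    assumed to return (B, T, K, H, W) class logits."""
    with torch.no_grad():
        target = model(frames)  # predictions on the intact clip
    b, t, c, h, w = frames.shape
    # Randomly drop square patches from every frame.
    grid = torch.rand(b * t, 1, h // patch, w // patch, device=frames.device)
    keep = F.interpolate((grid > mask_ratio).float(), scale_factor=patch)
    masked = frames * keep.view(b, t, 1, h, w)
    pred = model(masked)  # predictions on the masked clip
    return F.kl_div(pred.log_softmax(2), target.softmax(2),
                    reduction="batchmean")
```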
arXiv Detail & Related papers (2024-08-20T08:08:32Z) - Scene Summarization: Clustering Scene Videos into Spatially Diverse
Frames [24.614476456145255]
We propose summarization as a new video-based scene understanding task.
It aims to summarize a long video walkthrough of a scene into a small set of frames that are spatially diverse in the scene.
Our solution is a two-stage self-supervised pipeline named SceneSum.
arXiv Detail & Related papers (2023-11-28T22:18:26Z) - Tracking Anything with Decoupled Video Segmentation [87.07258378407289]
We develop a decoupled video segmentation approach (DEVA).
It is composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation.
We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks.
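Under this formulation the image-level segmenter can be swapped per task while the propagation module is reused unchanged. The control flow below is one plausible illustration with hypothetical interfaces (`image_segmenter`, `propagator`); it is not DEVA's actual API and omits details such as the bi-directional in-clip propagation.

```python
def decoupled_video_segmentation(frames, image_segmenter, propagator,
                                 detect_every=5):
    """Sketch of decoupled video segmentation: a task-specific image model
    proposes segments on selected frames, while a task-agnostic temporal
    module carries segments to the remaining frames and merges in new
    proposals. All interfaces here are hypothetical stand-ins."""
    segments = image_segmenter(frames[0])
    propagator.update_memory(frames[0], segments)
    outputs = [segments]
    for i, frame in enumerate(frames[1:], start=1):
        segments = propagator.propagate(frame)  # class/task-agnostic tracking
        if i % detect_every == 0:
            proposals = image_segmenter(frame)  # task-specific detection
            # Keep tracked identities, admit newly detected objects.
            segments = propagator.merge(segments, proposals)
        propagator.update_memory(frame, segments)
        outputs.append(segments)
    return outputs
```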
arXiv Detail & Related papers (2023-09-07T17:59:41Z) - Per-Clip Video Object Segmentation [110.08925274049409]
Recently, memory-based approaches have shown promising results on semi-supervised video object segmentation.
We treat video object segmentation as clip-wise mask propagation.
We propose a new method tailored for per-clip inference.
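Read literally, clip-wise mask propagation predicts every mask in a clip from a memory that stays fixed within the clip and is refreshed only at clip boundaries, rather than after each frame. A minimal sketch of that reading follows; `propagate` is a hypothetical memory-matching network, not the paper's implementation.

```python
def per_clip_propagation(frames, first_mask, propagate, clip_len=5):
    """Sketch of clip-wise mask propagation for semi-supervised VOS: the
    memory is held fixed within each clip (so frames inside a clip can be
    processed in parallel) and updated once per clip instead of per frame."""
    memory = [(frames[0], first_mask)]  # (frame, mask) reference pairs
    masks = [first_mask]
    for start in range(1, len(frames), clip_len):
        clip = frames[start:start + clip_len]
        clip_masks = [propagate(memory, f) for f in clip]  # fixed memory
        memory.append((clip[-1], clip_masks[-1]))          # refresh per clip
        masks.extend(clip_masks)
    return masks
```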
arXiv Detail & Related papers (2022-08-03T09:02:29Z) - One-stage Video Instance Segmentation: From Frame-in Frame-out to
Clip-in Clip-out [15.082477136581153]
We propose a clip-in clip-out (CiCo) framework to exploit temporal information in video clips.
The CiCo strategy is free of inter-frame alignment and can be easily embedded into existing FiFo-based VIS approaches.
The two new one-stage VIS models achieve 37.1/37.3%, 35.2/35.4%, and 17.2/18.0% mask AP.
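The FiFo/CiCo contrast is easiest to see side by side. Both functions below are illustrative only, with `frame_model` and `clip_model` as hypothetical callables.

```python
def fifo_inference(frames, frame_model):
    """Frame-in frame-out: each frame is segmented independently, so any
    temporal association must happen outside the network."""
    return [frame_model(f) for f in frames]

def cico_inference(frames, clip_model, clip_len=4):
    """Clip-in clip-out: the network consumes a whole clip and emits masks
    for all of its frames jointly, exploiting temporal information inside
    the model rather than in post-hoc matching."""
    outputs = []
    for start in range(0, len(frames), clip_len):
        outputs.extend(clip_model(frames[start:start + clip_len]))
    return outputs
```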
arXiv Detail & Related papers (2022-03-12T12:23:21Z) - Mask2Former for Video Instance Segmentation [172.10001340104515]
Mask2Former achieves state-of-the-art performance on video instance segmentation without modifying the architecture, the loss, or even the training pipeline.
We show universal image segmentation architectures trivially generalize to video segmentation by directly predicting 3D segmentation volumes.
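Predicting a 3D volume means each object query emits a single space-time mask, so the per-frame masks of an instance are associated by construction. The tensor sketch below conveys the idea; shapes and names are illustrative assumptions, not the paper's code.

```python
import torch

def predict_3d_masks(query_embed, pixel_feats):
    """Each of N object queries yields one (T, H, W) mask for the whole clip
    via a dot product with per-pixel features, so a query tracks the same
    instance across frames. query_embed: (N, D); pixel_feats: (D, T, H, W)."""
    logits = torch.einsum("nd,dthw->nthw", query_embed, pixel_feats)
    return logits.sigmoid()  # (N, T, H, W) space-time instance masks

# Example: 10 queries over a 4-frame clip with 64x64 feature maps.
masks = predict_3d_masks(torch.randn(10, 256), torch.randn(256, 4, 64, 64))
```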
arXiv Detail & Related papers (2021-12-20T18:59:59Z) - Beyond Short Clips: End-to-End Video-Level Learning with Collaborative
Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
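One way to picture the mechanism: features from several clips of the same video are pooled into a shared memory, and every clip's prediction is conditioned on that memory, all inside one backward pass. The sketch below is an illustrative assumption (mean pooling, feature concatenation), not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def collaborative_memory_step(model, classifier, clips, labels, optimizer):
    """Hypothetical sketch of a collaborative-memory training step. `clips`
    is a list of (B, C, T, H, W) tensors sampled from the same B videos;
    `model` maps a clip to (B, D) features; `classifier` maps (B, 2D) to
    class logits. The fusion here (mean pooling + concat) is an assumption."""
    feats = [model(c) for c in clips]    # per-clip features, each (B, D)
    memory = torch.stack(feats).mean(0)  # shared video-level memory (B, D)
    loss = sum(F.cross_entropy(classifier(torch.cat([f, memory], dim=-1)),
                               labels) for f in feats) / len(feats)
    optimizer.zero_grad()
    loss.backward()  # gradients flow through all clips jointly (end-to-end)
    optimizer.step()
    return loss.item()
```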
arXiv Detail & Related papers (2021-04-02T18:59:09Z) - Video Panoptic Segmentation [117.08520543864054]
We propose and explore a new video extension of the panoptic segmentation task, called video panoptic segmentation.
To invigorate research on this new task, we present two types of video panoptic datasets.
We propose a novel video panoptic segmentation network (VPSNet) which jointly predicts object classes, bounding boxes, masks, instance id tracking, and semantic segmentation in video frames.
arXiv Detail & Related papers (2020-06-19T19:35:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.