Improving Video Instance Segmentation via Temporal Pyramid Routing
- URL: http://arxiv.org/abs/2107.13155v1
- Date: Wed, 28 Jul 2021 03:57:12 GMT
- Title: Improving Video Instance Segmentation via Temporal Pyramid Routing
- Authors: Xiangtai Li, Hao He, Henghui Ding, Kuiyuan Yang, Guangliang Cheng,
Jianping Shi, Yunhai Tong
- Abstract summary: Video Instance (VIS) is a new and inherently multi-task problem, which aims to detect, segment and track each instance in a video sequence.
We propose a Temporal Pyramid Routing (TPR) strategy to conditionally align and conduct pixel-level aggregation from a feature pyramid pair of two adjacent frames.
Our approach is a plug-and-play module and can be easily applied to existing instance segmentation methods.
- Score: 61.10753640148878
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Instance Segmentation (VIS) is a new and inherently multi-task problem,
which aims to detect, segment and track each instance in a video sequence.
Existing approaches are mainly based on single-frame features or single-scale
features of multiple frames, where temporal information or multi-scale
information is ignored. To incorporate both temporal and scale information, we
propose a Temporal Pyramid Routing (TPR) strategy to conditionally align and
conduct pixel-level aggregation from a feature pyramid pair of two adjacent
frames. Specifically, TPR contains two novel components, including Dynamic
Aligned Cell Routing (DACR) and Cross Pyramid Routing (CPR), where DACR is
designed for aligning and gating pyramid features across temporal dimension,
while CPR transfers temporally aggregated features across scale dimension.
Moreover, our approach is a plug-and-play module and can be easily applied to
existing instance segmentation methods. Extensive experiments on YouTube-VIS
dataset demonstrate the effectiveness and efficiency of the proposed approach
on several state-of-the-art instance segmentation methods. Codes and trained
models will be publicly available to facilitate future
research.(\url{https://github.com/lxtGH/TemporalPyramidRouting}).
Related papers
- Temporally Consistent Referring Video Object Segmentation with Hybrid Memory [98.80249255577304]
We propose an end-to-end R-VOS paradigm that explicitly models temporal consistency alongside the referring segmentation.
Features of frames with automatically generated high-quality reference masks are propagated to segment remaining frames.
Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin.
arXiv Detail & Related papers (2024-03-28T13:32:49Z) - Multi-grained Temporal Prototype Learning for Few-shot Video Object
Segmentation [156.4142424784322]
Few-Shot Video Object (FSVOS) aims to segment objects in a query video with the same category defined by a few annotated support images.
We propose to leverage multi-grained temporal guidance information for handling the temporal correlation nature of video data.
Our proposed video IPMT model significantly outperforms previous models on two benchmark datasets.
arXiv Detail & Related papers (2023-09-20T09:16:34Z) - UMMAFormer: A Universal Multimodal-adaptive Transformer Framework for
Temporal Forgery Localization [16.963092523737593]
We propose a novel framework for temporal forgery localization (TFL) that predicts forgery segments with multimodal adaptation.
Our approach achieves state-of-the-art performance on benchmark datasets, including Lav-DF, TVIL, and Psynd.
arXiv Detail & Related papers (2023-08-28T08:20:30Z) - Task-Specific Alignment and Multiple Level Transformer for Few-Shot
Action Recognition [11.700737340560796]
In recent years, some works have used the Transformer to deal with frames, then get the attention feature and the enhanced prototype, and the results are competitive.
We address these problems through an end-to-end method named "Task-Specific Alignment and Multiple-level Transformer Network (TSA-MLT)"
Our method achieves state-of-the-art results on the HMDB51 and UCF101 datasets and a competitive result on the benchmark of Kinetics and something 2-something V2 datasets.
arXiv Detail & Related papers (2023-07-05T02:13:25Z) - STC: Spatio-Temporal Contrastive Learning for Video Instance
Segmentation [47.28515170195206]
Video Instance (VIS) is a task that simultaneously requires classification, segmentation, and instance association in a video.
Recent VIS approaches rely on sophisticated pipelines to achieve this goal, including RoI-related operations or 3D convolutions.
We present a simple and efficient single-stage VIS framework based on the instance segmentation method ConInst.
arXiv Detail & Related papers (2022-02-08T09:34:26Z) - Prototypical Cross-Attention Networks for Multiple Object Tracking and
Segmentation [95.74244714914052]
Multiple object tracking and segmentation requires detecting, tracking, and segmenting objects belonging to a set of given classes.
We propose Prototypical Cross-Attention Network (PCAN), capable of leveraging rich-temporal information online.
PCAN outperforms current video instance tracking and segmentation competition winners on Youtube-VIS and BDD100K datasets.
arXiv Detail & Related papers (2021-06-22T17:57:24Z) - Video Instance Segmentation with a Propose-Reduce Paradigm [68.59137660342326]
Video instance segmentation (VIS) aims to segment and associate all instances of predefined classes for each frame in videos.
Prior methods usually obtain segmentation for a frame or clip first, and then merge the incomplete results by tracking or matching.
We propose a new paradigm -- Propose-Reduce, to generate complete sequences for input videos by a single step.
arXiv Detail & Related papers (2021-03-25T10:58:36Z) - End-to-End Video Instance Segmentation with Transformers [84.17794705045333]
Video instance segmentation (VIS) is the task that requires simultaneously classifying, segmenting and tracking object instances of interest in video.
Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem.
For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy.
arXiv Detail & Related papers (2020-11-30T02:03:50Z) - STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos [17.232631075144592]
Methods for instance segmentation in videos typically follow the tracking-by-detection paradigm.
We propose a novel approach that segments and tracks instances across space and time in a single stage.
Our method achieves state-of-the-art results across multiple datasets and tasks.
arXiv Detail & Related papers (2020-03-18T18:40:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.