Time-Space Transformers for Video Panoptic Segmentation
- URL: http://arxiv.org/abs/2210.03546v1
- Date: Fri, 7 Oct 2022 13:30:11 GMT
- Title: Time-Space Transformers for Video Panoptic Segmentation
- Authors: Andra Petrovai and Sergiu Nedevschi
- Abstract summary: We propose a solution that simultaneously predicts pixel-level semantic and clip-level instance segmentation.
Our network, named VPS-Transformer, combines a convolutional architecture for single-frame panoptic segmentation and a video module based on an instantiation of a pure Transformer block.
- Score: 3.2489082010225494
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a novel solution for the task of video panoptic segmentation that
simultaneously predicts pixel-level semantic and instance segmentation and
generates clip-level instance tracks. Our network, named VPS-Transformer, with
a hybrid architecture based on the state-of-the-art panoptic segmentation
network Panoptic-DeepLab, combines a convolutional architecture for
single-frame panoptic segmentation and a novel video module based on an
instantiation of the pure Transformer block. The Transformer, equipped with
attention mechanisms, models spatio-temporal relations between backbone output
features of current and past frames for more accurate and consistent panoptic
estimates. As the pure Transformer block introduces a large computation overhead
when processing high-resolution images, we propose a few design changes for
more efficient computation. We study how to aggregate information more effectively
over the space-time volume and we compare several variants of the Transformer
block with different attention schemes. Extensive experiments on the
Cityscapes-VPS dataset demonstrate that our best model improves the temporal
consistency and video panoptic quality by a margin of 2.2%, with little extra
computation.
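As a rough illustration of the video module described in the abstract, below is a minimal sketch, assuming PyTorch, of a Transformer block in which current-frame backbone features attend to features from the current and past frames. The module names, dimensions, the omission of positional embeddings, and the spatial downsampling of keys and values are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SpaceTimeAttentionBlock(nn.Module):
    """Sketch: current-frame features attend to current + past frame features."""
    def __init__(self, dim=256, num_heads=8, kv_stride=2):
        super().__init__()
        # Downsample keys/values spatially to keep attention affordable on
        # high-resolution feature maps (an assumed efficiency measure).
        self.kv_pool = nn.AvgPool2d(kernel_size=kv_stride, stride=kv_stride)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_out = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, feats):
        # feats: (B, T, C, H, W) backbone features; index -1 is the current frame.
        # Positional embeddings are omitted for brevity.
        b, t, c, h, w = feats.shape
        q = feats[:, -1].flatten(2).transpose(1, 2)            # (B, H*W, C)
        kv = self.kv_pool(feats.flatten(0, 1))                 # (B*T, C, H/s, W/s)
        kv = kv.flatten(2).transpose(1, 2).reshape(b, -1, c)   # (B, T*H*W/s^2, C)
        attn_out, _ = self.attn(self.norm_q(q), self.norm_kv(kv), self.norm_kv(kv))
        x = q + attn_out                                       # residual connection
        x = x + self.mlp(self.norm_out(x))                     # feed-forward + residual
        return x.transpose(1, 2).reshape(b, c, h, w)           # refined current-frame features

# Example: current frame plus two past frames of 256-channel features at 64x128.
block = SpaceTimeAttentionBlock(dim=256)
video_feats = torch.randn(1, 3, 256, 64, 128)
refined = block(video_feats)  # (1, 256, 64, 128)
```

The refined current-frame features would then feed the usual panoptic heads; the pooling stride trades attention cost against spatial detail in the temporal context.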
Related papers
- Video Quality Assessment Based on Swin TransformerV2 and Coarse to Fine
Strategy [16.436012370209845]
The objective of non-reference quality assessment is to evaluate the quality of distorted video without access to high-definition references.
In this study, we introduce an enhanced spatial perception module, pre-trained on multiple image quality assessment datasets, and a lightweight temporal fusion module.
arXiv Detail & Related papers (2024-01-16T17:33:54Z) - Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation [59.91357714415056]
We propose two Transformer variants: Context-Sharing Transformer (CST) and Semantic Gathering-Scattering Transformer (SGST).
CST learns the global-shared contextual information within image frames with a lightweight computation; SGST models the semantic correlation separately for the foreground and background.
Compared with the baseline that uses vanilla Transformers for multi-stage fusion, ours increases the speed by 13 times and achieves new state-of-the-art ZVOS performance.
arXiv Detail & Related papers (2023-08-13T06:12:00Z) - Dual-path Adaptation from Image to Video Transformers [62.056751480114784]
We efficiently transfer the surpassing representation power of the vision foundation models, such as ViT and Swin, for video understanding with only a few trainable parameters.
We propose a novel DualPath adaptation separated into spatial and temporal adaptation paths, where a lightweight bottleneck adapter is employed in each transformer block.
arXiv Detail & Related papers (2023-03-17T09:37:07Z) - Video-TransUNet: Temporally Blended Vision Transformer for CT VFSS
Instance Segmentation [11.575821326313607]
We propose Video-TransUNet, a deep architecture for segmentation in medical CT videos constructed by integrating temporal feature blending into the TransUNet deep learning framework.
In particular, our approach amalgamates strong frame representation via a ResNet CNN backbone, multi-frame feature blending via a Temporal Context Module, and reconstructive capabilities for multiple targets via a UNet-based convolutional-deconvolutional architecture with multiple heads.
arXiv Detail & Related papers (2022-08-17T14:28:58Z) - Video Frame Interpolation with Transformer [55.12620857638253]
We introduce a novel framework, which takes advantage of Transformer to model long-range pixel correlation among video frames.
Our network is equipped with a novel cross-scale window-based attention mechanism, where cross-scale windows interact with each other.
arXiv Detail & Related papers (2022-05-15T09:30:28Z) - Self-supervised Video Transformer [46.295395772938214]
From a given video, we create local and global views with varying spatial sizes and frame rates.
Our self-supervised objective seeks to match the features of different views representing the same video, making the representation invariant to spatiotemporal variations.
Our approach performs well on four action benchmarks and converges faster with small batch sizes.
arXiv Detail & Related papers (2021-12-02T18:59:02Z) - Video Frame Interpolation Transformer [86.20646863821908]
We propose a Transformer-based video framework that allows content-aware aggregation weights and considers long-range dependencies with the self-attention operations.
To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video.
In addition, we develop a multi-scale frame scheme to fully realize the potential of Transformers.
arXiv Detail & Related papers (2021-11-27T05:35:10Z) - PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered the solution of vision tasks with transformers; it directly translates the image feature map into the object detection result.
Experiments with the recent transformer-based image recognition model ViT also show a consistent efficiency gain.
arXiv Detail & Related papers (2021-09-15T01:10:30Z) - Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par with or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent (a minimal sketch of such a contrastive objective follows this list).
arXiv Detail & Related papers (2021-06-17T02:30:26Z) - SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation [24.884078497381633]
We introduce a Transformer-based approach to video object segmentation (VOS).
Our attention-based approach allows a model to learn to attend over a history of features from multiple frames.
Our method achieves competitive results on YouTube-VOS and DAVIS 2017 with improved scalability and robustness compared with the state of the art.
arXiv Detail & Related papers (2021-01-21T20:06:12Z)
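As referenced in the Long-Short Temporal Contrastive Learning entry above, the following is a minimal sketch, assuming PyTorch, of an InfoNCE-style objective that matches a short-clip embedding to the embedding of a longer clip from the same video, with the other videos in the batch serving as negatives. The function name, embedding dimensions, and temperature are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def long_short_contrastive_loss(short_emb, long_emb, temperature=0.1):
    """InfoNCE-style loss: each short-clip embedding should be most similar
    to the long-clip embedding of the same video; other clips in the batch
    act as negatives. short_emb, long_emb: (B, D) clip-level embeddings."""
    short_emb = F.normalize(short_emb, dim=-1)
    long_emb = F.normalize(long_emb, dim=-1)
    logits = short_emb @ long_emb.t() / temperature           # (B, B) cosine similarities
    targets = torch.arange(short_emb.size(0), device=short_emb.device)
    return F.cross_entropy(logits, targets)

# Example with random clip-level embeddings standing in for the outputs of a
# (hypothetical) video transformer applied to short and long clips of the same videos.
short_feats = torch.randn(8, 256)
long_feats = torch.randn(8, 256)
loss = long_short_contrastive_loss(short_feats, long_feats)
```

A symmetric term (long predicting short) or a momentum encoder could be added; the core idea is simply aligning representations across different temporal extents.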