Video Instance Segmentation using Inter-Frame Communication Transformers
- URL: http://arxiv.org/abs/2106.03299v1
- Date: Mon, 7 Jun 2021 02:08:39 GMT
- Title: Video Instance Segmentation using Inter-Frame Communication Transformers
- Authors: Sukjun Hwang, Miran Heo, Seoung Wug Oh, Seon Joo Kim
- Abstract summary: Recently, the per-clip pipeline has shown superior performance over per-frame methods.
Previous per-clip models, however, require heavy computation and memory usage to achieve frame-to-frame communication.
We propose Inter-frame Communication Transformers (IFC), which significantly reduce the overhead of passing information between frames.
- Score: 28.539742250704695
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel end-to-end solution for video instance segmentation (VIS) based on transformers. Recently, the per-clip pipeline has shown superior performance over per-frame methods by leveraging richer information from multiple frames. However, previous per-clip models require heavy computation and memory usage to achieve frame-to-frame communication, limiting their practicality. In this work, we propose Inter-frame Communication Transformers (IFC), which significantly reduce the overhead of passing information between frames by efficiently encoding the context within the input clip. Specifically, we propose to utilize concise memory tokens as a means of conveying information as well as summarizing the scene of each frame. The features of each frame are enriched and correlated with those of other frames through the exchange of information between the precisely encoded memory tokens. We validate our method on the latest benchmarks and achieve state-of-the-art performance (AP 44.6 on the YouTube-VIS 2019 val set using offline inference) while maintaining a considerably fast runtime (89.4 FPS). Our method can also be applied to near-online inference for processing a video in real time with only a small delay. The code will be made available.
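To make the memory-token idea above more concrete, below is a minimal PyTorch sketch, not the authors' implementation: the module names, the number of tokens, and the three-step attention layout are illustrative assumptions. Per-frame memory tokens summarize their own frame, tokens from all frames exchange information, and the frame features are then refined by attending back to the exchanged tokens; the cross-frame step stays cheap because the number of tokens is far smaller than the number of spatial positions.

```python
# Minimal sketch of inter-frame communication via memory tokens (assumed layout, not the paper's code).
import torch
import torch.nn as nn


class InterFrameCommunicationSketch(nn.Module):
    """Illustrative module: per-frame memory tokens summarize each frame,
    are exchanged across frames with attention, and then refine the frame features."""

    def __init__(self, dim=256, num_tokens=8, num_heads=8):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_tokens, dim))  # shared initial tokens
        self.summarize = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.communicate = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.distribute = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_feats):                               # (T, HW, C): one row per frame
        T, HW, C = frame_feats.shape
        tokens = self.memory.unsqueeze(0).expand(T, -1, -1)       # (T, M, C)

        # 1) Each frame's tokens summarize that frame (tokens attend to frame features).
        tokens, _ = self.summarize(tokens, frame_feats, frame_feats)

        # 2) Tokens from all frames exchange information (attention over T*M tokens),
        #    which is cheap because M << HW.
        all_tokens = tokens.reshape(1, T * tokens.shape[1], C)
        all_tokens, _ = self.communicate(all_tokens, all_tokens, all_tokens)
        tokens = all_tokens.reshape(T, -1, C)

        # 3) Frame features are enriched by attending back to the exchanged tokens.
        frame_feats, _ = self.distribute(frame_feats, tokens, tokens)
        return frame_feats, tokens


# Toy usage: a clip of 5 frames, a 30x40 feature map flattened to 1200 positions, 256 channels.
if __name__ == "__main__":
    clip = torch.randn(5, 1200, 256)
    refined, mem = InterFrameCommunicationSketch()(clip)
    print(refined.shape, mem.shape)  # torch.Size([5, 1200, 256]) torch.Size([5, 8, 256])
```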
Related papers
- Space-time Reinforcement Network for Video Object Segmentation [16.67780344875854]
Video object segmentation (VOS) networks typically use memory-based methods.
These methods suffer from two issues: 1) Challenging data can destroy the space-time coherence between adjacent video frames, and 2) Pixel-level matching will lead to undesired mismatching.
In this paper, we propose to generate an auxiliary frame between adjacent frames, serving as an implicit short-temporal reference for the query frame.
arXiv Detail & Related papers (2024-05-07T06:26:30Z) - A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale, image-text models to video via shallow temporal fusion.
We expose two limitations of the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z) - Aggregating Long-term Sharp Features via Hybrid Transformers for Video Deblurring [76.54162653678871]
We propose a video deblurring method that leverages both neighboring frames and present sharp frames using hybrid Transformers for feature aggregation.
Our proposed method outperforms state-of-the-art video deblurring methods as well as event-driven video deblurring methods in terms of quantitative metrics and visual quality.
arXiv Detail & Related papers (2023-09-13T16:12:11Z) - Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation [54.58405154065508]
We propose MUTR, a Multi-modal Unified Temporal transformer for Referring video object segmentation.
For the first time in a unified framework, MUTR adopts a DETR-style transformer and is capable of segmenting video objects designated by either a text or an audio reference.
For high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, contributing to better object-wise correspondence for tracking along the video.
arXiv Detail & Related papers (2023-05-25T17:59:47Z) - Per-Clip Video Object Segmentation [110.08925274049409]
Recently, memory-based approaches have shown promising results on semi-supervised video object segmentation.
We treat video object segmentation as clip-wise mask propagation.
We propose a new method tailored for the per-clip inference.
arXiv Detail & Related papers (2022-08-03T09:02:29Z) - Efficient Video Instance Segmentation via Tracklet Query and Proposal [62.897552852894854]
Video Instance Segmentation aims to simultaneously classify, segment, and track multiple object instances in videos.
Most clip-level methods are neither end-to-end learnable nor real-time.
This paper proposes EfficientVIS, a fully end-to-end framework with efficient training and inference.
arXiv Detail & Related papers (2022-03-03T17:00:11Z) - No frame left behind: Full Video Action Recognition [26.37329995193377]
We propose full video action recognition and consider all video frames.
We first cluster all frame activations along the temporal dimension.
We then temporally aggregate the frames in the clusters into a smaller number of representations.
arXiv Detail & Related papers (2021-03-29T07:44:28Z) - Frame-To-Frame Consistent Semantic Segmentation [2.538209532048867]
We train a convolutional neural network (CNN) which propagates features through consecutive frames in a video.
Our results indicate that the added temporal information produces a frame-to-frame consistent and more accurate image understanding.
arXiv Detail & Related papers (2020-08-03T15:28:40Z) - Efficient Semantic Video Segmentation with Per-frame Inference [117.97423110566963]
In this work, we perform semantic video segmentation efficiently in a per-frame fashion during inference.
We employ compact models for real-time execution. To narrow the performance gap between compact models and large models, new knowledge distillation methods are designed.
arXiv Detail & Related papers (2020-02-26T12:24:32Z)