GRAtt-VIS: Gated Residual Attention for Auto Rectifying Video Instance
Segmentation
- URL: http://arxiv.org/abs/2305.17096v1
- Date: Fri, 26 May 2023 17:10:24 GMT
- Title: GRAtt-VIS: Gated Residual Attention for Auto Rectifying Video Instance
Segmentation
- Authors: Tanveer Hannan, Rajat Koner, Maximilian Bernhard, Suprosanna Shit,
Bjoern Menze, Volker Tresp, Matthias Schubert, Thomas Seidl
- Abstract summary: Recent trends in Video Instance Segmentation (VIS) have seen a growing reliance on online methods to model complex and lengthy video sequences.
Degradation of representation and noise accumulation pose substantial challenges.
We introduce GRAtt-VIS, Gated Residual Attention for Video Instance Segmentation.
- Score: 20.70044082417488
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent trends in Video Instance Segmentation (VIS) have seen a growing
reliance on online methods to model complex and lengthy video sequences.
However, the degradation of representation and noise accumulation of the online
methods, especially during occlusion and abrupt changes, pose substantial
challenges. Transformer-based query propagation provides promising directions
at the cost of quadratic memory attention. However, they are susceptible to the
degradation of instance features due to the above-mentioned challenges and
suffer from cascading effects. The detection and rectification of such errors
remain largely underexplored. To this end, we introduce \textbf{GRAtt-VIS},
\textbf{G}ated \textbf{R}esidual \textbf{Att}ention for \textbf{V}ideo
\textbf{I}nstance \textbf{S}egmentation. Firstly, we leverage a
Gumbel-Softmax-based gate to detect possible errors in the current frame. Next,
based on the gate activation, we rectify degraded features using their past
representations. Such a residual configuration alleviates the need for dedicated
memory and provides a continuous stream of relevant instance features.
Secondly, we propose a novel inter-instance interaction using gate activation
as a mask for self-attention. This masking strategy dynamically restricts the
unrepresentative instance queries in the self-attention and preserves vital
information for long-term tracking. We refer to this novel combination of Gated
Residual Connection and Masked Self-Attention as \textbf{GRAtt} block, which
can easily be integrated into the existing propagation-based framework.
Further, GRAtt blocks significantly reduce the attention overhead and simplify
dynamic temporal modeling. GRAtt-VIS achieves state-of-the-art performance on
YouTube-VIS and the highly challenging OVIS dataset, significantly improving
over previous methods. Code is available at
\url{https://github.com/Tanveer81/GRAttVIS}.
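The abstract only outlines the mechanism, so the following is a minimal, illustrative PyTorch-style sketch of what a GRAtt block could look like. The module name, feature dimensions, straight-through Gumbel gating, and the reuse of the gate as a key-padding mask are assumptions made for illustration, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GRAttBlockSketch(nn.Module):
    """Illustrative sketch of a gated residual attention block:
    a Gumbel-Softmax gate decides per instance query whether the
    current-frame feature is kept or rectified from the past frame,
    and the same gate masks inter-instance self-attention."""

    def __init__(self, dim: int = 256, num_heads: int = 8, tau: float = 1.0):
        super().__init__()
        self.tau = tau
        # Two logits per query: [keep current frame, fall back to past frame].
        self.gate_proj = nn.Linear(dim, 2)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, curr_q: torch.Tensor, past_q: torch.Tensor) -> torch.Tensor:
        # curr_q, past_q: (batch, num_instances, dim) instance queries.
        # 1) Gumbel-Softmax gate: differentiable, near-discrete decision on
        #    whether each query is still reliable in the current frame.
        gate = F.gumbel_softmax(self.gate_proj(curr_q), tau=self.tau, hard=True)
        keep = gate[..., :1]  # 1 -> trust current frame, 0 -> degraded

        # 2) Gated residual connection: degraded queries are rectified by
        #    re-using their representation from the previous frame.
        rectified = keep * curr_q + (1.0 - keep) * past_q

        # 3) Masked self-attention: gated-off queries are excluded as keys,
        #    so unrepresentative instances cannot pollute the others.
        #    (In practice one would guard against masking every query.)
        key_padding_mask = keep.squeeze(-1) < 0.5  # True = ignore this query
        attn_out, _ = self.self_attn(
            rectified, rectified, rectified, key_padding_mask=key_padding_mask
        )
        return self.norm(rectified + attn_out)


if __name__ == "__main__":
    block = GRAttBlockSketch()
    curr = torch.randn(2, 10, 256)   # 2 clips, 10 instance queries each
    past = torch.randn(2, 10, 256)
    print(block(curr, past).shape)   # torch.Size([2, 10, 256])
```

Because the residual path reuses only the previous frame's queries, no dedicated memory bank is needed, which matches the memory argument made in the abstract.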
Related papers
- DIFFVSGG: Diffusion-Driven Online Video Scene Graph Generation [61.59996525424585]
DIFFVSGG is an online VSGG solution that frames this task as an iterative scene graph update problem.
We unify the decoding of three tasks, namely object classification, bounding box regression, and graph generation, using one shared feature embedding.
DIFFVSGG further facilitates continuous temporal reasoning, where predictions for subsequent frames leverage results of past frames as the conditional inputs of LDMs.
arXiv Detail & Related papers (2025-03-18T06:49:51Z)
- Improving Weakly-supervised Video Instance Segmentation by Leveraging Spatio-temporal Consistency [9.115508086522887]
We introduce a weakly-supervised method called Eigen VIS that achieves competitive accuracy compared to other VIS approaches.
This method is based on two key innovations: a Temporal Eigenvalue Loss (TEL) and a clip-level Quality Co-efficient (QCC).
The code is available on https://github.com/farnooshar/EigenVIS.
arXiv Detail & Related papers (2024-08-29T16:05:05Z)
- DVIS-DAQ: Improving Video Segmentation via Dynamic Anchor Queries [60.09774333024783]
We introduce Dynamic Anchor Queries (DAQ) to shorten the transition gap between the anchor and target queries.
We also introduce a query-level object Emergence and Disappearance Simulation (EDS) strategy, which unleashes DAQ's potential without any additional cost.
Experiments demonstrate that DVIS-DAQ achieves a new state-of-the-art (SOTA) performance on five mainstream video segmentation benchmarks.
arXiv Detail & Related papers (2024-03-29T17:58:50Z)
- Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation [76.68301884987348]
We propose a simple yet effective approach for self-supervised video object segmentation (VOS).
Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal segmentation correspondences in videos.
Our method demonstrates state-of-the-art performance across multiple unsupervised VOS benchmarks and excels in complex real-world multi-object video segmentation tasks.
arXiv Detail & Related papers (2023-11-29T18:47:17Z)
- DVIS: Decoupled Video Instance Segmentation Framework [15.571072365208872]
Video instance segmentation (VIS) is a critical task with diverse applications, including autonomous driving and video editing.
Existing methods often underperform on complex and long videos in real world, primarily due to two factors.
We propose a decoupling strategy for VIS by dividing it into three independent sub-tasks: segmentation, tracking, and refinement.
arXiv Detail & Related papers (2023-06-06T05:24:15Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Mask-Free Video Instance Segmentation [102.50936366583106]
Video masks are tedious and expensive to annotate, limiting the scale and diversity of existing VIS datasets.
We propose MaskFreeVIS, achieving highly competitive VIS performance, while only using bounding box annotations for the object state.
Our TK-Loss finds one-to-many matches across frames, through an efficient patch-matching step followed by a K-nearest neighbor selection.
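The patch-matching and K-nearest-neighbour steps are only named above; below is a rough, self-contained PyTorch sketch of how such a temporal KNN-patch consistency term could be computed. The function name, cosine-distance matching, distance threshold, and L1 agreement penalty are illustrative assumptions, not MaskFreeVIS's exact TK-Loss.

```python
import torch
import torch.nn.functional as F


def tk_style_consistency_loss(
    feat_t: torch.Tensor,      # (C, H, W) patch features of frame t
    feat_tp1: torch.Tensor,    # (C, H, W) patch features of frame t+1
    mask_t: torch.Tensor,      # (H, W) predicted mask probabilities, frame t
    mask_tp1: torch.Tensor,    # (H, W) predicted mask probabilities, frame t+1
    k: int = 5,
    dist_thresh: float = 0.3,
) -> torch.Tensor:
    """Rough sketch of a temporal KNN-patch consistency term: each location
    in frame t is matched to its K most similar locations in frame t+1,
    unreliable matches are dropped by a distance threshold, and matched
    locations are encouraged to share the same mask value."""
    C, H, W = feat_t.shape
    f_t = F.normalize(feat_t.flatten(1).t(), dim=1)      # (H*W, C)
    f_tp1 = F.normalize(feat_tp1.flatten(1).t(), dim=1)  # (H*W, C)

    # Pairwise cosine distance between all patch locations of the two frames.
    dist = 1.0 - f_t @ f_tp1.t()                          # (H*W, H*W)

    # One-to-many matching: K nearest neighbours per frame-t location.
    topk_dist, topk_idx = dist.topk(k, dim=1, largest=False)
    valid = topk_dist < dist_thresh                       # drop poor matches

    m_t = mask_t.flatten().unsqueeze(1).expand(-1, k)     # (H*W, k)
    m_tp1 = mask_tp1.flatten()[topk_idx]                  # (H*W, k)

    # Matched patches should carry consistent mask probabilities.
    consistency = (m_t - m_tp1).abs()
    return (consistency * valid).sum() / valid.sum().clamp(min=1)
```

For real feature maps one would typically restrict the search to a local window around each location rather than matching against every position, which keeps the distance matrix small.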
arXiv Detail & Related papers (2023-03-28T11:48:07Z)
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
- InstanceFormer: An Online Video Instance Segmentation Framework [21.760243214387987]
We propose a single-stage transformer-based efficient online VIS framework named InstanceFormer.
We propose three novel components to model short-term and long-term dependency and temporal coherence.
The proposed InstanceFormer outperforms previous online benchmark methods by a large margin across multiple datasets.
arXiv Detail & Related papers (2022-08-22T18:54:18Z)
- Video Mask Transfiner for High-Quality Video Instance Segmentation [102.50936366583106]
Video Mask Transfiner (VMT) is capable of leveraging fine-grained high-resolution features thanks to a highly efficient video transformer structure.
Based on our VMT architecture, we design an automated annotation refinement approach by iterative training and self-correction.
We compare VMT with the most recent state-of-the-art methods on the HQ-YTVIS, as well as the Youtube-VIS, OVIS and BDD100K MOTS.
arXiv Detail & Related papers (2022-07-28T11:13:37Z)
- MUNet: Motion Uncertainty-aware Semi-supervised Video Object Segmentation [31.100954335785026]
We advocate the return of motion information and propose a motion uncertainty-aware framework (MUNet) for semi-supervised video object segmentation.
We introduce a motion-aware spatial attention module to effectively fuse the motion feature with the semantic feature.
We achieve $76.5\%$ $\mathcal{J}\&\mathcal{F}$ using only DAVIS17 for training, which significantly outperforms the SOTA methods under the low-data protocol.
arXiv Detail & Related papers (2021-11-29T16:01:28Z)
- Spatiotemporal Graph Neural Network based Mask Reconstruction for Video Object Segmentation [70.97625552643493]
This paper addresses the task of segmenting class-agnostic objects in semi-supervised setting.
We propose a novel graph neural network (TG-Net) which captures the local contexts by utilizing all proposals.
arXiv Detail & Related papers (2020-12-10T07:57:44Z)
- Learn to cycle: Time-consistent feature discovery for action recognition [83.43682368129072]
Generalizing over temporal variations is a prerequisite for effective action recognition in videos.
We introduce Squeeze and Recursion Temporal Gates (SRTG), an approach that favors temporal activations with potential variations.
We show consistent improvement when using SRTG blocks, with only a minimal increase in the number of GFLOPs.
arXiv Detail & Related papers (2020-06-15T09:36:28Z)