Towards Robust Video Instance Segmentation with Temporal-Aware
Transformer
- URL: http://arxiv.org/abs/2301.09416v1
- Date: Fri, 20 Jan 2023 05:22:16 GMT
- Title: Towards Robust Video Instance Segmentation with Temporal-Aware
Transformer
- Authors: Zhenghao Zhang and Fangtao Shao and Zuozhuo Dai and Siyu Zhu
- Abstract summary: We propose TAFormer to aggregate-aware temporal features in encoder and decoder.
TAFormer effectively leverages the spatial and temporal information to obtain context-aware feature representation and outperforms state-of-the-art methods.
- Score: 12.81807735850422
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most existing transformer-based video instance segmentation methods
extract per-frame features independently, which makes it challenging to handle
appearance deformation. In this paper, we observe that temporal information is
important as well, and we propose TAFormer to aggregate spatio-temporal
features in both the transformer encoder and decoder. Specifically, in the
transformer encoder, we propose a novel spatio-temporal joint multi-scale
deformable attention module that dynamically integrates spatial and temporal
information to obtain enriched spatio-temporal features. In the transformer
decoder, we introduce a temporal self-attention module that enhances the
frame-level box queries with temporal relations. Moreover, TAFormer adopts an
instance-level contrastive loss to increase the discriminability of the
instance query embeddings, which reduces tracking errors caused by visually
similar instances. Experimental results show that TAFormer effectively
leverages spatial and temporal information to obtain context-aware feature
representations and outperforms state-of-the-art methods.
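
Below is a minimal sketch of the two decoder-side ideas in the abstract: a temporal self-attention over frame-level instance queries, and an InfoNCE-style instance-level contrastive loss. This is my illustration under assumed names and tensor layouts, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalQueryAttention(nn.Module):
    """Let each instance query attend to its counterparts in other frames."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries: torch.Tensor) -> torch.Tensor:
        # queries: (num_instances, num_frames, dim); treating instances as the
        # batch axis means attention mixes information along time only.
        out, _ = self.attn(queries, queries, queries)
        return self.norm(queries + out)

def instance_contrastive_loss(emb: torch.Tensor, temperature: float = 0.1):
    # emb: (num_instances, num_frames, dim). Embeddings of the same instance
    # in two frames form positive pairs; other instances are negatives.
    emb = F.normalize(emb, dim=-1)
    anchors, positives = emb[:, 0], emb[:, 1]
    logits = anchors @ positives.t() / temperature   # (n, n) similarities
    labels = torch.arange(emb.shape[0])              # diagonal = positives
    return F.cross_entropy(logits, labels)

queries = torch.randn(5, 4, 256)   # 5 instance tracks over 4 frames
refined = TemporalQueryAttention(256)(queries)
print(instance_contrastive_loss(refined).item())
```

The layout choice matters: by keeping one row per instance track, the self-attention cannot mix different instances, which matches the stated goal of enhancing each query with its own temporal relations.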
Related papers
- Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators [83.48423407316713]
We present a novel diffusion transformer framework incorporating an additional set of mediator tokens to engage with queries and keys separately.
Our model initiates the denoising process with a precise, non-ambiguous stage and gradually transitions to a phase enriched with detail.
Our method achieves a state-of-the-art FID score of 2.01 when integrated with the recent work SiT.
arXiv Detail & Related papers (2024-08-11T07:01:39Z)
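
To make the mediator-token idea concrete, here is a hedged sketch: queries attend to a small set of mediators, which in turn attend to the keys, cutting the quadratic query-key interaction down to O(N * m). The two-stage routing and all shapes are my reading of the abstract, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def mediator_attention(q, k, v, mediators):
    # q, k, v: (batch, N, dim); mediators: (batch, m, dim) with m << N.
    d = q.shape[-1] ** 0.5
    # Stage 1: mediators aggregate the keys/values into m summaries.
    med_summary = F.softmax(mediators @ k.transpose(1, 2) / d, dim=-1) @ v
    # Stage 2: queries read from the mediator summaries instead of all keys.
    return F.softmax(q @ mediators.transpose(1, 2) / d, dim=-1) @ med_summary

q = k = v = torch.randn(2, 1024, 64)
mediators = torch.randn(2, 16, 64)
print(mediator_attention(q, k, v, mediators).shape)  # (2, 1024, 64)
```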
- Multi-Scale Temporal Difference Transformer for Video-Text Retrieval [10.509598789325782]
We propose a transformer variant named the Multi-Scale Temporal Difference Transformer (MSTDT).
MSTDT mainly addresses a defect of the traditional transformer, namely its limited ability to capture local temporal information.
In general, our proposed MSTDT consists of a short-term multi-scale temporal difference transformer and a long-term temporal transformer.
arXiv Detail & Related papers (2024-06-23T13:59:31Z)
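
As a rough illustration of the short-term temporal-difference idea, the sketch below subtracts frame features at several temporal strides before they would be fed to a transformer; the stride set and zero-padding scheme are assumptions, not the released MSTDT code.

```python
import torch

def multi_scale_temporal_diff(feats, strides=(1, 2, 4)):
    # feats: (batch, num_frames, dim); returns one difference map per stride,
    # zeroed at the sequence start so the frame count stays unchanged.
    diffs = []
    for s in strides:
        shifted = torch.roll(feats, shifts=s, dims=1)
        d = feats - shifted
        d[:, :s] = 0.0  # rolled-in frames are invalid, mask them out
        diffs.append(d)
    return torch.cat(diffs, dim=-1)  # (batch, num_frames, dim * len(strides))

feats = torch.randn(2, 16, 128)
print(multi_scale_temporal_diff(feats).shape)  # (2, 16, 384)
```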
- Correlated Attention in Transformers for Multivariate Time Series [22.542109523780333]
We propose a novel correlated attention mechanism, which efficiently captures feature-wise dependencies and can be seamlessly integrated within the encoder blocks of existing Transformers.
In particular, correlated attention operates across feature channels to compute cross-covariance matrices between queries and keys with different lag values, and selectively aggregates representations at the sub-series level.
This architecture facilitates automated discovery and representation learning of not only instantaneous but also lagged cross-correlations, while inherently capturing time series auto-correlation.
arXiv Detail & Related papers (2023-11-20T17:35:44Z)
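
The lagged cross-covariance idea can be sketched as follows: attention scores come from channel-wise covariances between queries and time-shifted keys. The lag set, normalization, and aggregation-by-averaging are my assumptions rather than the paper's exact design.

```python
import torch
import torch.nn.functional as F

def correlated_attention(q, k, v, lags=(0, 1, 2)):
    # q, k, v: (batch, time, channels); attention acts across channels.
    b, t, c = q.shape
    outputs = []
    for lag in lags:
        k_lag = torch.roll(k, shifts=lag, dims=1)    # shift keys in time
        cov = q.transpose(1, 2) @ k_lag / t          # (b, c, c) cross-covariance
        attn = F.softmax(cov / c ** 0.5, dim=-1)
        outputs.append((attn @ v.transpose(1, 2)).transpose(1, 2))
    return torch.stack(outputs).mean(dim=0)          # aggregate over lags

q = k = v = torch.randn(4, 96, 32)
print(correlated_attention(q, k, v).shape)           # (4, 96, 32)
```

With lag 0 this reduces to a plain cross-covariance (feature-wise) attention; nonzero lags are what expose the lagged cross-correlations the abstract mentions.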
- UMMAFormer: A Universal Multimodal-adaptive Transformer Framework for Temporal Forgery Localization [16.963092523737593]
We propose a novel framework for temporal forgery localization (TFL) that predicts forgery segments with multimodal adaptation.
Our approach achieves state-of-the-art performance on benchmark datasets, including Lav-DF, TVIL, and Psynd.
arXiv Detail & Related papers (2023-08-28T08:20:30Z)
- Diagnostic Spatio-temporal Transformer with Faithful Encoding [54.02712048973161]
This paper addresses the task of anomaly diagnosis when the underlying data generation process has a complex spatio-temporal (ST) dependency.
We formalize the problem as supervised dependency discovery, where the ST dependency is learned as a side product of time-series classification.
We show that the temporal positional encoding used in existing ST transformer works has a serious limitation in capturing high frequencies (short time scales).
We also propose a new ST dependency discovery framework, which can provide readily consumable diagnostic information in both spatial and temporal directions.
arXiv Detail & Related papers (2023-05-26T05:31:23Z)
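
For context on the claimed limitation, the sketch below generates the standard sinusoidal positional encoding: its per-dimension frequencies decay geometrically from 1 toward 1/10000, so only a handful of dimensions oscillate on short time scales. This illustrates the baseline being criticized, not the paper's proposed encoding.

```python
import numpy as np

def sinusoidal_pe(num_steps: int, dim: int) -> np.ndarray:
    pos = np.arange(num_steps)[:, None]                    # (T, 1)
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))  # geometric decay
    angles = pos * freqs[None, :]                          # (T, dim/2)
    pe = np.zeros((num_steps, dim))
    pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)
    return pe

pe = sinusoidal_pe(256, 64)
print(pe.shape)  # (256, 64); most columns vary only slowly over time
```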
- PTSEFormer: Progressive Temporal-Spatial Enhanced TransFormer Towards Video Object Detection [28.879484515844375]
We introduce a progressive way of incorporating both temporal and spatial information for an integrated enhancement.
PTSEFormer follows an end-to-end fashion to avoid heavy post-processing procedures while achieving 88.1% mAP on the ImageNet VID dataset.
arXiv Detail & Related papers (2022-09-06T06:32:57Z)
- DisCoVQA: Temporal Distortion-Content Transformers for Video Quality Assessment [56.42140467085586]
Some temporal variations cause temporal distortions and lead to extra quality degradation.
The human visual system also attends differently to frames with different contents.
We propose a novel and effective transformer-based VQA method to tackle these two issues.
arXiv Detail & Related papers (2022-06-20T15:31:27Z)
- Video Frame Interpolation Transformer [86.20646863821908]
We propose a Transformer-based video frame interpolation framework that allows content-aware aggregation weights and considers long-range dependencies via self-attention operations.
To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video frame interpolation.
In addition, we develop a multi-scale frame scheme to fully realize the potential of Transformers.
arXiv Detail & Related papers (2021-11-27T05:35:10Z)
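
A minimal sketch of the local attention idea: self-attention restricted to non-overlapping windows along a 1-D token axis. The windowing scheme here is an assumption and is far simpler than the paper's spatio-temporal design.

```python
import torch
import torch.nn.functional as F

def local_attention(x: torch.Tensor, window: int = 8) -> torch.Tensor:
    # x: (batch, tokens, dim) with tokens divisible by window; attention is
    # computed independently inside each non-overlapping window.
    b, n, d = x.shape
    xw = x.view(b * n // window, window, d)        # fold windows into batch
    scores = xw @ xw.transpose(1, 2) / d ** 0.5    # (bw, window, window)
    out = F.softmax(scores, dim=-1) @ xw
    return out.view(b, n, d)

x = torch.randn(2, 64, 32)
print(local_attention(x).shape)  # (2, 64, 32)
```

Cost drops from O(n^2) to O(n * window) score entries, which is the motivation the summary gives for avoiding global self-attention.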
- CCVS: Context-aware Controllable Video Synthesis [95.22008742695772]
This paper introduces a self-supervised learning approach to the synthesis of new video clips from old ones.
It conditions the synthesis process on contextual information for temporal continuity and ancillary information for fine control.
arXiv Detail & Related papers (2021-07-16T17:57:44Z)
- Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation [79.98992138865042]
We present an augmented transformer with adaptive graph network (ATAG) to exploit both long-range and local temporal contexts for TAPG.
Specifically, we enhance the vanilla transformer by equipping it with a snippet actionness loss and a front block, dubbed the augmented transformer.
An adaptive graph convolutional network (GCN) is proposed to build local temporal context by mining the position information and difference between adjacent features.
arXiv Detail & Related papers (2021-03-30T02:01:03Z)
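
The adaptive-graph idea, building edges from positions and differences between adjacent snippet features, might look roughly like the sketch below; the Gaussian edge weighting and single graph-convolution layer are my construction, not the ATAG release.

```python
import torch
import torch.nn.functional as F

def adaptive_graph_conv(feats: torch.Tensor, weight: torch.Tensor):
    # feats: (num_snippets, dim). Edge strength between neighbours i and i+1
    # decays with the feature difference, so similar snippets couple strongly.
    n, _ = feats.shape
    diff = (feats[1:] - feats[:-1]).norm(dim=-1)   # (n-1,) adjacent differences
    w = torch.exp(-diff)                           # similarity -> edge weight
    adj = torch.eye(n)
    adj[torch.arange(n - 1), torch.arange(1, n)] = w
    adj[torch.arange(1, n), torch.arange(n - 1)] = w
    adj = adj / adj.sum(dim=-1, keepdim=True)      # row-normalise
    return F.relu(adj @ feats @ weight)            # one GCN-style layer

feats = torch.randn(10, 64)
weight = torch.randn(64, 64)
print(adaptive_graph_conv(feats, weight).shape)    # (10, 64)
```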
This list is automatically generated from the titles and abstracts of the papers on this site.