A Temporal Modeling Framework for Video Pre-Training on Video Instance Segmentation
- URL: http://arxiv.org/abs/2503.17672v1
- Date: Sat, 22 Mar 2025 07:01:25 GMT
- Title: A Temporal Modeling Framework for Video Pre-Training on Video Instance Segmentation
- Authors: Qing Zhong, Peng-Tao Jiang, Wen Wang, Guodong Ding, Lin Wu, Kaiqi Huang,
- Abstract summary: We present a novel video pre-training approach to enhance VIS models.<n>Our innovation focuses on reducing disparities between the pre-training and fine-tuning stages.<n>Our approach achieves a notable 4.0% increase in average precision on the challenging OVIS dataset.
- Score: 32.65111815620319
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contemporary Video Instance Segmentation (VIS) methods typically adhere to a pre-train then fine-tune regime, where a segmentation model trained on images is fine-tuned on videos. However, the lack of temporal knowledge in the pre-trained model introduces a domain gap which may adversely affect the VIS performance. To effectively bridge this gap, we present a novel video pre-training approach to enhance VIS models, especially for videos with intricate instance relationships. Our crucial innovation focuses on reducing disparities between the pre-training and fine-tuning stages. Specifically, we first introduce consistent pseudo-video augmentations to create diverse pseudo-video samples for pre-training while maintaining the instance consistency across frames. Then, we incorporate a multi-scale temporal module to enhance the model's ability to model temporal relations through self- and cross-attention at short- and long-term temporal spans. Our approach does not set constraints on model architecture and can integrate seamlessly with various VIS methods. Experiment results on commonly adopted VIS benchmarks show that our method consistently outperforms state-of-the-art methods. Our approach achieves a notable 4.0% increase in average precision on the challenging OVIS dataset.
Related papers
- Learning from One Continuous Video Stream [70.30084026960819]
We introduce a framework for online learning from a single continuous video stream.
This poses great challenges given the high correlation between consecutive video frames.
We employ pixel-to-pixel modelling as a practical and flexible way to switch between pre-training and single-stream evaluation.
arXiv Detail & Related papers (2023-12-01T14:03:30Z) - CTVIS: Consistent Training for Online Video Instance Segmentation [62.957370691452844]
Discrimination of instance embeddings plays a vital role in associating instances across time for online video instance segmentation (VIS)
Recent online VIS methods leverage CIs sourced from one reference frame only, which we argue is insufficient for learning highly discriminative embeddings.
We propose a simple yet effective training strategy, called Consistent Training for Online VIS (CTVIS), which devotes to aligning the training and inference pipelines.
arXiv Detail & Related papers (2023-07-24T08:44:25Z) - Self-Supervised Video Representation Learning via Latent Time Navigation [12.721647696921865]
Self-supervised video representation learning aims at maximizing similarity between different temporal segments of one video.
We propose Latent Time Navigation (LTN) to capture fine-grained motions.
Our experimental analysis suggests that learning video representations by LTN consistently improves performance of action classification.
arXiv Detail & Related papers (2023-05-10T20:06:17Z) - Implicit Temporal Modeling with Learnable Alignment for Video
Recognition [95.82093301212964]
We propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving incredibly high performance.
ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with much fewer FLOPs compared with Swin-L and ViViT-H.
arXiv Detail & Related papers (2023-04-20T17:11:01Z) - Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from internal models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of architectures convolutional attention-based on Kinetics-400.
arXiv Detail & Related papers (2023-03-17T12:55:22Z) - TCGL: Temporal Contrastive Graph for Self-supervised Video
Representation Learning [79.77010271213695]
We propose a novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL)
Our TCGL integrates the prior knowledge about the frame and snippet orders into graph structures, i.e., the intra-/inter- snippet Temporal Contrastive Graphs (TCG)
To generate supervisory signals for unlabeled videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module.
arXiv Detail & Related papers (2021-12-07T09:27:56Z) - Learning Video Instance Segmentation with Recurrent Graph Neural
Networks [39.06202374530647]
We propose a novel learning formulation, where the entire video instance segmentation problem is modelled jointly.
We fit a flexible model to our formulation that, with the help of a graph neural network, processes all available new information in each frame.
Our approach, operating at over 25 FPS, outperforms previous video real-time methods.
arXiv Detail & Related papers (2020-12-07T18:41:35Z) - Efficient Semantic Video Segmentation with Per-frame Inference [117.97423110566963]
In this work, we process efficient semantic video segmentation in a per-frame fashion during the inference process.
We employ compact models for real-time execution. To narrow the performance gap between compact models and large models, new knowledge distillation methods are designed.
arXiv Detail & Related papers (2020-02-26T12:24:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.