Depth-aware Test-Time Training for Zero-shot Video Object Segmentation
- URL: http://arxiv.org/abs/2403.04258v1
- Date: Thu, 7 Mar 2024 06:40:53 GMT
- Title: Depth-aware Test-Time Training for Zero-shot Video Object Segmentation
- Authors: Weihuang Liu, Xi Shen, Haolun Li, Xiuli Bi, Bo Liu, Chi-Man Pun,
Xiaodong Cun
- Abstract summary: We introduce a test-time training (TTT) strategy to address the problem of generalization to unseen videos.
Our key insight is to force the model to predict consistent depth during the TTT process.
Our proposed video TTT strategy significantly outperforms state-of-the-art TTT methods.
- Score: 48.2238806766877
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Zero-shot Video Object Segmentation (ZSVOS) aims to segment the primary
moving object without any human annotations. Mainstream solutions focus on
learning a single model on large-scale video datasets, but such models struggle
to generalize to unseen videos. In this work, we introduce a test-time training
(TTT) strategy to address this generalization problem. Our key insight is to
force the model to predict consistent depth during the TTT process. Specifically,
we first train a single network to perform both segmentation and depth
prediction; this is learned effectively with our specifically designed depth
modulation layer. Then, during TTT, the model is updated by predicting
consistent depth maps for the same frame under different data augmentations.
We also explore different TTT weight-updating strategies. Our empirical results
suggest that momentum-based weight initialization and a looping-based training
scheme lead to more stable improvements. Experiments show that the proposed
method achieves clear improvements on ZSVOS, and our video TTT strategy
significantly outperforms state-of-the-art TTT methods. Our code is available
at: https://nifangbaage.github.io/DATTT.
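Read mechanically, the abstract's TTT loop can be pictured as follows. This is a minimal PyTorch-style sketch under stated assumptions: the toy two-head network, the photometric augmentation, the EMA reading of "momentum-based weight initialization", and all hyper-parameters are illustrative, not the authors' released implementation (see their code link above for that).

```python
# A minimal sketch of depth-consistency TTT, assuming a toy two-head
# network and photometric augmentation; hyper-parameters are illustrative.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegDepthNet(nn.Module):
    """Toy joint network: shared encoder with segmentation and depth heads."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
        )
        self.seg_head = nn.Conv2d(16, 1, 1)    # primary-object mask logits
        self.depth_head = nn.Conv2d(16, 1, 1)  # per-pixel depth

    def forward(self, x):
        feats = self.encoder(x)
        return self.seg_head(feats), self.depth_head(feats)

def photometric_aug(x):
    """Cheap stand-in for the paper's data augmentations (assumption)."""
    gain = torch.empty(x.size(0), 1, 1, 1).uniform_(0.8, 1.2)
    return (x * gain).clamp(0, 1)

def ttt_on_video(base_model, frames, steps=3, lr=1e-4, momentum=0.99):
    # Start each video from a copy of the (momentum-updated) base weights.
    model = copy.deepcopy(base_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):             # looping-based training scheme
        for frame in frames:           # frame: (1, 3, H, W)
            _, d1 = model(photometric_aug(frame))
            _, d2 = model(photometric_aug(frame))
            loss = F.l1_loss(d1, d2)   # enforce consistent depth predictions
            opt.zero_grad()
            loss.backward()
            opt.step()
    # Momentum-based weight initialization (one hedged reading of the
    # abstract): fold adapted weights into the initialization as an EMA.
    with torch.no_grad():
        for p_base, p_new in zip(base_model.parameters(), model.parameters()):
            p_base.mul_(momentum).add_(p_new, alpha=1 - momentum)
    return model  # adapted model, used to segment this video

frames = [torch.rand(1, 3, 64, 64) for _ in range(4)]
adapted = ttt_on_video(SegDepthNet(), frames)
mask_logits, depth = adapted(frames[0])
```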
Related papers
- Test-Time Training on Video Streams [54.07009446207442]
Prior work has established test-time training (TTT) as a general framework to further improve a trained model at test time.
We extend TTT to the streaming setting, where multiple test instances arrive in temporal order.
Online TTT significantly outperforms the fixed-model baseline on four tasks across three real-world datasets.
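A minimal sketch of the online protocol this summary describes: the model adapts on each arriving frame and is never reset, so adaptation carries over in temporal order. The rotation-prediction task and the tiny classifier are placeholder assumptions (a classic TTT-style objective, not necessarily this paper's choice).

```python
# Online TTT on a stream: adapt on each arriving frame, never reset.
import torch
import torch.nn as nn
import torch.nn.functional as F

def rotation_task_loss(model, frame):
    """Self-supervised loss: predict which of 4 rotations was applied."""
    k = torch.randint(0, 4, (1,)).item()
    rotated = torch.rot90(frame, k, dims=(2, 3))
    logits = model(rotated)                        # (1, 4) rotation logits
    return F.cross_entropy(logits, torch.tensor([k]))

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 4))
opt = torch.optim.SGD(model.parameters(), lr=1e-4)

stream = (torch.rand(1, 3, 64, 64) for _ in range(10))  # frames in order
for frame in stream:
    loss = rotation_task_loss(model, frame)   # adapt on the current frame
    opt.zero_grad()
    loss.backward()
    opt.step()
    # ...the adapted backbone would then serve the main task on this frame.
```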
arXiv Detail & Related papers (2023-07-11T05:17:42Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- Curriculum Learning for Recurrent Video Object Segmentation [2.3376061255029064]
This work explores different schedule sampling and frame skipping variations to significantly improve the performance of a recurrent architecture.
Our results on the car class of the KITTI-MOTS challenge indicate that, surprisingly, an inverse schedule sampling is a better option than a classic forward one.
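A minimal sketch of the scheduling choice this summary contrasts, with an illustrative linear schedule (an assumption, not the paper's exact curriculum): under the classic forward schedule the ground-truth previous mask is fed less and less often as training progresses, while the inverse schedule does the opposite.

```python
# Scheduled sampling for a recurrent VOS model, with an illustrative
# linear schedule; names and the schedule shape are assumptions.
import random

def ground_truth_prob(epoch, total_epochs, inverse=True):
    """Probability of feeding the ground-truth previous mask at this epoch."""
    frac = epoch / max(1, total_epochs - 1)
    # Forward (classic): start from ground truth, shift toward predictions.
    # Inverse: start from the model's own predictions, shift toward GT.
    return frac if inverse else 1.0 - frac

def pick_previous_mask(gt_prev_mask, predicted_prev_mask, p_gt):
    """Scheduled sampling: choose the recurrent input for the next frame."""
    return gt_prev_mask if random.random() < p_gt else predicted_prev_mask
```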
arXiv Detail & Related papers (2020-08-15T10:51:22Z)
- Dynamic Scale Training for Object Detection [111.33112051962514]
We propose a Dynamic Scale Training paradigm (abbreviated as DST) to mitigate the scale variation challenge in object detection.
Experimental results demonstrate the efficacy of the proposed DST in handling scale variation.
It does not introduce inference overhead and could serve as a free lunch for general detection configurations.
arXiv Detail & Related papers (2020-04-26T16:48:17Z)
- Dense Regression Network for Video Grounding [97.57178850020327]
We use the distances from each frame within the ground-truth segment to the starting (ending) frame as dense supervision to improve video grounding accuracy.
Specifically, we design a novel dense regression network (DRN) to regress the distances from each frame to the starting (ending) frame of the video segment.
We also propose a simple but effective IoU regression head module to explicitly consider the localization quality of the grounding results.
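A minimal sketch of the dense targets and the 1-D IoU quality signal this summary describes; frame indexing, normalization, and function names are illustrative assumptions rather than the DRN's exact formulation.

```python
# Dense regression targets and a 1-D IoU quality score (illustrative).

def dense_regression_targets(s, e, num_frames):
    """For each frame t inside [s, e], the target is (t - s, e - t)."""
    return [(t - s, e - t) if s <= t <= e else None
            for t in range(num_frames)]

def temporal_iou(pred, gt):
    """1-D IoU between predicted and ground-truth (start, end) segments."""
    inter = max(0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

# Example: a 10-frame video whose grounded segment spans frames 3..7.
targets = dense_regression_targets(3, 7, 10)   # frame 5 -> (2, 2)
quality = temporal_iou((2, 7), (3, 7))         # 4 / 5 = 0.8
```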
arXiv Detail & Related papers (2020-04-07T17:15:37Z)
- Rethinking Zero-shot Video Classification: End-to-end Training for Realistic Applications [26.955001807330497]
Zero-shot learning (ZSL) trains a model once so that it generalizes to new tasks whose classes are not present in the training dataset.
We propose the first end-to-end algorithm for ZSL in video classification.
Our training procedure builds on insights from recent video classification literature and uses a trainable 3D CNN to learn the visual features.
arXiv Detail & Related papers (2020-03-03T11:09:59Z)
- Learning Fast and Robust Target Models for Video Object Segmentation [83.3382606349118]
Video object segmentation (VOS) is a highly challenging problem since the initial mask, defining the target object, is only given at test-time.
Most previous approaches fine-tune segmentation networks on the first frame, resulting in impractical frame-rates and risk of overfitting.
We propose a novel VOS architecture consisting of two network components.
arXiv Detail & Related papers (2020-02-27T21:58:06Z)