Temporal Transductive Inference for Few-Shot Video Object Segmentation
- URL: http://arxiv.org/abs/2203.14308v2
- Date: Sun, 16 Jul 2023 13:31:17 GMT
- Title: Temporal Transductive Inference for Few-Shot Video Object Segmentation
- Authors: Mennatullah Siam, Konstantinos G. Derpanis, Richard P. Wildes
- Abstract summary: Few-shot video object segmentation (FS-VOS) aims at segmenting video frames using a few labelled examples of classes not seen during initial training.
Key to our approach is the use of both global and local temporal constraints.
Empirically, our model outperforms state-of-the-art meta-learning approaches in terms of mean intersection over union on YouTube-VIS by 2.8%.
- Score: 27.140141181513425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Few-shot video object segmentation (FS-VOS) aims at segmenting video frames
using a few labelled examples of classes not seen during initial training. In
this paper, we present a simple but effective temporal transductive inference
(TTI) approach that leverages temporal consistency in the unlabelled video
frames during few-shot inference. Key to our approach is the use of both global
and local temporal constraints. The objective of the global constraint is to
learn consistent linear classifiers for novel classes across the image
sequence, whereas the local constraint enforces the proportion of
foreground/background regions in each frame to be coherent across a local
temporal window. These constraints act as spatiotemporal regularizers during
the transductive inference to increase temporal coherence and reduce
overfitting on the few-shot support set. Empirically, our model outperforms
state-of-the-art meta-learning approaches in terms of mean intersection over
union on YouTube-VIS by 2.8%. In addition, we introduce improved benchmarks
that are exhaustively labelled (i.e., all object occurrences are labelled,
unlike currently available benchmarks), and present a more realistic evaluation
paradigm that targets data distribution shift between training and testing
sets. Our empirical results and in-depth analysis confirm the added benefits of
the proposed spatiotemporal regularizers to improve temporal coherence and
overcome certain overfitting scenarios.
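To make the two regularizers concrete, here is a minimal PyTorch sketch of one transductive-inference step. It is an illustration under stated assumptions, not the authors' code: the function and variable names, the sigmoid foreground parameterization, the window size, and the weight `lam` are all invented for exposition.

```python
import torch
import torch.nn.functional as F

def foreground_proportion(logits):
    # logits: (T, 1, H, W) -> per-frame foreground area ratio in [0, 1].
    return torch.sigmoid(logits).mean(dim=(1, 2, 3))     # (T,)

def local_temporal_loss(logits, window=5):
    # Local constraint: each frame's foreground proportion should stay
    # close to the mean proportion over a local temporal window.
    p = foreground_proportion(logits)                    # (T,)
    smoothed = F.avg_pool1d(p.view(1, 1, -1), window, stride=1,
                            padding=window // 2,
                            count_include_pad=False).view(-1)
    return ((p - smoothed) ** 2).mean()

def tti_step(classifier, support_feats, support_masks, query_feats,
             optimizer, lam=0.1):
    # Global constraint (one simple reading): a single linear classifier
    # (a 1x1 conv) is shared by every frame of the sequence, so the novel
    # class is modelled consistently across time.
    optimizer.zero_grad()
    s_logits = classifier(support_feats)                 # (S, 1, H, W)
    ce = F.binary_cross_entropy_with_logits(s_logits, support_masks)
    q_logits = classifier(query_feats)                   # (T, 1, H, W)
    loss = ce + lam * local_temporal_loss(q_logits)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Here `classifier` could be `torch.nn.Conv2d(C, 1, kernel_size=1)` fitted over frozen frame features: the cross-entropy on the labelled support set and the proportion regularizer on the unlabelled query frames together drive its weights. The paper's actual global constraint may be richer than the shared-classifier simplification used here.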
Related papers
- Frame Order Matters: A Temporal Sequence-Aware Model for Few-Shot Action Recognition [14.97527336050901]
We propose a novel Temporal Sequence-Aware Model (TSAM) for few-shot action recognition (FSAR).
It incorporates a sequential perceiver adapter into the pre-training framework to integrate both spatial information and sequential temporal dynamics into the feature embeddings.
Experimental results on five FSAR datasets demonstrate that our method sets a new benchmark, beating the second-best competitors by large margins.
arXiv Detail & Related papers (2024-08-22T15:13:27Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
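One way to read this equivariance idea, as a hedged sketch (the predictor interface, the crop transform, and the loss form are assumptions, not the ECRL framework itself): temporally cropping the input video should shift the predicted activity boundary by the same amount.

```python
def boundary_equivariance_loss(predict, video, query, crop):
    # predict(video, query) -> (start, end) boundary tensors, in frames.
    start, end = predict(video, query)
    # Drop the first `crop` frames; an equivariant predictor should place
    # the same boundary exactly `crop` frames earlier.
    s2, e2 = predict(video[crop:], query)
    return (s2 - (start - crop)).abs() + (e2 - (end - crop)).abs()
```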
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Video Activity Localisation with Uncertainties in Temporal Boundary [74.7263952414899]
Methods for video activity localisation over time implicitly assume that activity temporal boundaries are determined and precise.
In unscripted natural videos, different activities transition smoothly, so it is intrinsically ambiguous to label precisely when an activity starts and ends.
We introduce Elastic Moment Bounding (EMB) to accommodate flexible and adaptive activity temporal boundaries.
arXiv Detail & Related papers (2022-06-26T16:45:56Z)
- Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
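The LCS half of the objective builds on the classic longest-common-subsequence recurrence; the paper uses a differentiable relaxation, but the hard dynamic program it relaxes looks like this (plain Python, for illustration only):

```python
def lcs_length(a, b):
    # Length of the longest common subsequence of sequences a and b.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1   # extend a match
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

assert lcs_length("AABFA", "ABFFA") == 4          # common subsequence "ABFA"
```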
arXiv Detail & Related papers (2022-03-31T05:13:50Z)
- TimeLens: Event-based Video Frame Interpolation [54.28139783383213]
We introduce Time Lens, a novel method that leverages the advantages of both synthesis-based and flow-based approaches.
We show an up to 5.21 dB improvement in terms of PSNR over state-of-the-art frame-based and event-based methods.
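For context, PSNR measures reconstruction quality on a log scale of mean squared error; a minimal NumPy version (the function name and the 8-bit peak value are assumptions) is below. On this scale, the quoted 5.21 dB gain corresponds to roughly a 3.3x reduction in MSE, since 10^(5.21/10) ≈ 3.3.

```python
import numpy as np

def psnr(pred, target, max_val=255.0):
    # Peak signal-to-noise ratio in dB; higher is better.
    err = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(err ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```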
arXiv Detail & Related papers (2021-06-14T10:33:47Z)
- Temporal Context Aggregation Network for Temporal Action Proposal Refinement [93.03730692520999]
Temporal action proposal generation is a challenging yet important task in the video understanding field.
Current methods still suffer from inaccurate temporal boundaries and inferior confidence used for retrieval.
We propose TCANet to generate high-quality action proposals through "local and global" temporal context aggregation.
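As a generic illustration of combining local and global temporal context over a snippet-feature sequence (this is not TCANet's actual architecture; the module below is an assumption-laden sketch):

```python
import torch
import torch.nn as nn

class LocalGlobalContext(nn.Module):
    # Aggregates a (B, C, T) temporal feature sequence with a local
    # convolutional window plus a globally pooled summary vector.
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.local = nn.Conv1d(channels, channels, kernel_size,
                               padding=kernel_size // 2)
        self.to_global = nn.Linear(channels, channels)

    def forward(self, x):                         # x: (B, C, T)
        local = self.local(x)                     # local temporal context
        g = self.to_global(x.mean(dim=2))         # (B, C) global summary
        return torch.relu(local + g.unsqueeze(-1))
```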
arXiv Detail & Related papers (2021-03-24T12:34:49Z)
- Bottom-Up Temporal Action Localization with Mutual Regularization [107.39785866001868]
State-of-the-art solutions for TAL involve evaluating the frame-level probabilities of three action-indicating phases.
We introduce two regularization terms to mutually regularize the learning procedure.
Experiments are performed on two popular TAL datasets, THUMOS14 and ActivityNet v1.3.
arXiv Detail & Related papers (2020-02-18T03:59:13Z)