Domain Adaptive Video Segmentation via Temporal Consistency
Regularization
- URL: http://arxiv.org/abs/2107.11004v1
- Date: Fri, 23 Jul 2021 02:50:42 GMT
- Title: Domain Adaptive Video Segmentation via Temporal Consistency
Regularization
- Authors: Dayan Guan, Jiaxing Huang, Aoran Xiao, Shijian Lu
- Abstract summary: This paper presents DA-VSN, a domain adaptive video segmentation network that addresses domain gaps in videos by temporal consistency regularization (TCR).
The first is cross-domain TCR that guides the prediction of target frames to have similar temporal consistency as that of source frames (learnt from annotated source data) via adversarial learning.
The second is intra-domain TCR that guides unconfident predictions of target frames to have similar temporal consistency as confident predictions of target frames.
- Score: 32.77436219094282
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Video semantic segmentation is an essential task for the analysis and
understanding of videos. Recent efforts largely focus on supervised video
segmentation by learning from fully annotated data, but the learnt models often
experience a clear performance drop when applied to videos of a different
domain. This paper presents DA-VSN, a domain adaptive video segmentation
network that addresses domain gaps in videos by temporal consistency
regularization (TCR) for consecutive frames of target-domain videos. DA-VSN
consists of two novel and complementary designs. The first is cross-domain TCR
that guides the prediction of target frames to have similar temporal
consistency as that of source frames (learnt from annotated source data) via
adversarial learning. The second is intra-domain TCR that guides unconfident
predictions of target frames to have similar temporal consistency as confident
predictions of target frames. Extensive experiments demonstrate the superiority
of our proposed domain adaptive video segmentation network, which consistently
outperforms multiple baselines by large margins.
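The two TCR terms described in the abstract can be illustrated with a minimal PyTorch sketch. The absolute-difference consistency map, the 0.8 confidence threshold, and the small discriminator below are illustrative assumptions rather than the paper's exact formulation; the frame t-1 prediction is assumed to be warped into frame t (e.g. via optical flow) before the discrepancy is computed.
```python
# Minimal sketch of the two TCR losses (PyTorch). Assumptions, not the paper's
# exact design: absolute-difference consistency map, 0.8 threshold, tiny
# fully-convolutional discriminator.
import torch
import torch.nn as nn
import torch.nn.functional as F


def temporal_consistency_map(prob_t, prob_tm1):
    # prob_t, prob_tm1: (B, C, H, W) softmax outputs for frames t and t-1,
    # with the t-1 prediction assumed to be warped into frame t beforehand.
    # Lower values mean higher temporal consistency.
    return (prob_t - prob_tm1).abs().sum(dim=1, keepdim=True)  # (B, 1, H, W)


class ConsistencyDiscriminator(nn.Module):
    """Small fully-convolutional discriminator over consistency maps."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.net(x)


def cross_domain_tcr_loss(disc, tgt_consistency):
    # Adversarial (generator-side) loss: update the segmentation network so the
    # discriminator labels target consistency maps as "source" (label 1 here).
    # The discriminator itself is trained separately on source vs. target maps.
    logits = disc(tgt_consistency)
    return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))


def intra_domain_tcr_loss(prob_t, prob_tm1, conf_threshold=0.8):
    # Push unconfident target pixels toward the temporal-consistency level
    # exhibited by confident target pixels.
    consistency = temporal_consistency_map(prob_t, prob_tm1)   # (B, 1, H, W)
    confidence = prob_t.max(dim=1, keepdim=True).values        # (B, 1, H, W)
    confident = confidence >= conf_threshold
    unconfident = ~confident
    if confident.sum() == 0 or unconfident.sum() == 0:
        return consistency.new_zeros(())
    target_level = consistency[confident].mean().detach()
    gap = consistency[unconfident] - target_level
    return F.relu(gap).mean()  # penalise only being *less* consistent
```
In a full training step, both terms would be added to the supervised segmentation loss on source frames, and the discriminator would additionally be updated with source consistency maps labelled 1 and target maps labelled 0.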
Related papers
- Generative Video Diffusion for Unseen Cross-Domain Video Moment Retrieval [58.17315970207874]
Video Moment Retrieval (VMR) requires precise modelling of fine-grained moment-text associations to capture intricate visual-language relationships.
Existing methods resort to joint training on both source and target domain videos for cross-domain applications.
We explore generative video diffusion for fine-grained editing of source videos controlled by the target sentences.
arXiv Detail & Related papers (2024-01-24T09:45:40Z)
- Multi-Modal Domain Adaptation Across Video Scenes for Temporal Video Grounding [59.599378814835205]
Temporal Video Grounding (TVG) aims to localize the temporal boundary of a specific segment in an untrimmed video based on a given language query.
We introduce a novel AMDA method to adaptively adjust the model's scene-related knowledge by incorporating insights from the target data.
arXiv Detail & Related papers (2023-12-21T07:49:27Z)
- Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation [156.4142424784322]
Few-Shot Video Object Segmentation (FSVOS) aims to segment objects in a query video with the same category defined by a few annotated support images.
We propose to leverage multi-grained temporal guidance information for handling the temporal correlation nature of video data.
Our proposed video IPMT model significantly outperforms previous models on two benchmark datasets.
arXiv Detail & Related papers (2023-09-20T09:16:34Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Diverse Video Captioning by Adaptive Spatio-temporal Attention [7.96569366755701]
Our end-to-end encoder-decoder video captioning framework incorporates two transformer-based architectures.
We introduce an adaptive frame selection scheme to reduce the number of required incoming frames.
We estimate semantic concepts relevant for video captioning by aggregating all ground truth captions of each sample.
arXiv Detail & Related papers (2022-08-19T11:21:59Z)
- Domain Adaptive Video Segmentation via Temporal Pseudo Supervision [46.38660541271893]
Video semantic segmentation can mitigate data labelling constraints by adapting from a labelled source domain toward an unlabelled target domain.
We design temporal pseudo supervision (TPS), a simple and effective method that explores the idea of consistency training for learning effective representations from target videos (a generic consistency-training sketch follows after this list).
We show that TPS is simpler to implement, much more stable to train, and achieves superior video segmentation accuracy as compared with the state-of-the-art.
arXiv Detail & Related papers (2022-07-06T00:36:14Z)
- Temporal Transductive Inference for Few-Shot Video Object Segmentation [27.140141181513425]
Few-shot video object segmentation (FS-VOS) aims at segmenting video frames using a few labelled examples of classes not seen during initial training.
Key to our approach is the use of both global and local temporal constraints.
Empirically, our model outperforms state-of-the-art meta-learning approaches in terms of mean intersection over union on YouTube-VIS by 2.8%.
arXiv Detail & Related papers (2022-03-27T14:08:30Z)
- Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)
- Adversarial Bipartite Graph Learning for Video Domain Adaptation [50.68420708387015]
Domain adaptation techniques, which focus on adapting models between distributionally different domains, are rarely explored in the video recognition area.
Recent works on visual domain adaptation, which leverage adversarial learning to unify the source and target video representations, are not highly effective on videos.
This paper proposes an Adversarial Bipartite Graph (ABG) learning framework which directly models the source-target interactions.
arXiv Detail & Related papers (2020-07-31T03:48:41Z)
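As referenced in the Temporal Pseudo Supervision entry above, that method builds on consistency training. Below is a minimal, generic sketch of the idea in PyTorch; pseudo-labelling a clean target frame to supervise an augmented view of the same frame, and the 0.9 confidence threshold, are illustrative assumptions rather than the exact cross-frame formulation of TPS.
```python
# Generic consistency-training sketch (PyTorch): pseudo-label a clean target
# frame, then supervise the prediction on an augmented view of the same frame.
# The augmentation is assumed to be photometric (spatially aligned), so
# pixel-wise supervision remains valid; the 0.9 threshold is an assumption.
import torch
import torch.nn.functional as F


def consistency_training_loss(model, frame, augment, conf_threshold=0.9):
    # frame: (B, 3, H, W) unlabelled target frame; augment: callable that
    # returns an augmented view with the same spatial size.
    with torch.no_grad():
        probs = torch.softmax(model(frame), dim=1)      # (B, C, H, W)
        conf, pseudo = probs.max(dim=1)                 # both (B, H, W)
        keep = (conf >= conf_threshold).float()         # confident pixels only

    logits_aug = model(augment(frame))                  # (B, C, H, W)
    per_pixel = F.cross_entropy(logits_aug, pseudo, reduction="none")  # (B, H, W)
    return (per_pixel * keep).sum() / keep.sum().clamp(min=1.0)
```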