We're Not Using Videos Effectively: An Updated Domain Adaptive Video
Segmentation Baseline
- URL: http://arxiv.org/abs/2402.00868v3
- Date: Tue, 27 Feb 2024 22:25:15 GMT
- Authors: Simar Kareer, Vivek Vijaykumar, Harsh Maheshwari, Prithvijit
Chattopadhyay, Judy Hoffman, Viraj Prabhu
- Abstract summary: Video-DAS works have historically studied a distinct set of benchmarks from Image-DAS, with minimal cross-benchmarking.
We find that even after carefully controlling for data and model architecture, state-of-the-art Image-DAS methods outperform Video-DAS methods on established Video-DAS benchmarks.
- Score: 19.098970392639476
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There has been abundant work in unsupervised domain adaptation for semantic
segmentation (DAS) seeking to adapt a model trained on images from a labeled
source domain to an unlabeled target domain. While the vast majority of prior
work has studied this as a frame-level Image-DAS problem, a few Video-DAS works
have sought to additionally leverage the temporal signal present in adjacent
frames. However, Video-DAS works have historically studied a distinct set of
benchmarks from Image-DAS, with minimal cross-benchmarking. In this work, we
address this gap. Surprisingly, we find that (1) even after carefully
controlling for data and model architecture, state-of-the-art Image-DAS methods
(HRDA and HRDA+MIC) outperform Video-DAS methods on established Video-DAS
benchmarks (+14.5 mIoU on Viper$\rightarrow$CityscapesSeq, +19.0 mIoU on
Synthia$\rightarrow$CityscapesSeq), and (2) naive combinations of Image-DAS and
Video-DAS techniques only lead to marginal improvements across datasets. To
avoid siloed progress between Image-DAS and Video-DAS, we open-source our
codebase with support for a comprehensive set of Video-DAS and Image-DAS
methods on a common benchmark. Code available at
https://github.com/SimarKareer/UnifiedVideoDA
Related papers
- Buffer Anytime: Zero-Shot Video Depth and Normal from Image Priors [54.8852848659663]
Buffer Anytime is a framework for estimation of depth and normal maps (which we call geometric buffers) from video.
We demonstrate high-quality video buffer estimation by leveraging single-image priors with temporal consistency constraints.
arXiv Detail & Related papers (2024-11-26T09:28:32Z)
- GIM: Learning Generalizable Image Matcher From Internet Videos [18.974842517202365]
We propose GIM, a self-training framework for learning a single generalizable model based on any image matching architecture.
We also propose ZEB, the first zero-shot evaluation benchmark for image matching.
arXiv Detail & Related papers (2024-02-16T21:48:17Z)
- FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline [4.295130967329365]
This paper presents a new two-stage latent diffusion text-to-video generation architecture based on the text-to-image diffusion model.
The design of our model significantly reduces computational costs compared to other masked frame approaches.
We evaluate different configurations of a MoVQ-based video decoding scheme to improve consistency and achieve better PSNR, SSIM, MSE, and LPIPS scores.
arXiv Detail & Related papers (2023-11-22T00:26:15Z)
- Memory Efficient Temporal & Visual Graph Model for Unsupervised Video Domain Adaptation [50.158454960223274]
Existing video domain adaptation (DA) methods need to store all temporal combinations of video frames or pair the source and target videos.
We propose a memory-efficient graph-based video DA approach.
arXiv Detail & Related papers (2022-08-13T02:56:10Z)
- Unsupervised Domain Adaptation for Video Transformers in Action Recognition [76.31442702219461]
We propose a simple and novel UDA approach for video action recognition.
Our approach builds a robust source model that generalises better to the target domain.
We report results on two video action recognition benchmarks for UDA.
arXiv Detail & Related papers (2022-07-26T12:17:39Z)
- VRAG: Region Attention Graphs for Content-Based Video Retrieval [85.54923500208041]
Region Attention Graph Networks (VRAG) improve on state-of-the-art video-level methods.
VRAG represents videos at a finer granularity via region-level features and encodes video-temporal dynamics through region-level relations.
We show that the performance gap between video-level and frame-level methods can be reduced by segmenting videos into shots and using shot embeddings for video retrieval.
arXiv Detail & Related papers (2022-05-18T16:50:45Z)
- CycDA: Unsupervised Cycle Domain Adaptation from Image to Video [26.30914383638721]
Cycle Domain Adaptation (CycDA) is a cycle-based approach for unsupervised image-to-video domain adaptation.
We evaluate our approach on benchmark datasets for image-to-video and for mixed-source domain adaptation.
arXiv Detail & Related papers (2022-03-30T12:22:26Z)
- Box Supervised Video Segmentation Proposal Network [3.384080569028146]
We propose a box-supervised video object segmentation proposal network, which takes advantage of intrinsic video properties.
The proposed method outperforms the state-of-the-art self-supervised benchmark by 16.4% and 6.9%.
We provide extensive tests and ablations on the datasets, demonstrating the robustness of our method.
arXiv Detail & Related papers (2022-02-14T20:38:28Z)
- Object Propagation via Inter-Frame Attentions for Temporally Stable Video Instance Segmentation [51.68840525174265]
Video instance segmentation aims to detect, segment, and track objects in a video.
Current approaches extend image-level segmentation algorithms to the temporal domain.
We propose a video instance segmentation method that alleviates the problem due to missing detections.
arXiv Detail & Related papers (2021-11-15T04:15:57Z)
- Unsupervised Domain Adaptation for Video Semantic Segmentation [91.30558794056054]
Unsupervised Domain Adaptation for semantic segmentation has gained immense popularity since it can transfer knowledge from simulation to real.
In this work, we present a new video extension of this task, namely Unsupervised Domain Adaptation for Video Semantic Segmentation.
We show that our proposals significantly outperform previous image-based UDA methods both on image-level (mIoU) and video-level (VPQ) evaluation metrics.
arXiv Detail & Related papers (2021-07-23T07:18:20Z)
- Unified Image and Video Saliency Modeling [21.701431656717112]
We ask: Can image and video saliency modeling be approached via a unified model?
We propose four novel domain adaptation techniques and an improved formulation of learned Gaussian priors.
We integrate these techniques into a simple and lightweight encoder-RNN-decoder-style network, UNISAL, and train it jointly with image and video saliency data.
We evaluate our method on the video saliency datasets DHF1K, Hollywood-2 and UCF-Sports, and the image saliency datasets SALICON and MIT300.
arXiv Detail & Related papers (2020-03-11T18:28:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.