We're Not Using Videos Effectively: An Updated Domain Adaptive Video
Segmentation Baseline
- URL: http://arxiv.org/abs/2402.00868v3
- Date: Tue, 27 Feb 2024 22:25:15 GMT
- Title: We're Not Using Videos Effectively: An Updated Domain Adaptive Video
Segmentation Baseline
- Authors: Simar Kareer, Vivek Vijaykumar, Harsh Maheshwari, Prithvijit
Chattopadhyay, Judy Hoffman, Viraj Prabhu
- Abstract summary: Video-DAS works have historically studied a distinct set of benchmarks from Image-DAS, with minimal cross-benchmarking.
We find that even after carefully controlling for data and model architecture, state-of-the-art Image-DAS methods outperform Video-DAS methods on established Video-DAS benchmarks.
- Score: 19.098970392639476
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There has been abundant work in unsupervised domain adaptation for semantic
segmentation (DAS) seeking to adapt a model trained on images from a labeled
source domain to an unlabeled target domain. While the vast majority of prior
work has studied this as a frame-level Image-DAS problem, a few Video-DAS works
have sought to additionally leverage the temporal signal present in adjacent
frames. However, Video-DAS works have historically studied a distinct set of
benchmarks from Image-DAS, with minimal cross-benchmarking. In this work, we
address this gap. Surprisingly, we find that (1) even after carefully
controlling for data and model architecture, state-of-the-art Image-DAS methods
(HRDA and HRDA+MIC) outperform Video-DAS methods on established Video-DAS
benchmarks (+14.5 mIoU on Viper$\rightarrow$CityscapesSeq, +19.0 mIoU on
Synthia$\rightarrow$CityscapesSeq), and (2) naive combinations of Image-DAS and
Video-DAS techniques only lead to marginal improvements across datasets. To
avoid siloed progress between Image-DAS and Video-DAS, we open-source our
codebase with support for a comprehensive set of Video-DAS and Image-DAS
methods on a common benchmark. Code available at
https://github.com/SimarKareer/UnifiedVideoDA
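The headline comparisons above are reported in mIoU, the standard metric for semantic segmentation. As a reference point, here is a minimal sketch of per-class IoU and mIoU over integer label maps; the class count and ignore index below are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def mean_iou(preds, targets, num_classes, ignore_index=255):
    """Compute per-class IoU and mIoU from integer label maps.

    preds, targets: iterables of (H, W) integer arrays.
    Pixels labeled `ignore_index` in targets are excluded.
    """
    inter = np.zeros(num_classes, dtype=np.int64)
    union = np.zeros(num_classes, dtype=np.int64)
    for p, t in zip(preds, targets):
        valid = t != ignore_index
        p, t = p[valid], t[valid]
        for c in range(num_classes):
            pc, tc = p == c, t == c
            inter[c] += np.logical_and(pc, tc).sum()
            union[c] += np.logical_or(pc, tc).sum()
    iou = inter / np.maximum(union, 1)  # classes absent everywhere get IoU 0
    present = union > 0
    return iou, iou[present].mean()     # mIoU over classes that appear

# Toy usage: one 2x2 "frame", 3 classes
pred = [np.array([[0, 1], [2, 2]])]
gt   = [np.array([[0, 1], [2, 1]])]
per_class, miou = mean_iou(pred, gt, num_classes=3)
print(per_class, miou)
```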
Related papers
- GIM: Learning Generalizable Image Matcher From Internet Videos [18.974842517202365]
We propose GIM, a self-training framework for learning a single generalizable model based on any image matching architecture.
We also propose ZEB, the first zero-shot evaluation benchmark for image matching.
arXiv Detail & Related papers (2024-02-16T21:48:17Z) - FusionFrames: Efficient Architectural Aspects for Text-to-Video
Generation Pipeline [4.295130967329365]
This paper presents a new two-stage latent diffusion text-to-video generation architecture based on the text-to-image diffusion model.
The design of our model significantly reduces computational costs compared to other masked frame approaches.
We evaluate different configurations of the MoVQ-based video decoding scheme to improve consistency and achieve better PSNR, SSIM, MSE, and LPIPS scores.
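As a reference for these metrics: PSNR and SSIM improve as they increase, while MSE and LPIPS improve as they decrease, and PSNR is a simple function of MSE. A minimal sketch (the 8-bit peak value of 255 is an assumption about the image range):

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB; higher is better."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

# Toy usage: a clean image vs. a lightly perturbed copy
a = np.random.randint(0, 256, (64, 64, 3))
b = np.clip(a + np.random.randint(-5, 6, a.shape), 0, 255)
print(f"PSNR: {psnr(a, b):.2f} dB")
```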
arXiv Detail & Related papers (2023-11-22T00:26:15Z) - Memory Efficient Temporal & Visual Graph Model for Unsupervised Video
Domain Adaptation [50.158454960223274]
Existing video domain adaptation (DA) methods need to store all temporal combinations of video frames or pair the source and target videos.
We propose a memory-efficient graph-based video DA approach.
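A back-of-the-envelope sketch of the memory argument: storing all temporal combinations of T frames grows as O(T^2), while a sparse temporal graph with a fixed neighbor window grows as O(T*k). The window size k below is an illustrative assumption, not the paper's design.

```python
def dense_pairs(T):
    """All unordered frame pairs: O(T^2) storage."""
    return T * (T - 1) // 2

def windowed_graph_edges(T, k=2):
    """Sparse temporal graph: each frame links to at most k later
    neighbors, so storage is O(T*k)."""
    return sum(min(k, T - 1 - t) for t in range(T))

# Storage grows quadratically vs. linearly with clip length
for T in (30, 300, 3000):
    print(T, dense_pairs(T), windowed_graph_edges(T))
```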
arXiv Detail & Related papers (2022-08-13T02:56:10Z) - Unsupervised Domain Adaptation for Video Transformers in Action
Recognition [76.31442702219461]
We propose a simple and novel UDA approach for video action recognition.
Our approach builds a robust source model that generalises better to the target domain.
We report results on two video action recognition benchmarks for UDA.
arXiv Detail & Related papers (2022-07-26T12:17:39Z) - VRAG: Region Attention Graphs for Content-Based Video Retrieval [85.54923500208041]
Region Attention Graph Networks (VRAG) improve upon state-of-the-art video-level methods.
VRAG represents videos at a finer granularity via region-level features and encodes video-temporal dynamics through region-level relations.
We show that the performance gap between video-level and frame-level methods can be reduced by segmenting videos into shots and using shot embeddings for video retrieval.
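A minimal sketch of the shot-embedding idea: mean-pool frame features within each shot and score a video pair by its best-matching shots. The pooling and scoring choices here are illustrative assumptions, not VRAG's actual pipeline, and shot boundary detection is stubbed out as given indices.

```python
import numpy as np

def shot_embeddings(frame_feats, shot_bounds):
    """Mean-pool frame features within each shot, then L2-normalize.
    frame_feats: (T, D); shot_bounds: list of (start, end) frame indices."""
    shots = np.stack([frame_feats[s:e].mean(axis=0) for s, e in shot_bounds])
    return shots / np.linalg.norm(shots, axis=1, keepdims=True)

def video_similarity(shots_a, shots_b):
    """Score two videos by their best-matching shot pair (cosine similarity)."""
    return float((shots_a @ shots_b.T).max())

# Toy usage: two 8-frame videos with two shots each, 4-D features
rng = np.random.default_rng(0)
a = shot_embeddings(rng.normal(size=(8, 4)), [(0, 4), (4, 8)])
b = shot_embeddings(rng.normal(size=(8, 4)), [(0, 4), (4, 8)])
print(video_similarity(a, b))
```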
arXiv Detail & Related papers (2022-05-18T16:50:45Z) - Guess What Moves: Unsupervised Video and Image Segmentation by
Anticipating Motion [92.80981308407098]
We propose an approach that combines the strengths of motion-based and appearance-based segmentation.
We propose to supervise an image segmentation network, tasking it with predicting regions that are likely to contain simple motion patterns.
In the unsupervised video segmentation mode, the network is trained on a collection of unlabelled videos, using the learning process itself as an algorithm to segment these videos.
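A toy version of this supervision signal, assuming one constant motion per region rather than the richer parametric motion such methods actually fit: masks that carve the frame into coherently moving regions leave little flow residual, so minimizing the residual supervises the segmenter without labels.

```python
import numpy as np

def motion_supervision_loss(masks, flow):
    """Penalize flow that is NOT well explained by one simple (here: constant)
    motion per predicted region.
    masks: (K, H, W) soft assignments; flow: (H, W, 2) optical flow."""
    loss = 0.0
    for m in masks:                                        # for each region
        w = m.reshape(-1, 1)                               # per-pixel weight
        f = flow.reshape(-1, 2)
        mean_flow = (w * f).sum(0) / max(w.sum(), 1e-6)    # best constant flow
        loss += float((w * (f - mean_flow) ** 2).sum())    # unexplained residual
    return loss / flow.size

# Toy check: masks matching the moving region explain the flow better
flow = np.zeros((4, 4, 2)); flow[:, 2:] = [1.0, 0.0]       # right half moves right
good = np.stack([np.repeat([[1., 1., 0., 0.]], 4, 0),      # left-half mask
                 np.repeat([[0., 0., 1., 1.]], 4, 0)])     # right-half mask
bad = np.full((2, 4, 4), 0.5)                              # uninformative masks
print(motion_supervision_loss(good, flow),                 # 0.0
      motion_supervision_loss(bad, flow))                  # > 0
```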
arXiv Detail & Related papers (2022-05-16T17:55:34Z) - CycDA: Unsupervised Cycle Domain Adaptation from Image to Video [26.30914383638721]
Domain Cycle Adaptation (CycDA) is a cycle-based approach for unsupervised image-to-video domain adaptation.
We evaluate our approach on benchmark datasets for image-to-video and for mixed-source domain adaptation.
arXiv Detail & Related papers (2022-03-30T12:22:26Z) - Box Supervised Video Segmentation Proposal Network [3.384080569028146]
We propose a box-supervised video object segmentation proposal network, which takes advantage of intrinsic video properties.
The proposed method outperforms the state-of-the-art self-supervised benchmark by 16.4% and 6.9%.
We provide extensive tests and ablations on the datasets, demonstrating the robustness of our method.
arXiv Detail & Related papers (2022-02-14T20:38:28Z) - Object Propagation via Inter-Frame Attentions for Temporally Stable
Video Instance Segmentation [51.68840525174265]
Video instance segmentation aims to detect, segment, and track objects in a video.
Current approaches extend image-level segmentation algorithms to the temporal domain.
We propose a video instance segmentation method that alleviates the problem due to missing detections.
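One way to make "alleviates missing detections" concrete: carry each instance embedding from frame t-1 into frame t via cross-attention over the new frame's features, so an object can be re-localized even if the per-frame detector misses it. A minimal sketch, not the paper's exact attention module:

```python
import torch

def propagate_instances(prev_queries, curr_feats):
    """Propagate per-instance embeddings to the next frame via cross-attention.
    prev_queries: (N, D), one embedding per instance tracked in frame t-1.
    curr_feats: (H*W, D), flattened pixel features of frame t.
    Returns updated embeddings and a soft attention map per instance."""
    scale = curr_feats.shape[1] ** 0.5
    attn = torch.softmax(prev_queries @ curr_feats.T / scale, dim=-1)  # (N, H*W)
    updated = attn @ curr_feats  # attention-weighted readout of frame t
    return updated, attn

# Toy usage: 3 tracked instances over a 16x16 feature map with D=32
q = torch.randn(3, 32)
f = torch.randn(16 * 16, 32)
new_q, attn = propagate_instances(q, f)
print(new_q.shape, attn.shape)  # torch.Size([3, 32]) torch.Size([3, 256])
```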
arXiv Detail & Related papers (2021-11-15T04:15:57Z) - Unsupervised Domain Adaptation for Video Semantic Segmentation [91.30558794056054]
Unsupervised Domain Adaptation for semantic segmentation has gained immense popularity since it can transfer knowledge from simulation to the real world.
In this work, we present a new video extension of this task, namely Unsupervised Domain Adaptation for Video Semantic Segmentation.
We show that our proposals significantly outperform previous image-based UDA methods both on image-level (mIoU) and video-level (VPQ) evaluation metrics.
arXiv Detail & Related papers (2021-07-23T07:18:20Z) - Unified Image and Video Saliency Modeling [21.701431656717112]
We ask: Can image and video saliency modeling be approached via a unified model?
We propose four novel domain adaptation techniques and an improved formulation of learned Gaussian priors.
We integrate these techniques into a simple and lightweight encoder-RNN-decoder-style network, UNISAL, and train it jointly with image and video saliency data.
We evaluate our method on the video saliency datasets DHF1K, Hollywood-2 and UCF-Sports, and the image saliency datasets SALICON and MIT300.
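To make the encoder-RNN-decoder family concrete, here is a toy PyTorch skeleton that runs a GRU over time at each spatial location and treats images as one-frame videos; the layer sizes and per-location recurrence are illustrative assumptions, not UNISAL's actual design.

```python
import torch
import torch.nn as nn

class TinySaliencyNet(nn.Module):
    """Minimal encoder-RNN-decoder for saliency: a conv encoder, a GRU over
    time at every spatial location, and a conv decoder to one saliency map
    per frame. Images are handled as length-1 "videos"."""
    def __init__(self, ch=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.ReLU(),
                                 nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU())
        self.rnn = nn.GRU(ch, ch, batch_first=True)
        self.dec = nn.Sequential(nn.Upsample(scale_factor=4), nn.Conv2d(ch, 1, 3, padding=1))

    def forward(self, x):                              # x: (B, T, 3, H, W)
        B, T, C, H, W = x.shape
        f = self.enc(x.flatten(0, 1))                  # (B*T, ch, h, w)
        _, ch, h, w = f.shape
        seq = f.view(B, T, ch, h * w).permute(0, 3, 1, 2).reshape(B * h * w, T, ch)
        seq, _ = self.rnn(seq)                         # temporal recurrence per location
        f = seq.reshape(B, h * w, T, ch).permute(0, 2, 3, 1).reshape(B * T, ch, h, w)
        return torch.sigmoid(self.dec(f)).view(B, T, 1, H, W)

# Toy usage: a 2-frame clip and a single image through the same model
net = TinySaliencyNet()
print(net(torch.randn(1, 2, 3, 64, 64)).shape)  # (1, 2, 1, 64, 64)
print(net(torch.randn(1, 1, 3, 64, 64)).shape)  # image as a 1-frame video
```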
arXiv Detail & Related papers (2020-03-11T18:28:29Z)