Unified Image and Video Saliency Modeling
- URL: http://arxiv.org/abs/2003.05477v3
- Date: Sat, 7 Nov 2020 13:43:34 GMT
- Title: Unified Image and Video Saliency Modeling
- Authors: Richard Droste, Jianbo Jiao, J. Alison Noble
- Abstract summary: We ask: Can image and video saliency modeling be approached via a unified model?
We propose four novel domain adaptation techniques and an improved formulation of learned Gaussian priors.
We integrate these techniques into a simple and lightweight encoder-RNN-decoder-style network, UNISAL, and train it jointly with image and video saliency data.
We evaluate our method on the video saliency datasets DHF1K, Hollywood-2 and UCF-Sports, and the image saliency datasets SALICON and MIT300.
- Score: 21.701431656717112
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual saliency modeling for images and videos is treated as two independent
tasks in recent computer vision literature. While image saliency modeling is a
well-studied problem and progress on benchmarks like SALICON and MIT300 is
slowing, video saliency models have shown rapid gains on the recent DHF1K
benchmark. Here, we take a step back and ask: Can image and video saliency
modeling be approached via a unified model, with mutual benefit? We identify
different sources of domain shift between image and video saliency data and
between different video saliency datasets as a key challenge for effective
joint modeling. To address this, we propose four novel domain adaptation
techniques - Domain-Adaptive Priors, Domain-Adaptive Fusion, Domain-Adaptive
Smoothing and Bypass-RNN - in addition to an improved formulation of learned
Gaussian priors. We integrate these techniques into a simple and lightweight
encoder-RNN-decoder-style network, UNISAL, and train it jointly with image and
video saliency data. We evaluate our method on the video saliency datasets
DHF1K, Hollywood-2 and UCF-Sports, and the image saliency datasets SALICON and
MIT300. With one set of parameters, UNISAL achieves state-of-the-art
performance on all video saliency datasets and is on par with the
state-of-the-art for image saliency datasets, despite faster runtime and a 5 to
20-fold smaller model size compared to all competing deep methods. We provide
retrospective analyses and ablation studies which confirm the importance of the
domain shift modeling. The code is available at
https://github.com/rdroste/unisal
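The abstract names the main architectural ideas but does not spell them out. The sketch below, written in PyTorch to match the framework of the linked repository, illustrates two of them under stated assumptions: a domain-adaptive learned Gaussian prior (per-dataset learnable means and scales evaluated analytically on a coordinate grid) and a Bypass-RNN that skips the recurrent unit for static-image batches. Class names, tensor shapes, and the recurrent-cell interface are illustrative assumptions, not the authors' implementation; see the repository above for the actual code.
```python
# Minimal sketch (not the official UNISAL code) of two ideas named in the abstract.
import torch
import torch.nn as nn


class DomainAdaptiveGaussianPrior(nn.Module):
    """Per-domain Gaussian prior maps with learnable means and log-scales,
    evaluated on a normalized coordinate grid (assumed formulation)."""

    def __init__(self, num_domains: int, num_gaussians: int = 16):
        super().__init__()
        # One (mu_x, mu_y) and (log_sigma_x, log_sigma_y) pair per domain and Gaussian.
        self.mu = nn.Parameter(torch.rand(num_domains, num_gaussians, 2))
        self.log_sigma = nn.Parameter(torch.zeros(num_domains, num_gaussians, 2))

    def forward(self, domain: int, height: int, width: int) -> torch.Tensor:
        ys = torch.linspace(0.0, 1.0, height, device=self.mu.device)
        xs = torch.linspace(0.0, 1.0, width, device=self.mu.device)
        grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")      # each (H, W)
        grid = torch.stack([grid_x, grid_y], dim=-1)                # (H, W, 2)
        mu = self.mu[domain]                                        # (G, 2)
        sigma = self.log_sigma[domain].exp()                        # (G, 2)
        diff = grid.unsqueeze(0) - mu[:, None, None, :]             # (G, H, W, 2)
        dist = (diff / sigma[:, None, None, :]).pow(2).sum(dim=-1)  # (G, H, W)
        return torch.exp(-0.5 * dist)  # G prior maps to combine with decoder features


class BypassRNN(nn.Module):
    """Wraps a recurrent cell so static-image batches skip the recurrence,
    letting one set of weights serve both image and video saliency data."""

    def __init__(self, rnn_cell: nn.Module):
        super().__init__()
        # Assumed cell signature: rnn_cell(frame, hidden_or_None) -> hidden,
        # with hidden shaped like the input feature map.
        self.rnn_cell = rnn_cell

    def forward(self, features: torch.Tensor, is_video: bool) -> torch.Tensor:
        # features: (T, B, C, H, W); for image datasets T == 1.
        if not is_video:
            return features  # bypass: encoder features go straight to the decoder
        state, outputs = None, []
        for frame in features:          # iterate over the time dimension
            state = self.rnn_cell(frame, state)
            outputs.append(state)
        return torch.stack(outputs)     # (T, B, C, H, W)
```
Evaluating the Gaussians analytically keeps the prior differentiable in its means and scales while adding only a handful of parameters per domain, which is consistent with the small model size reported in the abstract.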
Related papers
- WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models [132.77237314239025]
Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos.
Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions.
We reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion.
Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach.
arXiv Detail & Related papers (2024-07-15T11:21:03Z) - AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods make significant breakthroughs by adapting Stable Diffusion for this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z) - VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization [115.64739269488965]
VimTS enhances the generalization ability of the model by achieving better synergy among different tasks.
We propose a synthetic video text dataset (VTD-368k) by leveraging the Content Deformation Fields (CoDeF) algorithm.
For video-level cross-domain adaptation, our method even surpasses the previous end-to-end video spotting method on ICDAR2015 video and DSText v2.
arXiv Detail & Related papers (2024-04-30T15:49:03Z) - GIM: Learning Generalizable Image Matcher From Internet Videos [18.974842517202365]
We propose GIM, a self-training framework for learning a single generalizable model based on any image matching architecture.
We also propose ZEB, the first zero-shot evaluation benchmark for image matching.
arXiv Detail & Related papers (2024-02-16T21:48:17Z) - Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large
Datasets [36.95521842177614]
We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation.
We identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning.
arXiv Detail & Related papers (2023-11-25T22:28:38Z) - Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models [52.93036326078229]
Off-the-shelf billion-scale datasets for image generation are available, but collecting similar video data of the same scale is still challenging.
In this work, we explore finetuning a pretrained image diffusion model with video data as a practical solution for the video synthesis task.
Our model, Preserve Your Own Correlation (PYoCo), attains SOTA zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks.
arXiv Detail & Related papers (2023-05-17T17:59:16Z) - ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z) - Understanding Road Layout from Videos as a Whole [82.30800791500869]
We formulate road layout understanding as a top-view road attributes prediction problem, with the goal of predicting these attributes for each frame both accurately and consistently.
We exploit three novel aspects: leveraging camera motions in videos, including context cues, and incorporating long-term video information.
arXiv Detail & Related papers (2020-07-02T00:59:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.