A Plug-and-play Scheme to Adapt Image Saliency Deep Model for Video Data
- URL: http://arxiv.org/abs/2008.09103v1
- Date: Sun, 2 Aug 2020 13:23:14 GMT
- Title: A Plug-and-play Scheme to Adapt Image Saliency Deep Model for Video Data
- Authors: Yunxiao Li, Shuai Li, Chenglizhao Chen, Aimin Hao, Hong Qin
- Abstract summary: This paper proposes a novel plug-and-play scheme to weakly retrain a pretrained image saliency deep model for video data.
Our method is simple yet effective for adapting any off-the-shelf pre-trained image saliency deep model to obtain high-quality video saliency detection.
- Score: 54.198279280967185
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid development of deep learning techniques, image saliency deep
models trained solely by spatial information have occasionally achieved
detection performance for video data comparable to that of the models trained
by both spatial and temporal information. However, due to the lesser
consideration of temporal information, the image saliency deep models may
become fragile in the video sequences dominated by temporal information. Thus,
the most recent video saliency detection approaches have adopted the network
architecture starting with a spatial deep model that is followed by an
elaborately designed temporal deep model. However, such methods easily
encounter the performance bottleneck arising from the single stream learning
methodology, so the overall detection performance is largely determined by the
spatial deep model. In sharp contrast to the current mainstream methods, this
paper proposes a novel plug-and-play scheme to weakly retrain a pretrained
image saliency deep model for video data by using the newly sensed and coded
temporal information. Thus, the retrained image saliency deep model will be
able to maintain temporal saliency awareness, achieving much improved detection
performance. Moreover, our method is simple yet effective for adapting any
off-the-shelf pre-trained image saliency deep model to obtain high-quality
video saliency detection. Additionally, both the data and source code of our
method are publicly available.
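To make the adaptation idea concrete, below is a minimal, hypothetical sketch (PyTorch) of weakly retraining an off-the-shelf image saliency model on unlabeled video: motion cues sensed from consecutive frames are fused with the model's own spatial predictions into pseudo labels, and the model is fine-tuned against them. The model interface, the frame-difference motion cue, and the fusion rule are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch: weakly retraining a pretrained image saliency model on
# video frames with motion-derived pseudo labels. `model` is assumed to map a
# [B,3,H,W] frame to a [B,1,H,W] saliency logit map; all helpers are placeholders.
import torch
import torch.nn.functional as F

def flow_magnitude(prev_frame, cur_frame):
    """Stand-in for optical flow: a [B,1,H,W] motion-strength map from frame differencing."""
    return (cur_frame - prev_frame).abs().mean(dim=1, keepdim=True)

def make_pseudo_label(spatial_pred, motion, alpha=0.5):
    """Fuse the spatial prediction with the motion map into a weak temporal-aware label."""
    fused = alpha * spatial_pred + (1 - alpha) * motion / (motion.max() + 1e-6)
    return (fused > fused.mean()).float()

def weak_retrain(model, frame_pairs, epochs=1, lr=1e-5):
    """Fine-tune an off-the-shelf image saliency model on (prev, cur) frame pairs."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for prev_f, cur_f in frame_pairs:
            with torch.no_grad():                       # build the pseudo target
                target = make_pseudo_label(torch.sigmoid(model(cur_f)),
                                           flow_magnitude(prev_f, cur_f))
            pred = torch.sigmoid(model(cur_f))          # prediction with gradients
            loss = F.binary_cross_entropy(pred, target)
            opt.zero_grad(); loss.backward(); opt.step()
    return model
```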
Related papers
- Buffer Anytime: Zero-Shot Video Depth and Normal from Image Priors [54.8852848659663]
Buffer Anytime is a framework for estimation of depth and normal maps (which we call geometric buffers) from video.
We demonstrate high-quality video buffer estimation by leveraging single-image priors with temporal consistency constraints.
arXiv Detail & Related papers (2024-11-26T09:28:32Z)
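As a rough illustration of the temporal consistency constraint mentioned in the Buffer Anytime entry above, the sketch below penalises changes between per-frame predictions of an image-only prior; real pipelines typically warp neighbouring frames with optical flow before comparing, so the direct frame-to-frame penalty is a simplifying assumption.

```python
# Simplified temporal-consistency penalty over per-frame geometric-buffer
# predictions (e.g. depth) produced by a single-image model. Illustrative only.
import torch

def temporal_consistency_loss(preds):
    """preds: [T, 1, H, W] per-frame predictions; L1 penalty on frame-to-frame change."""
    return (preds[1:] - preds[:-1]).abs().mean()
```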
- Flatten: Video Action Recognition is an Image Classification task [15.518011818978074]
A novel video representation architecture, Flatten, serves as a plug-and-play module that can be seamlessly integrated into any image-understanding network.
Experiments on commonly used datasets have demonstrated that embedding Flatten provides significant performance improvements over the original models.
arXiv Detail & Related papers (2024-08-17T14:59:58Z)
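For the Flatten entry above, a hypothetical sketch of the underlying idea is given below: tile the frames of a clip into one large 2D image so that an unmodified image-understanding network can consume the whole video at once. The grid layout and the flatten_clip helper are assumptions for illustration, not the paper's exact transformation.

```python
# Illustrative "video as one image" flattening: tile T frames into a 2D grid.
# Assumes T is divisible by grid_w; layout is an assumption, not the paper's scheme.
import torch

def flatten_clip(clip, grid_w=4):
    """clip: [T, C, H, W] -> [C, (T//grid_w)*H, grid_w*W] tiled image."""
    t, c, h, w = clip.shape
    grid_h = t // grid_w
    tiled = clip[: grid_h * grid_w].reshape(grid_h, grid_w, c, h, w)
    return tiled.permute(2, 0, 3, 1, 4).reshape(c, grid_h * h, grid_w * w)

# Usage (hypothetical): logits = image_model(flatten_clip(video_clip).unsqueeze(0))
```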
- Learning Temporally Consistent Video Depth from Video Diffusion Priors [57.929828486615605]
This work addresses the challenge of video depth estimation.
We reformulate the prediction task into a conditional generation problem.
This allows us to leverage the prior knowledge embedded in existing video generation models.
arXiv Detail & Related papers (2024-06-03T16:20:24Z)
- DepthFM: Fast Monocular Depth Estimation with Flow Matching [22.206355073676082]
Current discriminative approaches to this problem are limited due to blurry artifacts.
State-of-the-art generative methods suffer from slow sampling due to their SDE nature.
We observe that this can be effectively framed using flow matching, since its straight trajectories through solution space offer efficiency and high quality.
arXiv Detail & Related papers (2024-03-20T17:51:53Z)
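To unpack the flow-matching claim in the DepthFM entry above, here is a generic (not DepthFM-specific) flow-matching training step: sample a point on the straight path between noise and a data sample and regress the constant velocity along that path. velocity_net is a hypothetical network taking the interpolated sample and the time step.

```python
# Generic flow-matching step: straight trajectories from noise x0 to data x1,
# with the network regressing the constant velocity (x1 - x0). Illustrative only.
import torch
import torch.nn.functional as F

def flow_matching_step(velocity_net, x1, opt):
    x0 = torch.randn_like(x1)                              # source (noise) sample
    t = torch.rand(x1.shape[0], 1, 1, 1, device=x1.device) # per-sample time in [0,1)
    xt = (1 - t) * x0 + t * x1                             # point on the straight path
    loss = F.mse_loss(velocity_net(xt, t), x1 - x0)        # match the constant velocity
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```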
- Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling [28.530765643908083]
We decouple spatial-temporal modeling and integrate an image-language and a video-language model to learn fine-grained visual understanding.
We propose a novel pre-training objective, Temporal Referring Modeling, which requires the model to identify temporal positions of events in video sequences.
Our model outperforms previous work pre-trained on orders of magnitude larger datasets.
arXiv Detail & Related papers (2022-10-08T07:03:31Z)
- Why-So-Deep: Towards Boosting Previously Trained Models for Visual Place Recognition [12.807343105549409]
We present an intelligent method, MAQBOOL, to amplify the power of pre-trained models for better image recall.
We achieve comparable image retrieval results at a low descriptor dimension (512-D), compared to the high descriptor dimension (4096-D) of state-of-the-art methods.
arXiv Detail & Related papers (2022-01-10T08:39:06Z)
- ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
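As a rough sketch of the spatio-temporal token extraction described in the ViViT entry above, the module below embeds non-overlapping tubelets with a 3D convolution and encodes the token sequence with standard transformer layers; positional embeddings, the classification head, and ViViT's factorised variants are omitted, and all sizes are illustrative.

```python
# Illustrative tubelet tokenisation + transformer encoding in the spirit of ViViT.
# Positional embeddings and the classification head are omitted for brevity.
import torch
import torch.nn as nn

class TubeletEncoder(nn.Module):
    def __init__(self, dim=192, tubelet=(2, 16, 16), layers=4, heads=3):
        super().__init__()
        # Each non-overlapping T x H x W tubelet becomes one token of size `dim`.
        self.proj = nn.Conv3d(3, dim, kernel_size=tubelet, stride=tubelet)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=layers)

    def forward(self, video):                       # video: [B, 3, T, H, W]
        tokens = self.proj(video)                   # [B, dim, T', H', W']
        tokens = tokens.flatten(2).transpose(1, 2)  # [B, N, dim] token sequence
        return self.encoder(tokens)
```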
- Visualising Deep Network's Time-Series Representations [93.73198973454944]
Despite the popularisation of machine learning models, more often than not they still operate as black boxes with no insight into what is happening inside the model.
In this paper, a method that addresses that issue is proposed, with a focus on visualising multi-dimensional time-series data.
Experiments on a high-frequency stock market dataset show that the method provides fast and discernible visualisations.
arXiv Detail & Related papers (2021-03-12T09:53:34Z)
- Cascaded Deep Video Deblurring Using Temporal Sharpness Prior [88.98348546566675]
The proposed algorithm mainly consists of optical flow estimation from intermediate latent frames and latent frame restoration steps.
It first develops a deep CNN model to estimate optical flow from intermediate latent frames and then restores the latent frames based on the estimated optical flow.
We show that exploring the domain knowledge of video deblurring is able to make the deep CNN model more compact and efficient.
arXiv Detail & Related papers (2020-04-06T09:13:49Z)
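A simplified, single-stage sketch of the estimate-flow-then-restore idea from the cascaded video deblurring entry above: warp a neighbouring latent frame to the current one using the estimated optical flow, then feed both to a restoration CNN. flow_net and restore_net are hypothetical placeholders for the paper's actual networks.

```python
# Simplified flow-guided restoration step. `flow_net(prev, cur)` is assumed to
# return backward flow [B,2,H,W] in (dx, dy) pixel units; `restore_net` maps the
# 6-channel concatenation back to a restored 3-channel frame. Illustrative only.
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Backward-warp `frame` [B,3,H,W] by `flow` [B,2,H,W] with grid_sample."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=frame.device),
                            torch.arange(w, device=frame.device), indexing='ij')
    coords = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow  # [B,2,H,W]
    grid = torch.stack((2 * coords[:, 0] / (w - 1) - 1,                # x in [-1,1]
                        2 * coords[:, 1] / (h - 1) - 1), dim=-1)       # [B,H,W,2]
    return F.grid_sample(frame, grid, align_corners=True)

def restore_step(flow_net, restore_net, prev_latent, cur_latent):
    flow = flow_net(prev_latent, cur_latent)             # estimate flow between latents
    aligned = warp(prev_latent, flow)                     # align neighbour to current
    return restore_net(torch.cat((aligned, cur_latent), dim=1))
```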