TinyHD: Efficient Video Saliency Prediction with Heterogeneous Decoders
using Hierarchical Maps Distillation
- URL: http://arxiv.org/abs/2301.04619v1
- Date: Wed, 11 Jan 2023 18:20:19 GMT
- Title: TinyHD: Efficient Video Saliency Prediction with Heterogeneous Decoders
using Hierarchical Maps Distillation
- Authors: Feiyan Hu, Simone Palazzo, Federica Proietto Salanitri, Giovanni
Bellitto, Morteza Moradi, Concetto Spampinato, Kevin McGuinness
- Abstract summary: We propose a lightweight model that employs multiple simple heterogeneous decoders.
Our approach achieves saliency prediction accuracy on par with or better than state-of-the-art methods.
- Score: 16.04961815178485
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video saliency prediction has recently attracted the attention of the research
community, as it is an upstream task for several practical applications.
However, current solutions are particularly computationally demanding,
especially due to the wide usage of spatio-temporal 3D convolutions. We observe
that, while different model architectures achieve similar performance on
benchmarks, visual variations between predicted saliency maps are still
significant. Inspired by this intuition, we propose a lightweight model that
employs multiple simple heterogeneous decoders and adopts several practical
approaches to improve accuracy while keeping computational costs low, such as
hierarchical multi-map knowledge distillation, multi-output saliency
prediction, unlabeled auxiliary datasets and channel reduction with teacher
assistant supervision. Our approach achieves saliency prediction accuracy on
par with or better than state-of-the-art methods on DHF1K, UCF-Sports and Hollywood2
benchmarks, while significantly enhancing the efficiency of the model. Code is
available at https://github.com/feiyanhu/tinyHD
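To make the recipe concrete, here is a minimal PyTorch sketch of the overall idea: a shared lightweight encoder feeds several cheap, structurally different decoders, each emitting its own saliency map that is supervised against a teacher's maps. All module names, shapes and losses below are illustrative assumptions, not the authors' implementation (see the repository above for that).

```python
# Minimal sketch: shared lightweight encoder, several cheap heterogeneous
# decoders, multi-output saliency prediction, and a simple map-distillation
# loss. Names and shapes are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiDecoderSaliency(nn.Module):
    def __init__(self, feat_ch=64, num_decoders=3):
        super().__init__()
        # 2D encoder stand-in (the paper's point is to avoid heavy 3D convs).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        # "Heterogeneous" reduced to different kernel sizes for illustration.
        self.decoders = nn.ModuleList(
            nn.Conv2d(feat_ch, 1, k, padding=k // 2)
            for k in (1, 3, 5)[:num_decoders])

    def forward(self, x):
        feats = self.encoder(x)
        # Multi-output prediction: one saliency map per decoder.
        return [torch.sigmoid(d(feats)) for d in self.decoders]

def distillation_loss(student_maps, teacher_maps):
    """Match each student map to a teacher map (hierarchical multi-map
    distillation collapsed to a single level for brevity)."""
    loss = 0.0
    for s, t in zip(student_maps, teacher_maps):
        s = F.interpolate(s, size=t.shape[-2:], mode='bilinear',
                          align_corners=False)
        loss = loss + F.binary_cross_entropy(s, t)  # teacher maps in [0, 1]
    return loss / len(student_maps)
```

The point of the multiple decoders is that several cheap predictors can disagree usefully, and their outputs can all be distilled from the teacher's hierarchy of maps at low cost.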
Related papers
- Clover-2: Accurate Inference for Regressive Lightweight Speculative Decoding [8.046705062670096]
Regressive lightweight speculative decoding has garnered attention for its notable efficiency improvements in text generation tasks.
Clover-2 is an RNN-based draft model designed to achieve accuracy comparable to that of attention-based decoder models.
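For background, speculative decoding alternates between a cheap draft model that proposes a few tokens and the full target model that verifies them in a single parallel pass. A minimal greedy sketch of that generic loop follows; `draft` and `target` are hypothetical stand-ins, and this is not Clover-2's RNN architecture.

```python
# Generic greedy draft-and-verify loop behind speculative decoding.
# `draft` and `target` are hypothetical callables returning logits of
# shape (batch, seq_len, vocab); this is NOT Clover-2's actual code.
import torch

@torch.no_grad()
def speculative_step(target, draft, tokens, k=4):
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal = tokens
    for _ in range(k):
        nxt = draft(proposal)[:, -1].argmax(-1, keepdim=True)
        proposal = torch.cat([proposal, nxt], dim=-1)
    # 2. The target model scores every position in one parallel pass.
    preds = target(proposal).argmax(-1)
    # 3. Accept the longest prefix where the target agrees with the draft
    #    (batch size 1 assumed to keep the check simple).
    n = tokens.shape[1]
    accepted = tokens
    for i in range(n, proposal.shape[1]):
        if proposal[0, i] != preds[0, i - 1]:
            break
        accepted = proposal[:, : i + 1]
    # 4. Always gain one more token from the target at the break point.
    fix = preds[:, accepted.shape[1] - 1 : accepted.shape[1]]
    return torch.cat([accepted, fix], dim=-1)
```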
arXiv Detail & Related papers (2024-08-01T03:43:32Z)
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
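The even spread of features over clusters is the kind of balanced assignment typically computed with a few Sinkhorn-Knopp normalization steps; a small illustrative sketch (not the authors' code) follows.

```python
# Illustrative Sinkhorn-Knopp balancing of a feature-to-cluster score
# matrix, in the spirit of Sinkhorn-guided clustering (not SIGMA's code).
import torch

def sinkhorn_assign(scores, n_iters=3, eps=0.05):
    """scores: (num_features, num_clusters) similarity logits.
    Returns soft assignments whose cluster masses are roughly equal."""
    q = torch.exp(scores / eps)
    q = q / q.sum()
    n, k = q.shape
    for _ in range(n_iters):
        q = q / (q.sum(dim=0, keepdim=True) * k)  # equalize cluster mass
        q = q / (q.sum(dim=1, keepdim=True) * n)  # one unit per feature
    return q * n  # each row sums to ~1: a soft assignment per feature
```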
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
- Gait Recognition in the Wild with Multi-hop Temporal Switch [81.35245014397759]
Gait recognition in the wild is a more practical problem that has attracted the attention of the multimedia and computer vision community.
This paper presents a novel multi-hop temporal switch method to achieve effective temporal modeling of gait patterns in real-world scenes.
arXiv Detail & Related papers (2022-09-01T10:46:09Z)
- From Single to Multiple: Leveraging Multi-level Prediction Spaces for Video Forecasting [37.322499502542556]
We study numerous strategies to perform video forecasting in multi-prediction spaces and fuse their results together to boost performance.
We show that our model significantly reduces distortions and blurry artifacts and brings remarkable accuracy improvements in long-term video prediction.
arXiv Detail & Related papers (2021-07-21T13:23:16Z)
- When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable.
To achieve better accuracy, we propose two lightweight modules.
DQInit dynamically initializes the decoder queries from the inputs, enabling the model to achieve accuracy as good as models with multiple decoder layers.
QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
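A rough PyTorch sketch of the two modules as described; every name and shape here is an assumption for illustration, not the paper's code.

```python
# Rough sketch of DQInit- and QAMem-style modules as summarized above;
# all names and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class DynamicQueryInit(nn.Module):
    """DQInit-style: derive decoder queries from the input features
    instead of using static learned embeddings."""
    def __init__(self, feat_ch, num_queries, dim):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(feat_ch, num_queries * dim)
        self.num_queries, self.dim = num_queries, dim

    def forward(self, feats):                    # feats: (B, C, H, W)
        g = self.pool(feats).flatten(1)          # (B, C) global context
        return self.proj(g).view(-1, self.num_queries, self.dim)

class QueryAwareMemory(nn.Module):
    """QAMem-style: give each query its own value projection over the
    low-resolution feature map rather than one shared projection."""
    def __init__(self, num_queries, feat_ch, dim):
        super().__init__()
        self.values = nn.ModuleList(
            nn.Linear(feat_ch, dim) for _ in range(num_queries))

    def forward(self, feats):                    # feats: (B, HW, C)
        # One memory bank per query: (B, num_queries, HW, dim)
        return torch.stack([v(feats) for v in self.values], dim=1)
```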
arXiv Detail & Related papers (2021-05-27T13:51:42Z)
- Is Space-Time Attention All You Need for Video Understanding? [50.78676438502343]
We present a convolution-free approach to video classification built exclusively on self-attention over space and time.
"TimeSformer" adapts the standard Transformer architecture to video by enabling feature learning from a sequence of frame-level patches.
TimeSformer achieves state-of-the-art results on several major action recognition benchmarks.
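The frame-level patch tokenization at the core of this design can be sketched briefly; the divided space-time attention itself is omitted, and all names and sizes are illustrative.

```python
# Sketch of ViT/TimeSformer-style frame-level patch tokenization; the
# divided space-time attention is omitted and names are illustrative.
import torch
import torch.nn as nn

class VideoPatchEmbed(nn.Module):
    """Turn a clip (B, T, 3, H, W) into a token sequence with one token
    per 16x16 patch of every frame."""
    def __init__(self, patch=16, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, clip):
        b, t, c, h, w = clip.shape
        x = self.proj(clip.reshape(b * t, c, h, w))  # (B*T, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)             # (B*T, N, dim)
        return x.reshape(b, -1, x.shape[-1])         # (B, T*N, dim) tokens
```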
arXiv Detail & Related papers (2021-02-09T19:49:33Z)
- Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks [87.50632573601283]
We present a novel method for multi-view depth estimation from a single video.
Our method achieves temporally coherent depth estimation results by using a novel Epipolar Spatio-Temporal (EST) transformer.
To reduce the computational cost, inspired by recent Mixture-of-Experts models, we design a compact hybrid network.
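For reference, the Mixture-of-Experts pattern cited as inspiration uses a gate to softly route each input across several small expert networks; the block below is a generic illustration of that idea, not the paper's hybrid network.

```python
# Generic Mixture-of-Experts gating, shown only to illustrate the idea
# the authors cite as inspiration; not the paper's hybrid network.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dim, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(num_experts))

    def forward(self, x):                        # x: (B, dim)
        w = torch.softmax(self.gate(x), dim=-1)  # routing weights
        outs = torch.stack([e(x) for e in self.experts], dim=1)
        return (w.unsqueeze(-1) * outs).sum(dim=1)  # weighted expert mix
```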
arXiv Detail & Related papers (2020-11-26T04:04:21Z)
- A Compact Deep Architecture for Real-time Saliency Prediction [42.58396452892243]
Saliency models aim to imitate the attention mechanism in the human visual system.
Deep models have a large number of parameters, which makes them less suitable for real-time applications.
Here we propose a compact yet fast model for real-time saliency prediction.
arXiv Detail & Related papers (2020-08-30T17:47:16Z)
- A Real-time Action Representation with Temporal Encoding and Deep Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high process speed.
Our method improves accuracy on the UCF101 action recognition benchmark by 5.4% over state-of-the-art real-time methods, runs 2 times faster at inference, and requires less than 5 MB of model storage.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)