Panoramic Vision Transformer for Saliency Detection in 360{\deg} Videos
- URL: http://arxiv.org/abs/2209.08956v1
- Date: Mon, 19 Sep 2022 12:23:34 GMT
- Title: Panoramic Vision Transformer for Saliency Detection in 360{\deg} Videos
- Authors: Heeseung Yun, Sehun Lee, Gunhee Kim
- Abstract summary: We present a new framework named Panoramic Vision Transformer (PAVER).
We design the encoder using Vision Transformer with deformable convolution, which enables us to plug pretrained models from normal videos into our architecture without additional modules or finetuning.
We demonstrate the utility of our saliency prediction model with the omnidirectional video quality assessment task in VQA-ODV, where we consistently improve performance without any form of supervision.
- Score: 48.54829780502176
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 360$^\circ$ video saliency detection is one of the challenging benchmarks for
360$^\circ$ video understanding since non-negligible distortion and
discontinuity occur in the projection of any format of 360$^\circ$ videos, and
capture-worthy viewpoint in the omnidirectional sphere is ambiguous by nature.
We present a new framework named Panoramic Vision Transformer (PAVER). We
design the encoder using Vision Transformer with deformable convolution, which
enables us not only to plug pretrained models from normal videos into our
architecture without additional modules or finetuning but also to perform
geometric approximation only once, unlike previous deep CNN-based approaches.
Thanks to its powerful encoder, PAVER can learn the saliency from three simple
relative relations among local patch features, outperforming state-of-the-art
models for the Wild360 benchmark by large margins without supervision or
auxiliary information like class activation. We demonstrate the utility of our
saliency prediction model with the omnidirectional video quality assessment
task in VQA-ODV, where we consistently improve performance without any form of
supervision, including head movement.
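As a rough illustration of the encoder idea described in the abstract (a minimal sketch, not the authors' released code): the standard Conv2d patch embedding of a pretrained ViT is swapped for a deformable convolution whose offsets are derived once from the equirectangular geometry, so pretrained weights can be reused without extra modules or finetuning. The offset rule, module names, and dimensions below are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


def equirect_offsets(h_out, w_out, patch=16):
    """Hypothetical offset field (a toy stand-in for the paper's one-off geometric
    approximation): spread each kernel's horizontal taps wider towards the poles,
    where the equirectangular projection over-stretches content."""
    lat = torch.linspace(-math.pi / 2, math.pi / 2, h_out)          # latitude per patch row
    stretch = 1.0 / torch.cos(lat).clamp(min=0.2) - 1.0             # 0 at the equator, larger near poles
    rel_x = (torch.arange(patch) - (patch - 1) / 2).repeat(patch)   # x-position of each kernel tap
    offs = torch.zeros(2 * patch * patch, h_out, w_out)             # assumed (dy, dx) pairs per tap
    offs[1::2] = stretch.view(1, h_out, 1) * rel_x.view(-1, 1, 1)   # widen taps horizontally only
    return offs.unsqueeze(0)                                        # (1, 2*k*k, h_out, w_out)


class SphericalPatchEmbed(nn.Module):
    """Illustrative drop-in replacement for a ViT's Conv2d patch embedding."""

    def __init__(self, in_ch=3, dim=768, patch=16):
        super().__init__()
        self.patch = patch
        self.proj = DeformConv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, frame):                                        # frame: (B, 3, H, W) equirectangular
        h_out, w_out = frame.shape[-2] // self.patch, frame.shape[-1] // self.patch
        offs = equirect_offsets(h_out, w_out, self.patch).to(frame)  # in practice computed once and cached
        offs = offs.repeat(frame.shape[0], 1, 1, 1)
        feat = self.proj(frame, offs)                                # (B, dim, h_out, w_out)
        return feat.flatten(2).transpose(1, 2)                       # (B, N, dim) tokens for the ViT blocks
```

Since a deformable convolution with zero offsets reduces to an ordinary convolution, a pretrained ViT's patch-embedding weights can be copied into `proj` unchanged, which is in the spirit of the abstract's claim about reusing pretrained models; the resulting patch tokens could then be scored for saliency from relative relations among neighbouring features, as the abstract describes.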
Related papers
- Cascaded Dual Vision Transformer for Accurate Facial Landmark Detection [9.912884384424542]
This paper introduces a new facial landmark detector based on vision transformers, which consists of two unique designs: the Dual Vision Transformer (D-ViT) and Long Skip Connections (LSC).
We propose learning the interconnections between these linear bases to model the inherent geometric relations among landmarks via Channel-split ViT.
We also suggest using long skip connections to deliver low-level image features to all prediction blocks, thereby preventing useful information from being discarded by intermediate supervision.
arXiv Detail & Related papers (2024-11-08T07:26:39Z)
- MVSplat360: Feed-Forward 360 Scene Synthesis from Sparse Views [90.26609689682876]
We introduce MVSplat360, a feed-forward approach for 360° novel view synthesis (NVS) of diverse real-world scenes from only sparse observations.
This setting is inherently ill-posed due to the minimal overlap among input views and the limited visual information available.
Our model is end-to-end trainable and supports rendering arbitrary views with as few as 5 sparse input views.
arXiv Detail & Related papers (2024-11-07T17:59:31Z)
- 360VFI: A Dataset and Benchmark for Omnidirectional Video Frame Interpolation [13.122586587748218]
This paper introduces the benchmark dataset, 360VFI, for Omnidirectional Video Frame Interpolation.
We present a practical implementation that introduces a distortion prior from omnidirectional video into the network to modulate distortions.
arXiv Detail & Related papers (2024-07-19T06:50:24Z)
- Spherical Vision Transformer for 360-degree Video Saliency Prediction [17.948179628551376]
We propose a vision-transformer-based model for omnidirectional videos named SalViT360.
We introduce a spherical geometry-aware self-attention mechanism that is capable of effective omnidirectional video understanding.
Our approach is the first to employ tangent images for omnidirectional saliency prediction, and our experimental results on three ODV saliency datasets demonstrate its effectiveness compared to the state-of-the-art (a minimal tangent-image sampling sketch appears after this list).
arXiv Detail & Related papers (2023-08-24T18:07:37Z)
- Dual-path Adaptation from Image to Video Transformers [62.056751480114784]
We efficiently transfer the surpassing representation power of the vision foundation models, such as ViT and Swin, for video understanding with only a few trainable parameters.
We propose a novel DualPath adaptation separated into spatial and temporal adaptation paths, where a lightweight bottleneck adapter is employed in each transformer block.
arXiv Detail & Related papers (2023-03-17T09:37:07Z)
- SVFormer: Semi-supervised Video Transformer for Action Recognition [88.52042032347173]
We introduce SVFormer, which adopts a steady pseudo-labeling framework to cope with unlabeled video samples.
In addition, we propose a temporal warping to cover the complex temporal variation in videos.
In particular, SVFormer outperforms the state-of-the-art by 31.5% with fewer training epochs under the 1% labeling rate of Kinetics-400.
arXiv Detail & Related papers (2022-11-23T18:58:42Z)
- Multiview Transformers for Video Recognition [69.50552269271526]
We present Multiview Transformers for Video Recognition (MTV), which represents the input video with multiple views at different resolutions.
MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost.
We achieve state-of-the-art results on five standard datasets, and improve even further with large-scale pretraining.
arXiv Detail & Related papers (2022-01-12T03:33:57Z)
- Self-supervised Video Transformer [46.295395772938214]
From a given video, we create local and global views with varying spatial sizes and frame rates.
Our self-supervised objective seeks to match the features of different views representing the same video, so that they are invariant to spatiotemporal variations.
Our approach performs well on four action benchmarks and converges faster with small batch sizes.
arXiv Detail & Related papers (2021-12-02T18:59:02Z)
- Blind VQA on 360{\deg} Video via Progressively Learning from Pixels, Frames and Video [66.57045901742922]
Blind visual quality assessment (BVQA) on 360° video plays a key role in optimizing immersive multimedia systems.
In this paper, we take into account the progressive paradigm of human perception towards spherical video quality.
We propose a novel BVQA approach (namely ProVQA) for 360° video via progressively learning from pixels, frames and video.
arXiv Detail & Related papers (2021-11-18T03:45:13Z)
- Revisiting Optical Flow Estimation in 360 Videos [9.997208301312956]
We design LiteFlowNet360 as a domain adaptation framework from perspective video domain to 360 video domain.
We adapt it using simple kernel transformation techniques inspired by Kernel Transformer Network (KTN) to cope with the inherent distortion in 360 videos.
Experimental results show promising performance for 360 video optical flow estimation using the proposed architecture.
arXiv Detail & Related papers (2020-10-15T22:22:21Z)
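The SalViT360 entry above mentions tangent images for omnidirectional inputs; as referenced there, the following is a minimal sketch of the standard construction behind tangent-image approaches: sampling a perspective (tangent-plane) view from an equirectangular frame via the inverse gnomonic projection. This is not the SalViT360 authors' code; the field of view, resolution, and function name are assumptions.

```python
import math
import torch
import torch.nn.functional as F


def tangent_image(equi, lat0, lon0, fov=math.pi / 4, size=224):
    """Sample a size x size tangent-plane (perspective) view centred at
    (lat0, lon0) radians from an equirectangular frame `equi` of shape (B, C, H, W)."""
    half = math.tan(fov / 2.0)                               # half-extent of the tangent plane
    ys = torch.linspace(half, -half, size)                   # top row of the view looks "up"
    xs = torch.linspace(-half, half, size)
    y, x = torch.meshgrid(ys, xs, indexing="ij")             # tangent-plane coordinates

    # Inverse gnomonic projection: tangent-plane point -> (lat, lon) on the sphere.
    k = 1.0 / torch.sqrt(1.0 + x ** 2 + y ** 2)
    lat = torch.asin((k * (math.sin(lat0) + y * math.cos(lat0))).clamp(-1.0, 1.0))
    lon = lon0 + torch.atan2(x, math.cos(lat0) - y * math.sin(lat0))

    # Map (lat, lon) to normalised equirectangular sampling coordinates in [-1, 1].
    u = (lon / math.pi + 1.0).remainder(2.0) - 1.0           # wrap longitude across the seam
    v = -lat / (math.pi / 2.0)
    grid = torch.stack([u, v], dim=-1).to(equi)              # (size, size, 2)
    grid = grid.unsqueeze(0).expand(equi.shape[0], -1, -1, -1)
    return F.grid_sample(equi, grid, align_corners=False)
```

A set of such views taken at tangent points spread over the sphere typically gives low-distortion inputs that a standard vision transformer can consume, with predictions mapped back onto the sphere afterwards.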
This list is automatically generated from the titles and abstracts of the papers in this site.