Spherical Vision Transformer for 360-degree Video Saliency Prediction
- URL: http://arxiv.org/abs/2308.13004v1
- Date: Thu, 24 Aug 2023 18:07:37 GMT
- Title: Spherical Vision Transformer for 360-degree Video Saliency Prediction
- Authors: Mert Cokelek, Nevrez Imamoglu, Cagri Ozcinar, Erkut Erdem, Aykut Erdem
- Abstract summary: We propose a vision-transformer-based model for omnidirectional videos named SalViT360.
We introduce a spherical geometry-aware self-attention mechanism that is capable of effective omnidirectional video understanding.
Our approach is the first to employ tangent images for omnidirectional saliency prediction, and our experimental results on three ODV saliency datasets demonstrate its effectiveness compared to the state-of-the-art.
- Score: 17.948179628551376
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The growing interest in omnidirectional videos (ODVs) that capture the full
field-of-view (FOV) has made 360-degree saliency prediction increasingly important in
computer vision. However, predicting where humans look in 360-degree scenes
presents unique challenges, including spherical distortion, high resolution,
and limited labelled data. We propose a novel vision-transformer-based model
for omnidirectional videos named SalViT360 that leverages tangent image
representations. We introduce a spherical geometry-aware spatiotemporal
self-attention mechanism that is capable of effective omnidirectional video
understanding. Furthermore, we present a consistency-based unsupervised
regularization term for projection-based 360-degree dense-prediction models to
reduce artefacts in the predictions that occur after inverse projection. Our
approach is the first to employ tangent images for omnidirectional saliency
prediction, and our experimental results on three ODV saliency datasets
demonstrate its effectiveness compared to the state-of-the-art.
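The two main ingredients named in the abstract, tangent-image representations and an unsupervised consistency regularizer applied after inverse projection, can be illustrated with a small sketch. The code below is not the authors' implementation: the patch layout, the 80° field of view, the ERP resolution, and all function names are illustrative assumptions. It inverse-projects per-tangent-patch saliency predictions onto the equirectangular (ERP) grid via the gnomonic projection and penalizes the variance of overlapping patch predictions, which is one plausible way to realize the consistency term described above.

```python
# Minimal sketch (not the authors' code) of tangent-image handling and a
# consistency penalty on overlapping patch predictions after inverse projection.
import math
import torch
import torch.nn.functional as F


def erp_lonlat(height, width):
    """Longitude/latitude (radians) of every ERP pixel centre.

    Row 0 is the north pole (lat = +pi/2); column 0 is lon = -pi.
    """
    lat = torch.linspace(math.pi / 2, -math.pi / 2, height)
    lon = torch.linspace(-math.pi, math.pi, width)
    lat, lon = torch.meshgrid(lat, lon, indexing="ij")
    return lon, lat


def tangent_to_erp(pred, lon0, lat0, fov, erp_h, erp_w):
    """Inverse-project one tangent-patch prediction onto the ERP grid.

    pred: (1, 1, P, P) saliency map predicted on a tangent patch centred at
    (lon0, lat0) with the given field of view (radians).
    Returns the resampled ERP map and the patch's footprint mask.
    """
    lon, lat = erp_lonlat(erp_h, erp_w)
    # Gnomonic projection: sphere -> tangent plane at (lon0, lat0).
    cos_c = (math.sin(lat0) * torch.sin(lat)
             + math.cos(lat0) * torch.cos(lat) * torch.cos(lon - lon0))
    x = torch.cos(lat) * torch.sin(lon - lon0) / cos_c.clamp(min=1e-6)
    y = (math.cos(lat0) * torch.sin(lat)
         - math.sin(lat0) * torch.cos(lat) * torch.cos(lon - lon0)) / cos_c.clamp(min=1e-6)
    half = math.tan(fov / 2)
    valid = ((cos_c > 0) & (x.abs() <= half) & (y.abs() <= half)).float()
    # Normalise plane coordinates to [-1, 1] and resample the patch prediction.
    grid = torch.stack([x / half, -y / half], dim=-1).unsqueeze(0)  # (1, H, W, 2)
    erp_pred = F.grid_sample(pred, grid, align_corners=True).squeeze(0).squeeze(0)
    return erp_pred * valid, valid


def tangent_consistency_loss(patch_preds, centres, fov, erp_h=256, erp_w=512):
    """Penalise disagreement where the inverse-projected patches overlap."""
    maps, masks = [], []
    for pred, (lon0, lat0) in zip(patch_preds, centres):
        m, v = tangent_to_erp(pred, lon0, lat0, fov, erp_h, erp_w)
        maps.append(m)
        masks.append(v)
    maps, masks = torch.stack(maps), torch.stack(masks)   # (T, H, W)
    coverage = masks.sum(dim=0)                           # patches covering each pixel
    overlap = coverage >= 2
    mean = maps.sum(dim=0) / coverage.clamp(min=1)
    # Variance of the per-patch predictions at each overlapping ERP pixel.
    var = ((maps - mean) ** 2 * masks).sum(dim=0) / coverage.clamp(min=1)
    return var[overlap].mean()


if __name__ == "__main__":
    # Toy example: 6 tangent patches spaced around the equator, random "predictions".
    fov = math.radians(80)
    centres = [(lon, 0.0) for lon in torch.linspace(-math.pi, math.pi, 7)[:-1].tolist()]
    patch_preds = [torch.rand(1, 1, 64, 64, requires_grad=True) for _ in centres]
    loss = tangent_consistency_loss(patch_preds, centres, fov)
    loss.backward()
    print(f"consistency loss: {loss.item():.4f}")
```

In a full model, `patch_preds` would come from the saliency head of the transformer rather than random tensors, and the consistency term would be added to the supervised saliency loss with a weighting coefficient; the gnomonic mapping and the variance-over-overlaps penalty are only one way such a regularizer could be set up.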
Related papers
- Forecasting Future Videos from Novel Views via Disentangled 3D Scene Representation [54.60804602905519]
Rather than learning an entangled representation that models layered scene geometry, motion forecasting, and novel view synthesis together, our approach disentangles scene geometry from scene motion by lifting the 2D scene to 3D point clouds.
To model future 3D scene motion, we propose a disentangled two-stage approach that initially forecasts ego-motion and subsequently the residual motion of dynamic objects.
arXiv Detail & Related papers (2024-07-31T08:54:50Z) - 360VFI: A Dataset and Benchmark for Omnidirectional Video Frame Interpolation [13.122586587748218]
This paper introduces the benchmark dataset, 360VFI, for Omnidirectional Video Frame Interpolation.
We present a practical implementation that introduces a distortion prior from omnidirectional video into the network to modulate distortions.
arXiv Detail & Related papers (2024-07-19T06:50:24Z) - PanoNormal: Monocular Indoor 360° Surface Normal Estimation [12.992217830651988]
PanoNormal is a monocular surface normal estimation architecture designed for 360° images.
We employ a multi-level global self-attention scheme with the consideration of the spherical feature distribution.
Our results demonstrate that our approach achieves state-of-the-art performance across multiple popular 360° monocular datasets.
arXiv Detail & Related papers (2024-05-29T04:07:14Z) - Sp2360: Sparse-view 360 Scene Reconstruction using Cascaded 2D Diffusion Priors [51.36238367193988]
We tackle sparse-view reconstruction of a 360° 3D scene using priors from latent diffusion models (LDMs).
We present SparseSplat360, a method that employs a cascade of in-painting and artifact removal models to fill in missing details and clean novel views.
Our method generates entire 360 scenes from as few as 9 input views, with a high degree of foreground and background detail.
arXiv Detail & Related papers (2024-05-26T11:01:39Z) - HawkI: Homography & Mutual Information Guidance for 3D-free Single Image to Aerial View [67.8213192993001]
We present HawkI, for synthesizing aerial-view images from text and an exemplar image.
HawkI blends the visual features from the input image within a pretrained text-to-2D-image stable diffusion model.
At inference, HawkI employs a unique mutual information guidance formulation to steer the generated image towards faithfully replicating the semantic details of the input image.
arXiv Detail & Related papers (2023-11-27T01:41:25Z) - Deceptive-NeRF/3DGS: Diffusion-Generated Pseudo-Observations for High-Quality Sparse-View Reconstruction [60.52716381465063]
We introduce Deceptive-NeRF/3DGS to enhance sparse-view reconstruction with only a limited set of input images.
Specifically, we propose a deceptive diffusion model turning noisy images rendered from few-view reconstructions into high-quality pseudo-observations.
Our system progressively incorporates diffusion-generated pseudo-observations into the training image sets, ultimately densifying the sparse input observations by 5 to 10 times.
arXiv Detail & Related papers (2023-05-24T14:00:32Z) - Panoramic Vision Transformer for Saliency Detection in 360° Videos [48.54829780502176]
We present a new framework named Panoramic Vision Transformer (PAVER).
We design the encoder using Vision Transformer with deformable convolution, which enables us to plug pretrained models from normal videos into our architecture without additional modules or finetuning.
We demonstrate the utility of our saliency prediction model with the omnidirectional video quality assessment task in VQA-ODV, where we consistently improve performance without any form of supervision.
arXiv Detail & Related papers (2022-09-19T12:23:34Z) - Spherical Convolution empowered FoV Prediction in 360-degree Video Multicast with Limited FoV Feedback [16.716422953229088]
Field of view (FoV) prediction is critical in 360-degree video multicast.
This paper proposes a spherical convolution-empowered FoV prediction method.
The experimental results show that the performance of the proposed method is better than other prediction methods.
arXiv Detail & Related papers (2022-01-29T08:32:19Z) - ATSal: An Attention Based Architecture for Saliency Prediction in 360° Videos [5.831115928056554]
This paper proposes ATSal, a novel attention-based (head-eye) saliency model for 360-degree videos.
We compare the proposed approach to other state-of-the-art saliency models on two datasets: Salient360! and VR-EyeTracking.
Experimental results on over 80 ODV videos (75K+ frames) show that the proposed method outperforms the existing state-of-the-art.
arXiv Detail & Related papers (2020-11-20T19:19:48Z) - Perceptual Quality Assessment of Omnidirectional Images as Moving Camera Videos [49.217528156417906]
Two types of VR viewing conditions are crucial in determining the viewing behaviors of users and the perceived quality of the panorama.
We first transform an omnidirectional image to several video representations using different user viewing behaviors under different viewing conditions.
We then leverage advanced 2D full-reference video quality models to compute the perceived quality.
arXiv Detail & Related papers (2020-05-21T10:03:40Z)