SalFormer360: a transformer-based saliency estimation model for 360-degree videos
- URL: http://arxiv.org/abs/2602.04584v1
- Date: Wed, 04 Feb 2026 14:11:00 GMT
- Title: SalFormer360: a transformer-based saliency estimation model for 360-degree videos
- Authors: Mahmoud Z. A. Wahba, Francesco Barbato, Sara Baldoni, Federica Battisti,
- Abstract summary: We propose SalFormer360, a novel saliency estimation model for 360-degree videos built on a transformer-based architecture. Our approach is based on the combination of an existing encoder architecture, SegFormer, and a custom decoder. Experiments on the three largest benchmark datasets for saliency estimation demonstrate that SalFormer360 outperforms existing state-of-the-art methods.
- Score: 6.699918556514895
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Saliency estimation has received growing attention in recent years due to its importance in a wide range of applications. In the context of 360-degree video, it has been particularly valuable for tasks such as viewport prediction and immersive content optimization. In this paper, we propose SalFormer360, a novel saliency estimation model for 360-degree videos built on a transformer-based architecture. Our approach is based on the combination of an existing encoder architecture, SegFormer, and a custom decoder. The SegFormer model was originally developed for 2D segmentation tasks, and it has been fine-tuned to adapt it to 360-degree content. To further enhance prediction accuracy in our model, we incorporated Viewing Center Bias to reflect user attention in 360-degree environments. Extensive experiments on the three largest benchmark datasets for saliency estimation demonstrate that SalFormer360 outperforms existing state-of-the-art methods. In terms of Pearson Correlation Coefficient, our model achieves 8.4% higher performance on Sport360, 2.5% on PVS-HM, and 18.6% on VR-EyeTracking compared to previous state-of-the-art.
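The abstract reports its gains in terms of the Pearson Correlation Coefficient between predicted and ground-truth saliency maps. As a minimal sketch of that metric only (not the authors' evaluation code), the snippet below computes it for two equally sized 2D maps in plain Python. The function name `pearson_cc`, the toy 3x3 maps, and the choice to return 0.0 for a constant map are all assumptions made for illustration; the paper may apply additional steps, such as latitude weighting for equirectangular frames, that are not shown here.

```python
import math


def pearson_cc(pred, gt):
    """Pearson Correlation Coefficient between two equally sized 2D maps.

    Hypothetical helper for illustration; maps are lists of rows of floats.
    """
    # Flatten both maps so every pixel is compared with its counterpart.
    p = [v for row in pred for v in row]
    g = [v for row in gt for v in row]
    if len(p) != len(g) or not p:
        raise ValueError("maps must be non-empty and have the same number of pixels")

    mean_p = sum(p) / len(p)
    mean_g = sum(g) / len(g)

    # Covariance and variances (unnormalized; the 1/N factors cancel).
    cov = sum((a - mean_p) * (b - mean_g) for a, b in zip(p, g))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_g = sum((b - mean_g) ** 2 for b in g)

    # Assumption: a constant map is treated as uncorrelated with anything.
    if var_p == 0 or var_g == 0:
        return 0.0
    return cov / math.sqrt(var_p * var_g)


# Toy ground-truth and predicted maps (made-up values, not from any dataset).
gt = [[0.0, 0.1, 0.0], [0.2, 1.0, 0.3], [0.0, 0.1, 0.0]]
pred = [[0.1, 0.2, 0.1], [0.3, 0.9, 0.2], [0.0, 0.2, 0.1]]
score = pearson_cc(pred, gt)
```

The coefficient lies in [-1, 1] and is unchanged when a map is rescaled or shifted by a constant, so it measures how well the spatial pattern of a prediction matches the ground truth rather than its absolute intensity.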
Related papers
- RPG360: Robust 360 Depth Estimation with Perspective Foundation Models and Graph Optimization [48.99932182976206]
RPG360 is a training-free robust 360 monocular depth estimation method. We introduce a novel depth scale alignment technique using graph-based optimization. Our method achieves superior performance across diverse datasets, including Matterport3D, Stanford2D3D, and 360Loc.
arXiv Detail & Related papers (2025-09-28T17:33:12Z) - Sphere-GAN: a GAN-based Approach for Saliency Estimation in 360° Videos [5.66239168125163]
Saliency estimation provides a powerful tool to identify visually relevant areas. We introduce Sphere-GAN, a saliency detection model for 360° videos that leverages a Generative Adversarial Network with spherical convolutions.
arXiv Detail & Related papers (2025-09-15T14:07:33Z) - Spherical Vision Transformers for Audio-Visual Saliency Prediction in 360-Degree Videos [15.59763872743732]
This study extends the domain of saliency prediction to 360-degree environments, addressing the complexities of spherical distortion and the integration of spatial audio. Motivated by the lack of comprehensive datasets for 360-degree audio-visual saliency prediction, our study curates YT360-EyeTracking, a new dataset of 81 ODVs. Our goal is to explore how to utilize audio-visual cues to effectively predict visual saliency in 360-degree videos.
arXiv Detail & Related papers (2025-08-27T19:01:47Z) - MVSplat360: Feed-Forward 360 Scene Synthesis from Sparse Views [90.26609689682876]
We introduce MVSplat360, a feed-forward approach for 360° novel view synthesis (NVS) of diverse real-world scenes, using only sparse observations.
This setting is inherently ill-posed due to minimal overlap among input views and insufficient visual information provided.
Our model is end-to-end trainable and supports rendering arbitrary views with as few as 5 sparse input views.
arXiv Detail & Related papers (2024-11-07T17:59:31Z) - Depth Anywhere: Enhancing 360 Monocular Depth Estimation via Perspective Distillation and Unlabeled Data Augmentation [6.832852988957967]
We propose a new depth estimation framework that utilizes unlabeled 360-degree data effectively.
Our approach uses state-of-the-art perspective depth estimation models as teacher models to generate pseudo labels.
We tested our approach on benchmark datasets such as Matterport3D and Stanford2D3D, showing significant improvements in depth estimation accuracy.
arXiv Detail & Related papers (2024-06-18T17:59:31Z) - Sp2360: Sparse-view 360 Scene Reconstruction using Cascaded 2D Diffusion Priors [51.36238367193988]
We tackle sparse-view reconstruction of a 360 3D scene using priors from latent diffusion models (LDM).
We present SparseSplat360, a method that employs a cascade of in-painting and artifact removal models to fill in missing details and clean novel views.
Our method generates entire 360 scenes from as few as 9 input views, with a high degree of foreground and background detail.
arXiv Detail & Related papers (2024-05-26T11:01:39Z) - Blind VQA on 360° Video via Progressively Learning from Pixels, Frames and Video [66.57045901742922]
Blind visual quality assessment (BVQA) on 360° video plays a key role in optimizing immersive multimedia systems.
In this paper, we take into account the progressive paradigm of human perception towards spherical video quality.
We propose a novel BVQA approach (namely ProVQA) for 360° video via progressively learning from pixels, frames and video.
arXiv Detail & Related papers (2021-11-18T03:45:13Z) - Is Space-Time Attention All You Need for Video Understanding? [50.78676438502343]
We present a convolution-free approach to video classification built exclusively on self-attention over space and time.
"TimeSformer" adapts the standard Transformer architecture to video by enabling feature learning from a sequence of frame-level patches.
TimeSformer achieves state-of-the-art results on several major action recognition benchmarks.
arXiv Detail & Related papers (2021-02-09T19:49:33Z) - ATSal: An Attention Based Architecture for Saliency Prediction in 360 Videos [5.831115928056554]
This paper proposes ATSal, a novel attention-based (head-eye) saliency model for 360-degree videos.
We compare the proposed approach to other state-of-the-art saliency models on two datasets: Salient360! and VR-EyeTracking.
Experimental results on over 80 ODV videos (75K+ frames) show that the proposed method outperforms the existing state-of-the-art.
arXiv Detail & Related papers (2020-11-20T19:19:48Z) - Deep Learning for Content-based Personalized Viewport Prediction of 360-Degree VR Videos [72.08072170033054]
In this paper, a deep learning network is introduced to leverage position data as well as video frame content to predict future head movement.
To optimize the data fed into this neural network, the data sample rate, reduced data, and long-period prediction length are also explored.
arXiv Detail & Related papers (2020-03-01T07:31:50Z)