TSalV360: A Method and Dataset for Text-driven Saliency Detection in 360-Degrees Videos
- URL: http://arxiv.org/abs/2509.26208v1
- Date: Tue, 30 Sep 2025 13:11:16 GMT
- Title: TSalV360: A Method and Dataset for Text-driven Saliency Detection in 360-Degrees Videos
- Authors: Ioannis Kontostathis, Evlampios Apostolidis, Vasileios Mezaris
- Abstract summary: We deal with the task of text-driven saliency detection in 360-degrees videos. We introduce the TSV360 dataset which includes 16,000 triplets of ERP frames. Following, we adapt a SOTA visual-based approach for 360-degrees video saliency detection.
- Score: 5.531123091747035
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we deal with the task of text-driven saliency detection in 360-degrees videos. For this, we introduce the TSV360 dataset, which includes 16,000 triplets of ERP frames, textual descriptions of salient objects/events in these frames, and the associated ground-truth saliency maps. Next, we extend and adapt a SOTA visual-based approach for 360-degrees video saliency detection and develop the TSalV360 method, which takes into account a user-provided text description of the desired objects and/or events. This method leverages a SOTA vision-language model for data representation and integrates a similarity estimation module and a viewport spatio-temporal cross-attention mechanism to discover dependencies between the different data modalities. Quantitative and qualitative evaluations using the TSV360 dataset showed the competitiveness of TSalV360 compared to a SOTA visual-based approach and documented its competency to perform customized text-driven saliency detection in 360-degrees videos.
Related papers
- Text-based Aerial-Ground Person Retrieval [55.31140361809554]
This work introduces Text-based Aerial-Ground Person Retrieval (TAG-PR). It aims to retrieve person images from heterogeneous aerial and ground views with textual descriptions.
arXiv Detail & Related papers (2025-11-11T15:49:04Z)
- Spherical Vision Transformers for Audio-Visual Saliency Prediction in 360-Degree Videos [15.59763872743732]
This study extends the domain of saliency prediction to 360-degree environments, addressing the complexities of spherical distortion and the integration of spatial audio. Motivated by the lack of comprehensive datasets for 360-degree audio-visual saliency prediction, our study curates YT360-EyeTracking, a new dataset of 81 ODVs. Our goal is to explore how to utilize audio-visual cues to effectively predict visual saliency in 360-degree videos.
arXiv Detail & Related papers (2025-08-27T19:01:47Z)
- Beyond Simple Edits: Composed Video Retrieval with Dense Modifications [96.46069692338645]
We introduce a novel dataset that captures both fine-grained and composed actions across diverse video segments. Dense-WebVid-CoVR consists of 1.6 million samples with dense modification text that is around seven times more than its existing counterpart. We develop a new model that integrates visual and textual information through Cross-Attention (CA) fusion.
arXiv Detail & Related papers (2025-08-19T17:59:39Z)
- MVSplat360: Feed-Forward 360 Scene Synthesis from Sparse Views [90.26609689682876]
We introduce MVSplat360, a feed-forward approach for 360° novel view synthesis (NVS) of diverse real-world scenes, using only sparse observations.
This setting is inherently ill-posed due to minimal overlap among input views and insufficient visual information provided.
Our model is end-to-end trainable and supports rendering arbitrary views with as few as 5 sparse input views.
arXiv Detail & Related papers (2024-11-07T17:59:31Z)
- A Human-Annotated Video Dataset for Training and Evaluation of 360-Degree Video Summarization Methods [6.076406622352117]
We introduce a new dataset for 360-degree video summarization: the transformation of 360-degree video content to concise 2D-video summaries.
The dataset includes ground-truth human-generated summaries that can be used for training and objectively evaluating 360-degree video summarization methods.
arXiv Detail & Related papers (2024-06-05T06:43:48Z)
- 360VOTS: Visual Object Tracking and Segmentation in Omnidirectional Videos [16.372814014632944]
We propose a comprehensive dataset and benchmark that incorporates a new component called omnidirectional video object segmentation (360VOS). 360VOS dataset includes 290 sequences accompanied by dense pixel-wise masks and covers a broader range of target categories. We benchmark state-of-the-art approaches and demonstrate the effectiveness of our proposed 360 tracking framework and training dataset.
arXiv Detail & Related papers (2024-04-22T07:54:53Z)
- Multi-Sentence Grounding for Long-term Instructional Video [63.27905419718045]
We aim to establish an automatic, scalable pipeline for denoising a large-scale instructional dataset.
We construct a high-quality video-text dataset with multiple descriptive steps supervision, named HowToStep.
arXiv Detail & Related papers (2023-12-21T17:28:09Z)
- An Integrated System for Spatio-Temporal Summarization of 360-degrees Videos [6.8292720972215974]
We present an integrated system for summarization of 360-degrees videos.
The video production mainly involves the detection of events and their synopsis into a concise summary.
The analysis relies on state-of-the-art methods for saliency detection in 360-degrees video.
arXiv Detail & Related papers (2023-12-05T08:48:31Z)
- ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance [48.748738590964216]
We propose ViewRefer, a multi-view framework for 3D visual grounding.
For the text branch, ViewRefer expands a single grounding text to multiple geometry-consistent descriptions.
In the 3D modality, a transformer fusion module with inter-view attention is introduced to boost the interaction of objects across views.
arXiv Detail & Related papers (2023-03-29T17:59:10Z)
- Multi-Projection Fusion and Refinement Network for Salient Object Detection in 360° Omnidirectional Image [141.10227079090419]
We propose a Multi-Projection Fusion and Refinement Network (MPFR-Net) to detect the salient objects in 360° omnidirectional image.
MPFR-Net uses the equirectangular projection image and four corresponding cube-unfolding images as inputs.
Experimental results on two omnidirectional datasets demonstrate that the proposed approach outperforms the state-of-the-art methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2022-12-23T14:50:40Z)
- KITTI-360: A Novel Dataset and Benchmarks for Urban Scene Understanding in 2D and 3D [67.50776195828242]
KITTI-360 is a suburban driving dataset which comprises richer input modalities, comprehensive semantic instance annotations and accurate localization.
For efficient annotation, we created a tool to label 3D scenes with bounding primitives, resulting in over 150k semantic and instance annotated images and 1B annotated 3D points.
We established benchmarks and baselines for several tasks relevant to mobile perception, encompassing problems from computer vision, graphics, and robotics on the same dataset.
arXiv Detail & Related papers (2021-09-28T00:41:29Z)
- A Fixation-based 360° Benchmark Dataset for Salient Object Detection [21.314578493964333]
Fixation prediction (FP) in panoramic contents has been widely investigated along with the booming trend of virtual reality (VR) applications.
Salient object detection (SOD) has been seldom explored in 360° images due to the lack of datasets representative of real scenes.
arXiv Detail & Related papers (2020-01-22T11:16:39Z)
- Visual Question Answering on 360° Images [96.00046925811515]
VQA 360 is a novel task of visual question answering on 360 images.
We collect the first VQA 360 dataset, containing around 17,000 real-world image-question-answer triplets for a variety of question types.
arXiv Detail & Related papers (2020-01-10T08:18:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.