UniST: Towards Unifying Saliency Transformer for Video Saliency
Prediction and Detection
- URL: http://arxiv.org/abs/2309.08220v1
- Date: Fri, 15 Sep 2023 07:39:53 GMT
- Title: UniST: Towards Unifying Saliency Transformer for Video Saliency
Prediction and Detection
- Authors: Junwen Xiong, Peng Zhang, Chuanyue Li, Wei Huang, Yufei Zha, Tao You
- Abstract summary: We introduce the Unified Saliency Transformer (UniST) framework, which comprehensively utilizes the essential attributes of video saliency prediction and video salient object detection.
To the best of our knowledge, this is the first work that explores designing a transformer structure for both saliency modeling tasks.
- Score: 9.063895463649414
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video saliency prediction and detection are thriving research domains that
enable computers to simulate the distribution of visual attention akin to how
humans perceive dynamic scenes. While many approaches have crafted
task-specific training paradigms for either video saliency prediction or video
salient object detection, little attention has been devoted to devising a
generalized saliency modeling framework that seamlessly bridges both these
distinct tasks. In this study, we introduce the Unified Saliency Transformer
(UniST) framework, which comprehensively utilizes the essential attributes of
video saliency prediction and video salient object detection. In addition to
extracting representations of frame sequences, a saliency-aware transformer is
designed to learn the spatio-temporal representations at progressively
increased resolutions, while incorporating effective cross-scale saliency
information to produce a robust representation. Furthermore, a task-specific
decoder is proposed to perform the final prediction for each task. To the best
of our knowledge, this is the first work that explores designing a transformer
structure for both saliency modeling tasks. Convincing experiments demonstrate
that the proposed UniST achieves superior performance across seven challenging
benchmarks for the two tasks, and significantly outperforms other
state-of-the-art methods.
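The abstract describes a shared saliency-aware transformer that fuses coarse and fine spatio-temporal tokens before splitting into task-specific decoders for video saliency prediction (VSP) and video salient object detection (VSOD). As a rough illustration of that kind of design, here is a minimal PyTorch sketch; every module name, stride, and dimension below is an assumption for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of a UniST-style architecture; all details assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SaliencyAwareBlock(nn.Module):
    """One pre-norm transformer block over flattened space-time tokens."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):  # x: (B, N, dim), N = T * h * w tokens
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class UniSTSketch(nn.Module):
    """Shared coarse-to-fine transformer with two task-specific heads."""

    def __init__(self, dim: int = 64):
        super().__init__()
        # Patch embeddings at two resolutions (strides are assumptions).
        self.embed_coarse = nn.Conv2d(3, dim, kernel_size=8, stride=8)
        self.embed_fine = nn.Conv2d(3, dim, kernel_size=4, stride=4)
        self.block_coarse = SaliencyAwareBlock(dim)
        self.block_fine = SaliencyAwareBlock(dim)
        self.fuse = nn.Linear(2 * dim, dim)  # cross-scale fusion
        # Task-specific decoders: saliency heatmap vs. salient-object mask.
        self.head_vsp = nn.Conv2d(dim, 1, kernel_size=1)
        self.head_vsod = nn.Conv2d(dim, 1, kernel_size=1)

    def _tokens(self, video, embed):
        """Embed every frame, then flatten to (B, T*h*w, dim) tokens."""
        b, t = video.shape[:2]
        f = embed(video.flatten(0, 1))  # (B*T, dim, h, w)
        h, w = f.shape[-2:]
        return f.flatten(2).transpose(1, 2).reshape(b, t * h * w, -1), (h, w)

    def forward(self, video):  # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        coarse, (hc, wc) = self._tokens(video, self.embed_coarse)
        fine, (hf, wf) = self._tokens(video, self.embed_fine)
        coarse = self.block_coarse(coarse)
        # Upsample coarse tokens onto the fine grid, then fuse both scales.
        cmap = coarse.reshape(b, t, hc, wc, -1).permute(0, 1, 4, 2, 3)
        cmap = F.interpolate(cmap.flatten(0, 1), size=(hf, wf),
                             mode="bilinear", align_corners=False)
        cup = cmap.flatten(2).transpose(1, 2).reshape(b, t * hf * wf, -1)
        fine = self.block_fine(self.fuse(torch.cat([fine, cup], dim=-1)))
        fmap = fine.reshape(b * t, hf, wf, -1).permute(0, 3, 1, 2)

        def up(x):  # decode each frame back to the input resolution
            return F.interpolate(x, size=video.shape[-2:], mode="bilinear",
                                 align_corners=False)

        vsp = up(self.head_vsp(fmap)).reshape(b, t, *video.shape[-2:])
        vsod = up(self.head_vsod(fmap)).reshape(b, t, *video.shape[-2:])
        return vsp.sigmoid(), vsod.sigmoid()


clip = torch.randn(1, 4, 3, 64, 64)  # toy clip: (B, T, C, H, W)
sal_map, obj_mask = UniSTSketch()(clip)
print(sal_map.shape, obj_mask.shape)  # both (1, 4, 64, 64)
```

The two heads share every transformer parameter and differ only in their final projections, which is one plausible reading of "task-specific decoder" on top of a unified backbone; the real model is presumably far deeper and uses more scales.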
Related papers
- Object-Centric Temporal Consistency via Conditional Autoregressive Inductive Biases [69.46487306858789]
Conditional Autoregressive Slot Attention (CA-SA) is a framework that enhances the temporal consistency of extracted object-centric representations in video-centric vision tasks.
We present qualitative and quantitative results showing that our proposed method outperforms the considered baselines on downstream tasks.
arXiv Detail & Related papers (2024-10-21T07:44:44Z) - DeTra: A Unified Model for Object Detection and Trajectory Forecasting [68.85128937305697]
Our approach formulates the union of the two tasks as a trajectory refinement problem.
To tackle this unified task, we design a refinement transformer that infers the presence, pose, and multi-modal future behaviors of objects.
In our experiments, we observe that our model outperforms the state of the art on the Argoverse 2 Sensor and Open datasets.
arXiv Detail & Related papers (2024-06-06T18:12:04Z) - TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection [23.73648235283315]
Task-oriented object detection aims to find objects suitable for accomplishing specific tasks.
Recent solutions are mainly all-in-one models.
We propose TaskCLIP, a more natural two-stage design composed of general object detection and task-guided object selection.
arXiv Detail & Related papers (2024-03-12T22:33:02Z) - What Matters When Repurposing Diffusion Models for General Dense Perception Tasks? [49.84679952948808]
Recent works show promising results by simply fine-tuning T2I diffusion models for dense perception tasks.
We conduct a thorough investigation into critical factors that affect transfer efficiency and performance when using diffusion priors.
Our work culminates in the development of GenPercept, an effective deterministic one-step fine-tuning paradigm tailored for dense visual perception tasks.
arXiv Detail & Related papers (2024-03-10T04:23:24Z) - What Makes Pre-Trained Visual Representations Successful for Robust
Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z) - A Dynamic Feature Interaction Framework for Multi-task Visual Perception [100.98434079696268]
We devise an efficient unified framework to solve multiple common perception tasks.
These tasks include instance segmentation, semantic segmentation, monocular 3D detection, and depth estimation.
Our proposed framework, termed D2BNet, demonstrates a unique approach to parameter-efficient predictions for multi-task perception.
arXiv Detail & Related papers (2023-06-08T09:24:46Z) - STDepthFormer: Predicting Spatio-temporal Depth from Video with a
Self-supervised Transformer Model [0.0]
A self-supervised model that simultaneously predicts a sequence of future frames from video input with a spatio-temporal attention network is proposed.
The proposed model leverages prior scene knowledge such as object shape and texture, similar to single-image depth inference methods.
It is implicitly capable of forecasting the motion of objects in the scene, rather than requiring complex models involving multi-object detection, segmentation and tracking.
arXiv Detail & Related papers (2023-03-02T12:22:51Z) - A Study on Self-Supervised Object Detection Pretraining [14.38896715041483]
We study different approaches to self-supervised pretraining of object detection models.
We first design a general framework to learn a spatially consistent dense representation from an image.
We study existing design choices in the literature, such as box generation, feature extraction strategies, and using multiple views.
arXiv Detail & Related papers (2022-07-09T03:30:44Z) - Recent Trends in 2D Object Detection and Applications in Video Event
Recognition [0.76146285961466]
We discuss the pioneering works in object detection, followed by the recent breakthroughs that employ deep learning.
We highlight recent datasets for 2D object detection both in images and videos, and present a comparative performance summary of various state-of-the-art object detection techniques.
arXiv Detail & Related papers (2022-02-07T14:15:11Z) - Self-Supervision by Prediction for Object Discovery in Videos [62.87145010885044]
In this paper, we use the prediction task as self-supervision and build a novel object-centric model for image sequence representation.
Our framework can be trained without the help of any manual annotation or pretrained network.
Initial experiments confirm that the proposed pipeline is a promising step towards object-centric video prediction.
arXiv Detail & Related papers (2021-03-09T19:14:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.