Visual Saliency Transformer
- URL: http://arxiv.org/abs/2104.12099v1
- Date: Sun, 25 Apr 2021 08:24:06 GMT
- Title: Visual Saliency Transformer
- Authors: Nian Liu and Ni Zhang and Kaiyuan Wan and Junwei Han and Ling Shao
- Abstract summary: We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
- Score: 127.33678448761599
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, many saliency detection methods have achieved promising results
by relying on CNN-based architectures. Alternatively, we rethink this task from
a convolution-free sequence-to-sequence perspective and predict saliency by
modeling long-range dependencies, which cannot be achieved by convolution.
Specifically, we develop a novel unified model based on a pure transformer,
namely, Visual Saliency Transformer (VST), for both RGB and RGB-D salient
object detection (SOD). It takes image patches as inputs and leverages the
transformer to propagate global contexts among image patches. Apart from the
traditional transformer architecture used in Vision Transformer (ViT), we
leverage multi-level token fusion and propose a new token upsampling method
under the transformer framework to get high-resolution detection results. We
also develop a token-based multi-task decoder to simultaneously perform
saliency and boundary detection by introducing task-related tokens and a novel
patch-task-attention mechanism. Experimental results show that our model
outperforms existing state-of-the-art results on both RGB and RGB-D SOD
benchmark datasets. Most importantly, our whole framework not only provides a
new perspective for the SOD field but also shows a new paradigm for
transformer-based dense prediction models.
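The abstract's "patch-task-attention" mechanism is not specified in detail here. The following is a minimal pure-Python sketch of one plausible reading, assuming each patch token acts as a query that attends over a small set of task tokens (e.g. one saliency token and one boundary token) serving as keys and values; the function and variable names are illustrative, not the paper's.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    # Plain dot product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))

def patch_task_attention(patch_tokens, task_tokens):
    """Toy patch-task attention (illustrative, not the paper's exact method).

    patch_tokens: list of d-dim vectors, one per image patch (queries)
    task_tokens:  list of d-dim vectors, e.g. [saliency_token, boundary_token]
                  (keys and values)
    Returns one attended d-dim feature per patch, a convex combination
    of the task tokens weighted by scaled dot-product attention.
    """
    d = len(task_tokens[0])
    scale = 1.0 / math.sqrt(d)
    attended_features = []
    for q in patch_tokens:
        scores = [dot(q, k) * scale for k in task_tokens]
        weights = softmax(scores)
        attended = [sum(w * v[i] for w, v in zip(weights, task_tokens))
                    for i in range(d)]
        attended_features.append(attended)
    return attended_features
```

In this reading, the attended per-patch features would then be projected to per-patch saliency and boundary predictions; the real model also fuses multi-level tokens and upsamples them to high resolution, which this sketch omits.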
Related papers
- VST++: Efficient and Stronger Visual Saliency Transformer [74.26078624363274]
We develop an efficient and stronger VST++ model to explore global long-range dependencies.
We evaluate our model across various transformer-based backbones on RGB, RGB-D, and RGB-T SOD benchmark datasets.
arXiv Detail & Related papers (2023-10-18T05:44:49Z) - SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient
object detection [12.126413875108993]
We propose a cross-modality fusion model SwinNet for RGB-D and RGB-T salient object detection.
The proposed model outperforms the state-of-the-art models on RGB-D and RGB-T datasets.
arXiv Detail & Related papers (2022-04-12T07:37:39Z) - Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
arXiv Detail & Related papers (2022-03-15T06:52:25Z) - TransCMD: Cross-Modal Decoder Equipped with Transformer for RGB-D
Salient Object Detection [86.94578023985677]
In this work, we rethink this task from the perspective of global information alignment and transformation.
Specifically, the proposed method (TransCMD) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path.
Experimental results on seven RGB-D SOD benchmark datasets demonstrate that a simple two-stream encoder-decoder framework can surpass the state-of-the-art purely CNN-based methods.
arXiv Detail & Related papers (2021-12-04T15:45:34Z) - PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered solving vision tasks with transformers: it directly translates the image feature map into the object detection result.
The approach also shows a consistent efficiency gain on the recent transformer-based image recognition model ViT.
arXiv Detail & Related papers (2021-09-15T01:10:30Z) - Transformer Transforms Salient Object Detection and Camouflaged Object
Detection [43.79585695098729]
We conduct research on applying transformer networks to salient object detection (SOD).
Specifically, we adopt the dense transformer backbone for fully supervised RGB image based SOD, RGB-D image pair based SOD, and weakly supervised SOD via scribble supervision.
As an extension, we also apply our fully supervised model to the task of camouflaged object detection (COD) for camouflaged object segmentation.
arXiv Detail & Related papers (2021-04-20T17:12:51Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper which applies transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
arXiv Detail & Related papers (2020-12-17T22:33:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.