Transformer Transforms Salient Object Detection and Camouflaged Object
Detection
- URL: http://arxiv.org/abs/2104.10127v1
- Date: Tue, 20 Apr 2021 17:12:51 GMT
- Title: Transformer Transforms Salient Object Detection and Camouflaged Object
Detection
- Authors: Yuxin Mao, Jing Zhang, Zhexiong Wan, Yuchao Dai, Aixuan Li, Yunqiu Lv,
Xinyu Tian, Deng-Ping Fan, and Nick Barnes
- Abstract summary: We conduct research on applying transformer networks to salient object detection (SOD).
Specifically, we adopt the dense transformer backbone for fully supervised RGB image-based SOD, RGB-D image pair-based SOD, and weakly supervised SOD via scribble supervision.
As an extension, we also apply our fully supervised model to the task of camouflaged object detection (COD) for camouflaged object segmentation.
- Score: 43.79585695098729
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer networks, which originated in machine translation, are
particularly good at modeling long-range dependencies within a long sequence.
Currently, transformer networks are making revolutionary progress in various
vision tasks, ranging from high-level classification to low-level dense
prediction. In this paper, we conduct research on applying transformer networks
to salient object detection (SOD). Specifically, we adopt the dense transformer
backbone for fully supervised RGB image-based SOD, RGB-D image pair-based SOD,
and weakly supervised SOD via scribble supervision. As an extension, we also
apply our fully supervised model to the task of camouflaged object detection
(COD) for camouflaged object segmentation. For the fully supervised models, we
use the dense transformer backbone as the feature encoder and design a very
simple decoder to produce a one-channel saliency map (or camouflage map for the
COD task). For the weakly supervised model, as scribble annotations contain no
structure information, we first adopt the recently proposed Gated-CRF loss to
effectively model pair-wise relationships for accurate model prediction. Then,
we introduce a self-supervised learning strategy that pushes the model to
produce scale-invariant predictions, which proves effective for weakly
supervised models and for models trained on small training datasets. Extensive
experimental results on various SOD and COD tasks (fully supervised RGB
image-based SOD, fully supervised RGB-D image pair-based SOD, weakly supervised
SOD via scribble supervision, and fully supervised RGB image-based COD)
illustrate that transformer networks can transform salient object detection and
camouflaged object detection, leading to new benchmarks for each related task.
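As a concrete reading of the pipeline the abstract describes (a dense transformer backbone as the feature encoder, followed by a very simple decoder that outputs a one-channel saliency or camouflage map), here is a minimal PyTorch-style sketch. It is not the authors' code: the patch size, embedding dimension, depth, head count, and decoder layout are illustrative assumptions, and the class names are hypothetical.

```python
# Minimal sketch (not the authors' code) of a dense transformer encoder plus a
# very simple decoder producing a one-channel saliency / camouflage map.
# All hyper-parameters below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchEmbed(nn.Module):
    """Split the image into non-overlapping patches and linearly embed them."""
    def __init__(self, patch=16, in_ch=3, dim=384):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        x = self.proj(x)                          # B x dim x H/p x W/p
        return x.flatten(2).transpose(1, 2)       # B x N x dim (patch tokens)


class TransformerSaliency(nn.Module):
    """Dense transformer backbone as encoder, very simple decoder as head."""
    def __init__(self, img_size=224, patch=16, dim=384, depth=6, heads=6):
        super().__init__()
        self.grid = img_size // patch
        self.embed = PatchEmbed(patch, 3, dim)
        self.pos = nn.Parameter(torch.zeros(1, self.grid ** 2, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # "very simple decoder": two convs, then bilinear upsampling to input size
        self.decoder = nn.Sequential(
            nn.Conv2d(dim, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1))

    def _pos(self, n_tokens):
        # Resize the learned positional grid when the input resolution changes,
        # so the same model can also be fed rescaled images (used further below).
        g = int(n_tokens ** 0.5)
        if g == self.grid:
            return self.pos
        p = self.pos.transpose(1, 2).reshape(1, -1, self.grid, self.grid)
        p = F.interpolate(p, size=(g, g), mode='bilinear', align_corners=False)
        return p.flatten(2).transpose(1, 2)

    def forward(self, x):
        b, _, h, w = x.shape
        tokens = self.embed(x)
        tokens = self.encoder(tokens + self._pos(tokens.shape[1]))
        g = int(tokens.shape[1] ** 0.5)
        fmap = tokens.transpose(1, 2).reshape(b, -1, g, g)   # back to a 2-D grid
        logits = self.decoder(fmap)                          # B x 1 x g x g
        return F.interpolate(logits, size=(h, w),
                             mode='bilinear', align_corners=False)
```

The abstract also mentions a self-supervised strategy that pushes the model toward scale-invariant predictions in the weakly supervised (scribble) setting. One possible form of such a consistency term is sketched below; the paper's exact formulation, and the Gated-CRF pair-wise loss it is combined with, follow the cited works rather than this snippet.

```python
# Sketch of a scale-invariance consistency term (one possible formulation, not
# necessarily the paper's): the prediction on a rescaled image should agree with
# the rescaled prediction on the original image.
def scale_consistency_loss(model, image, scale=0.5):
    with torch.no_grad():                         # stop-gradient on the reference
        full = torch.sigmoid(model(image))        # B x 1 x H x W
    small = F.interpolate(image, scale_factor=scale,
                          mode='bilinear', align_corners=False)
    pred_small = torch.sigmoid(model(small))      # B x 1 x sH x sW
    target = F.interpolate(full, size=pred_small.shape[-2:],
                           mode='bilinear', align_corners=False)
    return F.l1_loss(pred_small, target)


# Example usage (hypothetical setup):
# model = TransformerSaliency()
# images = torch.randn(2, 3, 224, 224)
# loss = scale_consistency_loss(model, images)
```

In training, a term like this would typically be added to a partial cross-entropy on the scribbled pixels together with the Gated-CRF loss; the stop-gradient on the full-resolution reference is one possible design choice, not necessarily the authors'.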
Related papers
- VST++: Efficient and Stronger Visual Saliency Transformer [74.26078624363274]
We develop an efficient and stronger VST++ model to explore global long-range dependencies.
We evaluate our model across various transformer-based backbones on RGB, RGB-D, and RGB-T SOD benchmark datasets.
arXiv Detail & Related papers (2023-10-18T05:44:49Z)
- Visual Transformer for Object Detection [0.0]
We consider the use of self-attention for a discriminative visual task, object detection, as an alternative to convolutions.
Our model leads to consistent improvements in object detection on COCO across many different models and scales.
arXiv Detail & Related papers (2022-06-01T06:13:09Z)
- BTranspose: Bottleneck Transformers for Human Pose Estimation with Self-Supervised Pre-Training [0.304585143845864]
In this paper, we consider the recently proposed Bottleneck Transformers, which combine CNN and multi-head self attention (MHSA) layers effectively.
We consider different backbone architectures and pre-train them using the DINO self-supervised learning method.
Experiments show that our model achieves an AP of 76.4, which is competitive with other methods such as [1] and has fewer network parameters.
arXiv Detail & Related papers (2022-04-21T15:45:05Z)
- ViDT: An Efficient and Effective Fully Transformer-based Object Detector [97.71746903042968]
Detection transformers are the first fully end-to-end learning systems for object detection.
Vision transformers are the first fully transformer-based architectures for image classification.
In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector.
arXiv Detail & Related papers (2021-10-08T06:32:05Z)
- PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered the solution of vision tasks with transformers; it directly translates the image feature map into the object detection result.
Applied to a recent transformer-based image recognition model, the approach shows consistent efficiency gains.
arXiv Detail & Related papers (2021-09-15T01:10:30Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
- End-to-End Trainable Multi-Instance Pose Estimation with Transformers [68.93512627479197]
We propose a new end-to-end trainable approach for multi-instance pose estimation by combining a convolutional neural network with a transformer.
Inspired by recent work on end-to-end trainable object detection with transformers, we use a transformer encoder-decoder architecture together with a bipartite matching scheme to directly regress the pose of all individuals in a given image.
Our model, called POse Estimation Transformer (POET), is trained using a novel set-based global loss that consists of a keypoint loss, a keypoint visibility loss, a center loss and a class loss.
arXiv Detail & Related papers (2021-03-22T18:19:22Z)
- Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
arXiv Detail & Related papers (2020-12-17T22:33:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.