Learning Generative Vision Transformer with Energy-Based Latent Space
for Saliency Prediction
- URL: http://arxiv.org/abs/2112.13528v1
- Date: Mon, 27 Dec 2021 06:04:33 GMT
- Title: Learning Generative Vision Transformer with Energy-Based Latent Space
for Saliency Prediction
- Authors: Jing Zhang, Jianwen Xie, Nick Barnes, Ping Li
- Abstract summary: We propose a novel vision transformer with latent variables following an informative energy-based prior for salient object detection.
Both the vision transformer network and the energy-based prior model are jointly trained via Markov chain Monte Carlo-based maximum likelihood estimation.
With the generative vision transformer, we can easily obtain a pixel-wise uncertainty map from an image, which indicates the model confidence in predicting saliency from the image.
- Score: 51.80191416661064
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformer networks have shown superiority in many computer vision
tasks. In this paper, we take a step further by proposing a novel generative
vision transformer with latent variables following an informative energy-based
prior for salient object detection. Both the vision transformer network and the
energy-based prior model are jointly trained via Markov chain Monte Carlo-based
maximum likelihood estimation, in which sampling from the intractable
posterior and prior distributions of the latent variables is performed by
Langevin dynamics. Further, with the generative vision transformer, we can
easily obtain a pixel-wise uncertainty map from an image, which indicates the
model confidence in predicting saliency from the image. Unlike existing
generative models, which define the prior distribution of the latent
variables as a simple isotropic Gaussian, our model uses an informative
energy-based prior that is more expressive in capturing the latent space of
the data. We apply the proposed framework to both RGB and RGB-D
salient object detection tasks. Extensive experimental results show that our
framework can achieve not only accurate saliency predictions but also
meaningful uncertainty maps that are consistent with human perception.
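The abstract names Langevin dynamics but gives no implementation details, so the following PyTorch-style sketch is only illustrative: it draws approximate samples from an energy-based prior over a latent vector via unadjusted Langevin updates, with the energy network, step size, and step count chosen as assumptions rather than taken from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical energy network E(z) over the latent vector; the architecture
# used in the paper is not specified in this listing.
class PriorEnergy(nn.Module):
    def __init__(self, latent_dim=32, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z):
        return self.net(z).squeeze(-1)  # one scalar energy per latent sample


def langevin_sample_prior(energy, z, n_steps=60, step_size=0.1):
    """Unadjusted Langevin dynamics targeting p(z) proportional to
    exp(-E(z)) * N(z; 0, I). Hyperparameters are placeholders, not the
    paper's settings."""
    for _ in range(n_steps):
        z = z.detach().requires_grad_(True)
        # Negative log-density of the EBM-corrected Gaussian prior (up to a constant).
        neg_log_p = energy(z).sum() + 0.5 * (z ** 2).sum()
        grad = torch.autograd.grad(neg_log_p, z)[0]
        z = z - 0.5 * step_size ** 2 * grad + step_size * torch.randn_like(z)
    return z.detach()


# Usage: initialize from the reference Gaussian and refine with Langevin steps.
energy = PriorEnergy(latent_dim=32)
z = langevin_sample_prior(energy, torch.randn(8, 32))
```

Posterior sampling during training would add the gradient of the decoder's log-likelihood term to the same update; that part is omitted from this sketch.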
Related papers
- LaVin-DiT: Large Vision Diffusion Transformer [99.98106406059333]
LaVin-DiT is a scalable and unified foundation model designed to tackle over 20 computer vision tasks in a generative framework.
We introduce key innovations to optimize generative performance for vision tasks.
The model is scaled from 0.1B to 3.4B parameters, demonstrating substantial scalability and state-of-the-art performance across diverse vision tasks.
arXiv Detail & Related papers (2024-11-18T12:05:27Z) - Variational Potential Flow: A Novel Probabilistic Framework for Energy-Based Generative Modelling [10.926841288976684]
We present a novel energy-based generative framework, Variational Potential Flow (VAPO).
VAPO aims to learn a potential energy function whose gradient (flow) guides the prior samples, so that their density evolution closely follows an approximate data likelihood homotopy.
After training the potential energy function, images can be generated by initializing samples from a Gaussian prior and solving the ODE governing the potential flow over a fixed time interval.
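As a rough illustration of the generation recipe described above (the actual vector field, solver, and time parameterization are not given in this summary), one can integrate a gradient flow driven by a learned potential with a simple Euler scheme:

```python
import torch

def generate_by_potential_flow(potential, x, t0=0.0, t1=1.0, n_steps=100):
    """Euler integration of dx/dt = -grad_x phi(x, t) over [t0, t1], as a
    stand-in for the potential-flow ODE named in the VAPO summary."""
    dt = (t1 - t0) / n_steps
    for i in range(n_steps):
        x = x.detach().requires_grad_(True)
        t = torch.full((x.shape[0],), t0 + i * dt)
        grad = torch.autograd.grad(potential(x, t).sum(), x)[0]
        x = x - dt * grad  # follow the negative potential gradient
    return x.detach()

# Usage with a toy quadratic potential; samples are initialized from a Gaussian prior.
toy_potential = lambda x, t: 0.5 * (x ** 2).sum(dim=1)
samples = generate_by_potential_flow(toy_potential, torch.randn(16, 2))
```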
arXiv Detail & Related papers (2024-07-21T18:08:12Z) - An Energy-Based Prior for Generative Saliency [62.79775297611203]
We propose a novel generative saliency prediction framework that adopts an informative energy-based model as a prior distribution.
With the generative saliency model, we can obtain a pixel-wise uncertainty map from an image, indicating model confidence in the saliency prediction.
Experimental results show that our generative saliency model with an energy-based prior can achieve not only accurate saliency predictions but also reliable uncertainty maps consistent with human perception.
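The summary does not describe how the uncertainty map is computed; a common recipe for latent-variable saliency models, shown here purely as an assumed sketch, is to decode several latent samples and report the per-pixel variance of the resulting saliency maps.

```python
import torch

def pixelwise_uncertainty(decoder, image_feats, latent_dim=32, n_samples=10):
    """Hypothetical helper: decode multiple latent samples into saliency maps,
    return their mean as the prediction and the per-pixel variance as the
    uncertainty map. `decoder(image_feats, z)` is an assumed interface, not
    the paper's API."""
    maps = []
    for _ in range(n_samples):
        z = torch.randn(image_feats.shape[0], latent_dim)    # stand-in prior sample
        maps.append(torch.sigmoid(decoder(image_feats, z)))  # saliency in [0, 1]
    maps = torch.stack(maps, dim=0)  # (S, B, 1, H, W)
    return maps.mean(dim=0), maps.var(dim=0)
```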
arXiv Detail & Related papers (2022-04-19T10:51:00Z) - Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z) - Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, the Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
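The entry above only states that image patches become transformer tokens that exchange global context; the snippet below is a generic illustration of that idea, with layer sizes and depth as arbitrary assumptions rather than the VST architecture.

```python
import torch
import torch.nn as nn

class PatchTransformerBackbone(nn.Module):
    """Minimal illustration of the patch-token idea: split the image into
    patches, embed each patch as a token, and let a transformer encoder
    propagate global context among the tokens."""
    def __init__(self, in_ch=3, dim=192, patch=16, depth=4, heads=3):
        super().__init__()
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        tokens = self.embed(x).flatten(2).transpose(1, 2)  # (B, N_patches, dim)
        return self.encoder(tokens)                        # globally mixed tokens

# Usage: a 224x224 RGB image becomes 14x14 = 196 patch tokens.
feats = PatchTransformerBackbone()(torch.randn(1, 3, 224, 224))
```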
arXiv Detail & Related papers (2021-04-25T08:24:06Z) - Transformer Transforms Salient Object Detection and Camouflaged Object
Detection [43.79585695098729]
We study the application of transformer networks to salient object detection (SOD).
Specifically, we adopt a dense transformer backbone for fully supervised RGB image-based SOD, RGB-D image-pair-based SOD, and weakly supervised SOD via scribble supervision.
As an extension, we also apply our fully supervised model to the task of camouflaged object detection (COD) for camouflaged object segmentation.
arXiv Detail & Related papers (2021-04-20T17:12:51Z) - Remote sensing image fusion based on Bayesian GAN [9.852262451235472]
We build a two-stream generator network with panchromatic (PAN) and multispectral (MS) images as input, which consists of three parts: feature extraction, feature fusion, and image reconstruction.
We leverage a Markov discriminator to enhance the generator's ability to reconstruct the fused image, so that the result retains more details.
Experiments on QuickBird and WorldView datasets show that the model proposed in this paper can effectively fuse PAN and MS images.
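A minimal sketch of the three-part layout named in this entry (separate PAN and MS feature extraction, concatenation-based fusion, and a reconstruction head); channel counts and layer choices are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class TwoStreamFusionGenerator(nn.Module):
    """Illustrative two-stream generator: extract features from PAN and MS
    inputs separately, fuse them, and reconstruct a fused MS-like image."""
    def __init__(self, ms_bands=4, feat=32):
        super().__init__()
        self.pan_branch = nn.Sequential(nn.Conv2d(1, feat, 3, padding=1), nn.ReLU())
        self.ms_branch = nn.Sequential(nn.Conv2d(ms_bands, feat, 3, padding=1), nn.ReLU())
        self.fuse = nn.Sequential(nn.Conv2d(2 * feat, feat, 3, padding=1), nn.ReLU())
        self.reconstruct = nn.Conv2d(feat, ms_bands, 3, padding=1)

    def forward(self, pan, ms_upsampled):
        f = torch.cat([self.pan_branch(pan), self.ms_branch(ms_upsampled)], dim=1)
        return self.reconstruct(self.fuse(f))  # fused high-resolution image

# Usage: PAN and (upsampled) MS patches of the same spatial size.
out = TwoStreamFusionGenerator()(torch.randn(1, 1, 64, 64), torch.randn(1, 4, 64, 64))
```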
arXiv Detail & Related papers (2020-09-20T16:15:51Z) - Uncertainty Inspired RGB-D Saliency Detection [70.50583438784571]
We propose the first framework to employ uncertainty for RGB-D saliency detection by learning from the data labeling process.
Inspired by the saliency data labeling process, we propose a generative architecture to achieve probabilistic RGB-D saliency detection.
Results on six challenging RGB-D benchmark datasets show our approach's superior performance in learning the distribution of saliency maps.
arXiv Detail & Related papers (2020-09-07T13:01:45Z)