Unsupervised Foveal Vision Neural Networks with Top-Down Attention
- URL: http://arxiv.org/abs/2010.09103v1
- Date: Sun, 18 Oct 2020 20:55:49 GMT
- Title: Unsupervised Foveal Vision Neural Networks with Top-Down Attention
- Authors: Ryan Burt, Nina N. Thigpen, Andreas Keil, Jose C. Principe
- Abstract summary: We propose the fusion of bottom-up saliency and top-down attention employing only unsupervised learning techniques.
We test the performance of the proposed Gamma saliency technique on the Toronto and CAT2000 databases.
We also develop a top-down attention mechanism based on the Gamma saliency applied to the top layer of CNNs to improve scene understanding in multi-object images or images with strong background clutter.
- Score: 0.3058685580689604
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning architectures are an extremely powerful tool for recognizing
and classifying images. However, they require supervised learning and normally
work on vectors the size of image pixels and produce the best results when
trained on millions of object images. To help mitigate these issues, we propose
the fusion of bottom-up saliency and top-down attention employing only
unsupervised learning techniques, which helps the object recognition module to
focus on relevant data and learn important features that can later be
fine-tuned for a specific task. In addition, by utilizing only relevant
portions of the data, the training speed can be greatly improved. We test the
performance of the proposed Gamma saliency technique on the Toronto and CAT2000
databases, and the foveated vision in the Street View House Numbers (SVHN)
database. The results in foveated vision show that Gamma saliency is comparable
to the best and computationally faster. The results in SVHN show that our
unsupervised cognitive architecture is comparable to fully supervised methods
and that the Gamma saliency also improves CNN performance if desired. We also
develop a top-down attention mechanism based on the Gamma saliency applied to
the top layer of CNNs to improve scene understanding in multi-object images or
images with strong background clutter. When we compare the results with human
observers in an image dataset of animals occluded in natural scenes, we show
that top-down attention is capable of disambiguating object from background and
improves system performance beyond the level of human observers.
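The saliency-then-foveation pipeline the abstract describes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the radial 2D gamma-kernel parameterization, the center-surround pairing of a narrow and a broad kernel, and all parameter values (`k1`, `k2`, `mu`, kernel size, fovea radius) are hypothetical choices for the sketch.

```python
import numpy as np

def gamma_kernel(k, mu, size=21):
    # Hypothetical 2D radially symmetric gamma kernel:
    # g(r) proportional to r^(k-1) * exp(-mu * r), normalized to sum to 1.
    ax = np.arange(size) - size // 2
    r = np.hypot(*np.meshgrid(ax, ax))
    g = (r ** (k - 1)) * np.exp(-mu * r)
    return g / g.sum()

def gamma_saliency(img, k1=1, k2=5, mu=0.4, size=21):
    # Center-surround saliency: difference of a narrow (center) and a
    # broad (surround) gamma kernel, convolved over the image.
    kernel = gamma_kernel(k1, mu, size) - gamma_kernel(k2, mu, size)
    pad = size // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            # Kernel is radially symmetric, so flipping is a no-op;
            # this is a plain 'same'-mode correlation.
            out[i, j] = np.sum(padded[i:i + size, j:j + size] * kernel)
    return out

def fovea_crop(img, saliency, radius=8):
    # Foveate: crop a square patch centered on the saliency peak,
    # so the recognition module only sees the most salient region.
    i, j = np.unravel_index(np.argmax(saliency), saliency.shape)
    i0, j0 = max(i - radius, 0), max(j - radius, 0)
    return img[i0:i0 + 2 * radius, j0:j0 + 2 * radius]
```

For an image containing a single bright spot, the saliency map peaks at the spot and `fovea_crop` returns a patch containing it; in the paper's setting the crop would then be fed to the unsupervised recognition module.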
Related papers
- What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z)
- Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning [0.0]
We propose a novel pre-training framework by adopting both self-supervised and supervised visual pre-text tasks in a multi-task manner.
Results show that our pre-trained models can deliver results on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z)
- Supervised and Contrastive Self-Supervised In-Domain Representation Learning for Dense Prediction Problems in Remote Sensing [0.0]
This paper explores the effectiveness of in-domain representations in both supervised and self-supervised forms to solve the domain difference between remote sensing and the ImageNet dataset.
For self-supervised pre-training, we have utilized the SimSiam algorithm as it is simple and does not need huge computational resources.
Our results have demonstrated that using datasets with a high spatial resolution for self-supervised representation learning leads to high performance in downstream tasks.
arXiv Detail & Related papers (2023-01-29T20:56:51Z)
- A domain adaptive deep learning solution for scanpath prediction of paintings [66.46953851227454]
This paper focuses on the eye-movement analysis of viewers during the visual experience of a certain number of paintings.
We introduce a new approach to predicting human visual attention, a process that underlies several human cognitive functions.
The proposed new architecture ingests images and returns scanpaths, a sequence of points featuring a high likelihood of catching viewers' attention.
arXiv Detail & Related papers (2022-09-22T22:27:08Z)
- Exploring CLIP for Assessing the Look and Feel of Images [87.97623543523858]
We introduce Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner.
Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments.
arXiv Detail & Related papers (2022-07-25T17:58:16Z)
- In-N-Out Generative Learning for Dense Unsupervised Video Segmentation [89.21483504654282]
In this paper, we focus on the unsupervised Video Object (VOS) task which learns visual correspondence from unlabeled videos.
We propose the In-aNd-Out (INO) generative learning from a purely generative perspective, which captures both high-level and fine-grained semantics.
Our INO outperforms previous state-of-the-art methods by significant margins.
arXiv Detail & Related papers (2022-03-29T07:56:21Z)
- Hybrid BYOL-ViT: Efficient approach to deal with small Datasets [0.0]
In this paper, we investigate how self-supervision with strong and sufficient augmentation of unlabeled data can effectively train the first layers of a neural network.
We show that the low-level features derived from a self-supervised architecture can improve the robustness and the overall performance of this emergent architecture.
arXiv Detail & Related papers (2021-11-08T21:44:31Z)
- Unsupervised Object-Level Representation Learning from Scene Images [97.07686358706397]
Object-level Representation Learning (ORL) is a new self-supervised learning framework towards scene images.
Our key insight is to leverage image-level self-supervised pre-training as the prior to discover object-level semantic correspondence.
ORL significantly improves the performance of self-supervised learning on scene images, even surpassing supervised ImageNet pre-training on several downstream tasks.
arXiv Detail & Related papers (2021-06-22T17:51:24Z)
- Efficient Self-supervised Vision Transformers for Representation Learning [86.57557009109411]
We show that multi-stage architectures with sparse self-attentions can significantly reduce modeling complexity.
We propose a new pre-training task of region matching which allows the model to capture fine-grained region dependencies.
Our results show that combining the two techniques, EsViT achieves 81.3% top-1 on the ImageNet linear probe evaluation.
arXiv Detail & Related papers (2021-06-17T19:57:33Z)
- A Framework for Fast Scalable BNN Inference using Googlenet and Transfer Learning [0.0]
This thesis aims to achieve high accuracy in object detection with good real-time performance.
The binarized neural network has shown high performance in various vision tasks such as image classification, object detection, and semantic segmentation.
Results show that the transfer learning method detects objects with higher accuracy than existing methods.
arXiv Detail & Related papers (2021-01-04T06:16:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.