Semantic Segmentation Enhanced Transformer Model for Human Attention Prediction
- URL: http://arxiv.org/abs/2301.11022v1
- Date: Thu, 26 Jan 2023 10:27:51 GMT
- Title: Semantic Segmentation Enhanced Transformer Model for Human Attention Prediction
- Authors: Shuo Zhang
- Abstract summary: Saliency Prediction aims to predict the attention distribution of human eyes given an RGB image.
Most of the recent state-of-the-art methods are based on deep image feature representations from traditional CNNs.
We propose a Transformer-based method with semantic segmentation as another learning objective.
- Score: 8.47446520519624
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Saliency Prediction aims to predict the attention distribution of human eyes given an RGB image. Most recent state-of-the-art methods are based on deep image feature representations from traditional CNNs. However, traditional convolutions cannot capture global image features well because of their small kernel sizes. Moreover, high-level factors that closely correlate with human visual perception, e.g., objects, color, and light, are not considered. Motivated by these observations, we propose a Transformer-based method with semantic segmentation as an additional learning objective. The Transformer can capture more global cues of the image. In addition, simultaneously learning object segmentation simulates human visual perception, a connection we verify through an investigation of human gaze control in cognitive science. We build an extra decoder for the subtask, and the multiple tasks share the same Transformer encoder, forcing it to learn from multiple feature spaces. In practice, we find that simply adding the subtask can confuse learning of the main task, so we propose a Multi-task Attention Module to handle the feature interaction between the multiple learning targets. Our method achieves competitive performance compared to other state-of-the-art methods.
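The design described in the abstract (one shared Transformer encoder, a saliency decoder for the main task, a segmentation decoder for the subtask, and an attention module mediating their interaction) can be pictured with a minimal PyTorch sketch. Everything below, including the cross-task attention design, module names, and sizes, is an illustrative assumption rather than the paper's released implementation:

```python
# Minimal sketch of the multi-task idea from the abstract: a shared Transformer
# encoder, two decoders (saliency + semantic segmentation), and a cross-task
# attention block standing in for the paper's Multi-task Attention Module.
# All names, sizes, and the attention design are assumptions for illustration.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, dim=256, depth=4, heads=8, patch=16, img=224):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patch embedding
        self.pos = nn.Parameter(torch.zeros(1, (img // patch) ** 2, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):                                 # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.encoder(tokens + self.pos)

class CrossTaskAttention(nn.Module):
    """Hypothetical stand-in for the Multi-task Attention Module: one task's
    tokens attend to the other task's tokens before decoding."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, q, kv):
        out, _ = self.attn(q, kv, kv)
        return self.norm(q + out)                         # residual + norm

class Decoder(nn.Module):
    def __init__(self, dim=256, out_ch=1, grid=14):
        super().__init__()
        self.grid = grid
        self.up = nn.Sequential(                          # 14x14 tokens -> 224x224 map
            nn.ConvTranspose2d(dim, dim // 2, 4, stride=4), nn.GELU(),
            nn.ConvTranspose2d(dim // 2, out_ch, 4, stride=4))

    def forward(self, tokens):                            # tokens: (B, N, dim)
        b, _, d = tokens.shape
        fmap = tokens.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        return self.up(fmap)

class SaliencySegNet(nn.Module):
    def __init__(self, dim=256, num_classes=21):
        super().__init__()
        self.encoder = SharedEncoder(dim)
        self.sal_proj, self.seg_proj = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.sal_attn, self.seg_attn = CrossTaskAttention(dim), CrossTaskAttention(dim)
        self.sal_dec = Decoder(dim, out_ch=1)             # main task: saliency map
        self.seg_dec = Decoder(dim, out_ch=num_classes)   # subtask: segmentation

    def forward(self, x):
        shared = self.encoder(x)                          # shared feature space
        sal_t, seg_t = self.sal_proj(shared), self.seg_proj(shared)
        sal = self.sal_dec(self.sal_attn(sal_t, seg_t))   # saliency attends to seg
        seg = self.seg_dec(self.seg_attn(seg_t, sal_t))   # seg attends to saliency
        return torch.sigmoid(sal), seg

model = SaliencySegNet()
sal, seg = model(torch.randn(2, 3, 224, 224))
print(sal.shape, seg.shape)  # (2, 1, 224, 224) and (2, 21, 224, 224)
```

A joint loss (e.g., a saliency loss on the first output plus cross-entropy on the second) would then drive the shared encoder to learn from both feature spaces, which is the mechanism the abstract describes.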
Related papers
- Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z)
- Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with their environments in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy-learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z)
- ConvTransSeg: A Multi-resolution Convolution-Transformer Network for Medical Image Segmentation [14.485482467748113]
We propose a hybrid encoder-decoder segmentation model (ConvTransSeg).
It consists of a multi-layer CNN as the encoder for feature learning and the corresponding multi-level Transformer as the decoder for segmentation prediction.
Our method achieves the best performance in terms of Dice coefficient and average symmetric surface distance measures with low model complexity and memory consumption.
arXiv Detail & Related papers (2022-10-13T14:59:23Z)
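Since the entry above describes a concrete hybrid design, a minimal sketch may help picture it: a small CNN encoder extracts features and a Transformer stage refines them before a segmentation head restores full resolution. This is a simplification under our own assumptions (the paper describes a multi-level Transformer decoder; the layer counts and sizes here are illustrative guesses):

```python
# Toy hybrid CNN + Transformer segmentation model, in the spirit of the
# ConvTransSeg description above; not the published architecture.
import torch
import torch.nn as nn

class HybridSegNet(nn.Module):
    def __init__(self, dim=64, classes=4, heads=4, depth=2):
        super().__init__()
        self.cnn = nn.Sequential(                         # CNN encoder, stride 4 overall
            nn.Conv2d(3, dim, 3, 2, 1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, 2, 1), nn.GELU())
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.trans = nn.TransformerEncoder(layer, depth)  # token-level refinement
        self.head = nn.ConvTranspose2d(dim, classes, 4, stride=4)  # back to full res

    def forward(self, x):
        f = self.cnn(x)                                   # (B, dim, H/4, W/4)
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)             # (B, H*W/16, dim)
        f = self.trans(tokens).transpose(1, 2).reshape(b, c, h, w)
        return self.head(f)                               # (B, classes, H, W)

logits = HybridSegNet()(torch.randn(1, 3, 128, 128))
print(logits.shape)  # torch.Size([1, 4, 128, 128])
```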
- A domain adaptive deep learning solution for scanpath prediction of paintings [66.46953851227454]
This paper focuses on the eye-movement analysis of viewers during the visual experience of a certain number of paintings.
We introduce a new approach to predicting human visual attention, a process that affects several human cognitive functions.
The proposed architecture ingests images and returns scanpaths, i.e., sequences of points with a high likelihood of catching viewers' attention.
arXiv Detail & Related papers (2022-09-22T22:27:08Z)
- Neural Novel Actor: Learning a Generalized Animatable Neural Representation for Human Actors [98.24047528960406]
We propose a new method for learning a generalized animatable neural representation from a sparse set of multi-view imagery of multiple persons.
The learned representation can be used to synthesize novel view images of an arbitrary person from a sparse set of cameras, and further animate them with the user's pose control.
arXiv Detail & Related papers (2022-08-25T07:36:46Z)
- Visual Transformer for Object Detection [0.0]
We consider the use of self-attention for discriminative visual tasks, specifically object detection, as an alternative to convolutions.
Our model leads to consistent improvements in object detection on COCO across many different models and scales.
arXiv Detail & Related papers (2022-06-01T06:13:09Z)
- The Devil is in the Frequency: Geminated Gestalt Autoencoder for Self-Supervised Visual Pre-Training [13.087987450384036]
We present a new Masked Image Modeling (MIM) method for visual pre-training, termed the Geminated Gestalt Autoencoder (Ge2-AE).
Specifically, we equip our model with geminated decoders in charge of reconstructing image contents from both pixel and frequency space.
arXiv Detail & Related papers (2022-04-18T09:22:55Z)
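The dual-space reconstruction idea in the entry above lends itself to a short sketch: one decoder is supervised in pixel space and its twin in frequency space via a 2-D FFT. The toy encoder/decoders, the log-magnitude spectrum target, and the equal loss weights are all our own assumptions, not the paper's design:

```python
# Illustrative dual-space reconstruction objective (pixel + frequency),
# loosely inspired by the Ge2-AE summary above; architecture is a toy.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualSpaceAE(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, ch, 3, 2, 1), nn.GELU(),
                                 nn.Conv2d(ch, ch, 3, 2, 1), nn.GELU())
        # "Geminated" decoders: one per reconstruction space.
        self.pix_dec = nn.Sequential(nn.ConvTranspose2d(ch, ch, 4, 2, 1), nn.GELU(),
                                     nn.ConvTranspose2d(ch, 3, 4, 2, 1))
        self.freq_dec = nn.Sequential(nn.ConvTranspose2d(ch, ch, 4, 2, 1), nn.GELU(),
                                      nn.ConvTranspose2d(ch, 3, 4, 2, 1))

    def forward(self, x):
        z = self.enc(x)
        return self.pix_dec(z), self.freq_dec(z)

def dual_loss(model, img):
    pix_hat, freq_hat = model(img)
    loss_pix = F.mse_loss(pix_hat, img)                   # pixel-space target
    target_mag = torch.fft.fft2(img).abs().log1p()        # log-magnitude spectrum
    loss_freq = F.mse_loss(freq_hat, target_mag)          # frequency-space target
    return loss_pix + loss_freq                           # equal weights: an assumption

model = DualSpaceAE()
print(dual_loss(model, torch.randn(2, 3, 64, 64)).item())
```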
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds on the observation that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- Self-Supervised Representation Learning from Flow Equivariance [97.13056332559526]
We present a new self-supervised representation learning framework that can be deployed directly on a video stream of complex scenes.
Our representations, learned from high-resolution raw video, can be readily used for downstream tasks on static images.
arXiv Detail & Related papers (2021-01-16T23:44:09Z)
- Combining Semantic Guidance and Deep Reinforcement Learning For Generating Human Level Paintings [22.889059874754242]
Generation of stroke-based non-photorealistic imagery is an important problem in the computer vision community.
Previous methods have been limited to datasets with little variation in position, scale and saliency of the foreground object.
We propose a Semantic Guidance pipeline that includes a bi-level painting procedure for learning the distinction between foreground and background brush strokes at training time.
arXiv Detail & Related papers (2020-11-25T09:00:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.