Regional Attention Network (RAN) for Head Pose and Fine-grained Gesture
Recognition
- URL: http://arxiv.org/abs/2101.06634v1
- Date: Sun, 17 Jan 2021 10:14:28 GMT
- Title: Regional Attention Network (RAN) for Head Pose and Fine-grained Gesture
Recognition
- Authors: Ardhendu Behera, Zachary Wharton, Morteza Ghahremani, Swagat Kumar,
Nik Bessis
- Abstract summary: We propose a novel end-to-end Regional Attention Network (RAN), which is a fully Convolutional Neural Network (CNN).
Our regions consist of one or more consecutive cells and are adapted from the strategies used in computing HOG (Histogram of Oriented Gradient) descriptor.
The proposed approach outperforms the state-of-the-art by a considerable margin in different metrics.
- Score: 9.131161856493486
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Affect is often expressed via non-verbal body language such as
actions/gestures, which are vital indicators for human behaviors. Recent
studies on recognition of fine-grained actions/gestures in monocular images
have mainly focused on modeling spatial configuration of body parts
representing body pose, human-objects interactions and variations in local
appearance. The results show that this is a brittle approach since it relies on
accurate body parts/objects detection. In this work, we argue that there exist
local discriminative semantic regions, whose "informativeness" can be evaluated
by the attention mechanism for inferring fine-grained gestures/actions. To this
end, we propose a novel end-to-end Regional Attention Network (RAN),
which is a fully Convolutional Neural Network (CNN) to combine multiple
contextual regions through attention mechanism, focusing on parts of the images
that are most relevant to a given task. Our regions consist of one or more
consecutive cells and are adapted from the strategies used in computing HOG
(Histogram of Oriented Gradient) descriptor. The model is extensively evaluated
on ten datasets belonging to 3 different scenarios: 1) head pose recognition,
2) driver state recognition, and 3) human action and facial expression
recognition. The proposed approach outperforms the state-of-the-art by a
considerable margin in different metrics.
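The abstract's core idea — regions built from one or more consecutive cells, HOG-block style, then combined by attention — can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the mean-pooled region descriptor and the random linear scorer are stand-ins for the paper's learned CNN features and attention sub-network.

```python
import numpy as np

def make_regions(h_cells, w_cells, region_size=2):
    """Enumerate regions of consecutive cells, as in HOG block
    construction: region_size x region_size windows of cells,
    slid with stride 1 over the cell grid."""
    regions = []
    for i in range(h_cells - region_size + 1):
        for j in range(w_cells - region_size + 1):
            regions.append([(i + di, j + dj)
                            for di in range(region_size)
                            for dj in range(region_size)])
    return regions

def regional_attention(feat, region_size=2):
    """feat: (h_cells, w_cells, c) array of per-cell descriptors.
    Pools each region, scores it with a (random, untrained) linear
    scorer standing in for the learned attention module, softmaxes
    the scores, and returns the attention-weighted combination."""
    h, w, c = feat.shape
    regions = make_regions(h, w, region_size)
    # Region descriptor: mean of its cells' features.
    descs = np.stack([
        np.mean([feat[i, j] for (i, j) in region], axis=0)
        for region in regions
    ])                                   # (n_regions, c)
    rng = np.random.default_rng(0)
    w_score = rng.standard_normal(c)     # hypothetical learned scorer
    scores = descs @ w_score             # one score per region
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()                   # softmax over regions
    return attn @ descs                  # (c,) attended representation

feat = np.random.default_rng(1).standard_normal((4, 4, 8))
out = regional_attention(feat)
print(out.shape)
```

A 4x4 cell grid with 2x2 regions yields nine overlapping regions, so the attention distribution here has nine entries; in the paper the per-region "informativeness" weights are learned end-to-end rather than produced by a fixed projection.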
Related papers
- Finding Shared Decodable Concepts and their Negations in the Brain [4.111712524255376]
We train a highly accurate contrastive model that maps brain responses during naturalistic image viewing to CLIP embeddings.
We then use a novel adaptation of the DBSCAN clustering algorithm to cluster the parameters of participant-specific contrastive models.
Examining the images most and least associated with each SDC cluster gives us additional insight into the semantic properties of each SDC.
arXiv Detail & Related papers (2024-05-27T21:28:26Z)
- Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z)
- Collaborative Feature Learning for Fine-grained Facial Forgery Detection and Segmentation [56.73855202368894]
Previous work related to forgery detection mostly focuses on the entire faces.
Recent forgery methods have developed to edit important facial components while maintaining others unchanged.
We propose a collaborative feature learning approach to simultaneously detect manipulation and segment the falsified components.
arXiv Detail & Related papers (2023-04-17T08:49:11Z)
- Skeletal Human Action Recognition using Hybrid Attention based Graph Convolutional Network [3.261599248682793]
We propose a new adaptive spatial attention layer that extends local attention map to global based on relative distance and relative angle information.
We design a new initial graph adjacency matrix that connects head, hands and feet, which shows visible improvement in terms of action recognition accuracy.
The proposed model is evaluated on two large-scale and challenging datasets in the field of human activities in daily life.
arXiv Detail & Related papers (2022-07-12T12:22:21Z)
- KTN: Knowledge Transfer Network for Learning Multi-person 2D-3D Correspondences [77.56222946832237]
We present a novel framework to detect the densepose of multiple people in an image.
The proposed method, which we refer to as the Knowledge Transfer Network (KTN), tackles two main problems.
It simultaneously maintains feature resolution and suppresses background pixels, and this strategy results in substantial increase in accuracy.
arXiv Detail & Related papers (2022-06-21T03:11:37Z)
- Head and eye egocentric gesture recognition for human-robot interaction using eyewear cameras [4.344337854565144]
This work addresses the problem of human gesture recognition.
In particular, we focus on head and eye gestures, and adopt an egocentric (first-person) perspective using eyewear cameras.
A motion-based recognition approach is proposed, which operates at two temporal granularities.
arXiv Detail & Related papers (2022-01-27T13:26:05Z)
- Understanding Character Recognition using Visual Explanations Derived from the Human Visual System and Deep Networks [6.734853055176694]
We examine the congruence, or lack thereof, in the information-gathering strategies of deep neural networks.
For correctly classified characters, the deep learning model attended to the same character regions that humans fixated on.
We propose to use the visual fixation maps obtained from the eye-tracking experiment as a supervisory input to align the model's focus on relevant character regions.
arXiv Detail & Related papers (2021-08-10T10:09:37Z)
- Rethinking of the Image Salient Object Detection: Object-level Semantic Saliency Re-ranking First, Pixel-wise Saliency Refinement Latter [62.26677215668959]
We propose a lightweight, weakly supervised deep network to coarsely locate semantically salient regions.
We then fuse multiple off-the-shelf deep models on these semantically salient regions as the pixel-wise saliency refinement.
Our method is simple yet effective, which is the first attempt to consider the salient object detection mainly as an object-level semantic re-ranking problem.
arXiv Detail & Related papers (2020-08-10T07:12:43Z)
- Diagnosing Rarity in Human-Object Interaction Detection [6.129776019898014]
Human-object interaction (HOI) detection is a core task in computer vision.
The goal is to localize all human-object pairs and recognize their interactions.
An interaction, defined by a ⟨verb, noun⟩ pair, leads to a long-tailed visual recognition challenge.
arXiv Detail & Related papers (2020-06-10T08:35:29Z)
- Ventral-Dorsal Neural Networks: Object Detection via Selective Attention [51.79577908317031]
We propose a new framework called Ventral-Dorsal Networks (VDNets).
Inspired by the structure of the human visual system, we propose the integration of a "Ventral Network" and a "Dorsal Network".
Our experimental results reveal that the proposed method outperforms state-of-the-art object detection approaches.
arXiv Detail & Related papers (2020-05-15T23:57:36Z)
- Structured Landmark Detection via Topology-Adapting Deep Graph Learning [75.20602712947016]
We present a new topology-adapting deep graph learning approach for accurate anatomical facial and medical landmark detection.
The proposed method constructs graph signals leveraging both local image features and global shape features.
Experiments are conducted on three public facial image datasets (WFLW, 300W, and COFW-68) as well as three real-world X-ray medical datasets (Cephalometric (public), Hand, and Pelvis).
arXiv Detail & Related papers (2020-04-17T11:55:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.