Interaction-aware Joint Attention Estimation Using People Attributes
- URL: http://arxiv.org/abs/2308.05382v1
- Date: Thu, 10 Aug 2023 06:55:51 GMT
- Title: Interaction-aware Joint Attention Estimation Using People Attributes
- Authors: Chihiro Nakatani, Hiroaki Kawashima, Norimichi Ukita
- Abstract summary: This paper proposes joint attention estimation in a single image.
For the interaction modeling, we propose a novel Transformer-based attention network to encode joint attention as low-dimensional features.
Our method outperforms SOTA methods quantitatively in comparative experiments.
- Score: 6.8603181780291065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes joint attention estimation in a single image. Different
from related work in which only the gaze-related attributes of people are
independently employed, (i) their locations and actions are also employed as
contextual cues for weighting their attributes, and (ii) interactions among all
of these attributes are explicitly modeled in our method. For the interaction
modeling, we propose a novel Transformer-based attention network to encode
joint attention as low-dimensional features. We introduce a specialized MLP
head with positional embedding to the Transformer so that it predicts pixelwise
confidence of joint attention for generating the confidence heatmap. This
pixelwise prediction improves the heatmap accuracy by avoiding the ill-posed
problem in which the high-dimensional heatmap is predicted from the
low-dimensional features. The estimated joint attention is further improved by
being integrated with general image-based attention estimation. Our method
outperforms SOTA methods quantitatively in comparative experiments. Code:
https://anonymous.4open.science/r/anonymized_codes-ECA4.
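As a rough illustration of the pipeline described in the abstract, the sketch below pairs a Transformer encoder over per-person attribute tokens with an MLP head that concatenates a learnable per-pixel positional embedding to the encoded low-dimensional joint-attention feature and predicts a confidence value for each heatmap pixel. This is a minimal PyTorch sketch under assumed module names, dimensions, and token construction (the contextual weighting by locations and actions is folded into the attribute tokens here for brevity); it is not the authors' released implementation, which is linked above.

```python
# Minimal sketch (assumptions, not the authors' code): Transformer encoder over
# per-person attribute tokens -> low-dimensional joint-attention feature ->
# MLP head with per-pixel positional embedding -> pixelwise confidence heatmap.
import torch
import torch.nn as nn


class PixelwiseJointAttentionHead(nn.Module):
    def __init__(self, feat_dim=64, pos_dim=32, heatmap_size=64):
        super().__init__()
        self.heatmap_size = heatmap_size
        # Learnable positional embedding, one vector per heatmap pixel.
        self.pos_embed = nn.Parameter(torch.randn(heatmap_size * heatmap_size, pos_dim))
        # MLP head: joint-attention feature + pixel position -> scalar confidence.
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + pos_dim, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, 1),
        )

    def forward(self, joint_feat):  # joint_feat: (B, feat_dim)
        B = joint_feat.size(0)
        P = self.pos_embed.size(0)
        # Pair the shared low-dimensional feature with every pixel's positional embedding.
        feat = joint_feat.unsqueeze(1).expand(B, P, -1)               # (B, P, feat_dim)
        pos = self.pos_embed.unsqueeze(0).expand(B, P, -1)            # (B, P, pos_dim)
        conf = self.mlp(torch.cat([feat, pos], dim=-1)).squeeze(-1)   # (B, P)
        return conf.view(B, self.heatmap_size, self.heatmap_size)     # confidence heatmap


class JointAttentionEstimator(nn.Module):
    def __init__(self, attr_dim=16, feat_dim=64, heatmap_size=64):
        super().__init__()
        # Per-person attribute tokens (gaze, location, action) projected to a common width.
        self.token_proj = nn.Linear(attr_dim, feat_dim)
        encoder_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True)
        # Self-attention models interactions among all people's attribute tokens.
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = PixelwiseJointAttentionHead(feat_dim, heatmap_size=heatmap_size)

    def forward(self, person_attrs):  # person_attrs: (B, num_people, attr_dim)
        tokens = self.token_proj(person_attrs)
        encoded = self.encoder(tokens)
        joint_feat = encoded.mean(dim=1)  # pool to a low-dimensional joint-attention feature
        return self.head(joint_feat)


if __name__ == "__main__":
    model = JointAttentionEstimator()
    heatmap = model(torch.randn(2, 5, 16))  # 2 images, 5 people, 16-dim attributes each
    print(heatmap.shape)  # torch.Size([2, 64, 64])
```

The point of the per-pixel head is visible in `PixelwiseJointAttentionHead.forward`: each confidence value is predicted from the shared low-dimensional feature plus one pixel's position, rather than regressing the entire high-dimensional heatmap from that feature in a single output layer, which is the ill-posed mapping the abstract refers to.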
Related papers
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z) - Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence [92.07601770031236]
We investigate semantically meaningful patterns in the attention heads of an encoder-only Transformer architecture.
We find that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization.
arXiv Detail & Related papers (2024-09-20T07:41:47Z) - AFF-ttention! Affordances and Attention models for Short-Term Object Interaction Anticipation [14.734158936250918]
Short-Term object-interaction Anticipation is fundamental for wearable assistants or human-robot interaction to understand user goals.
We improve the performance of STA predictions with two contributions.
First, we propose STAformer, a novel attention-based architecture integrating frame-guided temporal pooling, dual image-video attention, and multiscale feature fusion.
Second, we predict interaction hotspots from the observation of hands and object trajectories, increasing confidence in STA predictions localized around the hotspot.
arXiv Detail & Related papers (2024-06-03T10:57:18Z) - Disentangled Interaction Representation for One-Stage Human-Object
Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z) - Explicit Correspondence Matching for Generalizable Neural Radiance
Fields [49.49773108695526]
We present a new NeRF method that is able to generalize to new unseen scenarios and perform novel view synthesis with as few as two source views.
The explicit correspondence matching is quantified with the cosine similarity between image features sampled at the 2D projections of a 3D point on different views.
Our method achieves state-of-the-art results on different evaluation settings, with the experiments showing a strong correlation between our learned cosine feature similarity and volume density.
arXiv Detail & Related papers (2023-04-24T17:46:01Z) - Adaptive Local-Component-aware Graph Convolutional Network for One-shot
Skeleton-based Action Recognition [54.23513799338309]
We present an Adaptive Local-Component-aware Graph Convolutional Network for skeleton-based action recognition.
Our method provides a stronger representation than the global embedding and helps our model reach state-of-the-art.
arXiv Detail & Related papers (2022-09-21T02:33:07Z) - HHP-Net: A light Heteroscedastic neural network for Head Pose estimation
with uncertainty [2.064612766965483]
We introduce a novel method to estimate the head pose of people in single images starting from a small set of head keypoints.
Our model is simple to implement and more efficient than the state of the art.
arXiv Detail & Related papers (2021-11-02T08:55:45Z) - SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Attention map visualization of a pre-trained model is one direct method for understanding self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT.
arXiv Detail & Related papers (2021-02-25T14:13:44Z) - Explicitly Modeled Attention Maps for Image Classification [35.72763148637619]
Self-attention networks have shown remarkable progress in computer vision tasks such as image classification.
We propose a novel self-attention module with explicitly modeled attention-maps using only a single learnable parameter for low computational overhead.
Our method achieves an accuracy improvement of up to 2.2% over the ResNet baselines on ImageNet ILSVRC.
arXiv Detail & Related papers (2020-06-14T11:47:09Z) - Attention improves concentration when learning node embeddings [1.2233362977312945]
Given nodes labelled with search query text, we want to predict links to related queries that share products.
Experiments with a range of deep neural architectures show that simple feedforward networks with an attention mechanism perform best for learning embeddings.
We propose an analytically tractable model of query generation, AttEST, that views both products and the query text as vectors embedded in a latent space.
arXiv Detail & Related papers (2020-06-11T21:21:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.