Symbiotic Attention with Privileged Information for Egocentric Action
Recognition
- URL: http://arxiv.org/abs/2002.03137v1
- Date: Sat, 8 Feb 2020 10:48:43 GMT
- Title: Symbiotic Attention with Privileged Information for Egocentric Action
Recognition
- Authors: Xiaohan Wang, Yu Wu, Linchao Zhu, Yi Yang
- Abstract summary: We propose a novel Symbiotic Attention framework for egocentric video recognition.
Our framework enables mutual communication among the verb branch, the noun branch, and the privileged information.
Notably, it achieves state-of-the-art performance on two large-scale egocentric video datasets.
- Score: 71.0778513390334
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Egocentric video recognition is a natural testbed for diverse interaction
reasoning. Due to the large action vocabulary in egocentric video datasets,
recent studies usually utilize a two-branch structure for action recognition,
i.e., one branch for verb classification and the other for noun
classification. However, the correlation between the verb and the noun
branches has been largely ignored. Moreover, the two branches fail to exploit
local features due to the absence of a position-aware attention mechanism. In
this paper, we propose a novel Symbiotic Attention framework leveraging
Privileged information (SAP) for egocentric video recognition. Finer
position-aware object-detection features can facilitate the understanding of
the actor's interaction with objects. We introduce these features into action
recognition and regard them as privileged information. Our framework enables
mutual communication among the verb branch, the noun branch, and the privileged
information. This communication process not only injects local details into
global features but also exploits implicit guidance about the spatio-temporal
position of an ongoing action. We introduce a novel symbiotic attention (SA)
mechanism to enable effective communication. It first normalizes the
detection-guided features on one branch to underline the action-relevant
information from the other branch. SA adaptively enhances the interactions
among the three sources. To further catalyze this communication, spatial
relations are uncovered for the selection of the most action-relevant
information, identifying the most valuable and discriminative features for
classification. We validate the effectiveness of
our SAP quantitatively and qualitatively. Notably, it achieves
state-of-the-art performance on two large-scale egocentric video datasets.
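
To make the described communication concrete, below is a minimal PyTorch-style sketch of one direction of the pattern: detection features (the privileged information) inject local detail into one branch's global feature, the result is normalized, and attention against the other branch selects the most action-relevant information. Module names, dimensions, and the fusion details here are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SymbioticAttentionSketch(nn.Module):
    """Hypothetical sketch of the communication described in the abstract."""

    def __init__(self, dim=512):
        super().__init__()
        self.det_proj = nn.Linear(dim, dim)  # project detection features
        self.norm = nn.LayerNorm(dim)        # normalize detection-guided features

    def forward(self, branch_a, branch_b, det_feats):
        # branch_a: (B, D) global feature of one branch (e.g., noun)
        # branch_b: (B, D) global feature of the other branch (e.g., verb)
        # det_feats: (B, N, D) position-aware object-detection features
        # 1) inject local detection details into branch_a's global feature
        guided = self.norm(branch_a.unsqueeze(1) + self.det_proj(det_feats))
        # 2) score each detection-guided feature against the other branch
        scores = torch.einsum('bnd,bd->bn', guided, branch_b) / guided.size(-1) ** 0.5
        weights = F.softmax(scores, dim=1)
        # 3) select and aggregate the most action-relevant information
        selected = torch.einsum('bn,bnd->bd', weights, guided)
        return branch_a + selected  # enhanced feature for classification

Since the paper describes mutual communication, a symmetric pass with the verb and noun features swapped would presumably be applied as well; only one direction is shown here for brevity.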
Related papers
- Dual Relation Mining Network for Zero-Shot Learning [48.89161627050706]
We propose a Dual Relation Mining Network (DRMN) to enable effective visual-semantic interactions and learn semantic relationships among attributes for knowledge transfer.
Specifically, we introduce a Dual Attention Block (DAB) for visual-semantic relationship mining, which enriches visual information by multi-level feature fusion.
For semantic relationship modeling, we utilize a Semantic Interaction Transformer (SIT) to enhance the generalization of attribute representations among images.
arXiv Detail & Related papers (2024-05-06T16:31:19Z)
- How to Understand Named Entities: Using Common Sense for News Captioning [34.10048889674029]
News captioning aims to describe an image with its news article body as input.
This paper exploits commonsense knowledge to understand named entities for news captioning.
arXiv Detail & Related papers (2024-03-11T08:52:52Z)
- Free-Form Composition Networks for Egocentric Action Recognition [97.02439848145359]
We propose a free-form composition network (FFCN) that can simultaneously learn disentangled verb, preposition, and noun representations.
The proposed FFCN can directly generate new training data samples for rare classes, hence significantly improving action recognition performance.
arXiv Detail & Related papers (2023-07-13T02:22:09Z)
- DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network, dubbed DOAD, to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z)
- Interpretation of Emergent Communication in Heterogeneous Collaborative
Embodied Agents [83.52684405389445]
We introduce the collaborative multi-object navigation task CoMON.
In this task, an oracle agent has detailed environment information in the form of a map.
It communicates with a navigator agent that perceives the environment visually and is tasked to find a sequence of goals.
We show that the emergent communication can be grounded to the agent observations and the spatial structure of the 3D environment.
arXiv Detail & Related papers (2021-10-12T06:56:11Z)
- Learning to Recognize Actions on Objects in Egocentric Video with
Attention Dictionaries [51.48859591280838]
We present EgoACO, a deep neural architecture for video action recognition.
It learns to pool action-context-object descriptors from frame-level features.
CAP uses self-attention with a dictionary of learnable weights to pool from the most relevant feature regions (a sketch of this pooling pattern follows after this list).
arXiv Detail & Related papers (2021-02-16T10:26:04Z)
- Coarse Temporal Attention Network (CTA-Net) for Driver's Activity
Recognition [14.07119502083967]
A driver's activities are difficult to distinguish since they are executed by the same subject with similar body-part movements, resulting in only subtle changes.
Our model is named Coarse Temporal Attention Network (CTA-Net), in which coarse temporal branches are introduced in a trainable glimpse.
The model then uses an innovative attention mechanism to generate high-level action-specific contextual information for activity recognition.
arXiv Detail & Related papers (2021-01-17T10:15:37Z)
- Co-GAT: A Co-Interactive Graph Attention Network for Joint Dialog Act
Recognition and Sentiment Classification [34.711179589196355]
In a dialog system, dialog act recognition and sentiment classification are two correlative tasks.
We propose a Co-Interactive Graph Attention Network (Co-GAT) to jointly perform the two tasks.
Experimental results on two public datasets show that our model successfully captures the two sources of information.
arXiv Detail & Related papers (2020-12-24T14:10:24Z)
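
As referenced in the EgoACO entry above, the following is a minimal sketch of attention pooling with a learnable dictionary: each dictionary entry acts as a query that attends over spatial feature regions. It illustrates the general pattern only, assuming PyTorch; the sizes, scoring function, and final aggregation are assumptions, not EgoACO's actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DictionaryAttentionPooling(nn.Module):
    """Illustrative pooling with a dictionary of learnable attention weights."""

    def __init__(self, dim=512, dict_size=8):
        super().__init__()
        # each dictionary entry learns to attend to a different region pattern
        self.dictionary = nn.Parameter(torch.randn(dict_size, dim) * dim ** -0.5)

    def forward(self, feats):
        # feats: (B, R, D) frame-level features over R spatial regions
        scores = torch.einsum('kd,brd->bkr', self.dictionary, feats)
        weights = F.softmax(scores, dim=-1)  # attend over regions per entry
        pooled = torch.einsum('bkr,brd->bkd', weights, feats)
        return pooled.mean(dim=1)  # (B, D) aggregate over dictionary entries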