Knowledge Integration Networks for Action Recognition
- URL: http://arxiv.org/abs/2002.07471v1
- Date: Tue, 18 Feb 2020 10:20:30 GMT
- Title: Knowledge Integration Networks for Action Recognition
- Authors: Shiwen Zhang and Sheng Guo and Limin Wang and Weilin Huang and Matthew
R. Scott
- Abstract summary: We design a three-branch architecture consisting of a main branch for action recognition, and two auxiliary branches for human parsing and scene recognition.
We propose a two-level knowledge encoding mechanism which contains a Cross Branch Integration (CBI) module for encoding the auxiliary knowledge into medium-level convolutional features, and an Action Knowledge Graph (AKG) for effectively fusing high-level context information.
The proposed KINet achieves state-of-the-art performance on the large-scale action recognition benchmark Kinetics-400, with a top-1 accuracy of 77.8%.
- Score: 58.548331848942865
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we propose Knowledge Integration Networks (referred to as KINet)
for video action recognition. KINet is capable of aggregating meaningful
context features that are of great importance for identifying an action, such
as human information and scene context. We design a three-branch architecture
consisting of a main branch for action recognition, and two auxiliary branches
for human parsing and scene recognition, which allow the model to encode the
knowledge of human and scene for action recognition. We explore two pre-trained
models as teacher networks to distill the knowledge of human and scene for
training the auxiliary tasks of KINet. Furthermore, we propose a two-level
knowledge encoding mechanism which contains a Cross Branch Integration (CBI)
module for encoding the auxiliary knowledge into medium-level convolutional
features, and an Action Knowledge Graph (AKG) for effectively fusing high-level
context information. This results in an end-to-end trainable framework where
the three tasks can be trained collaboratively, allowing the model to compute
strong context knowledge efficiently. The proposed KINet achieves
state-of-the-art performance on the large-scale action recognition benchmark
Kinetics-400, with a top-1 accuracy of 77.8%. We further demonstrate that our
KINet has strong transfer capability by applying the Kinetics-trained model to
UCF-101, where it obtains 97.8% top-1 accuracy.
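To make the described architecture more concrete, here is a minimal PyTorch-style sketch of the three-branch design with a Cross Branch Integration (CBI) module and an Action Knowledge Graph (AKG) fusion step. The module names come from the abstract, but every detail below (the backbone, the gated residual fusion in CBI, the learned 3-node adjacency in AKG, and all layer sizes) is an illustrative assumption, not the authors' implementation.
```python
# Hypothetical sketch of the KINet design described in the abstract.
# Only the branch/CBI/AKG structure follows the text; all layers, shapes,
# and fusion operators below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CBI(nn.Module):
    """Cross Branch Integration (sketch): inject auxiliary knowledge
    (human parsing / scene) into medium-level features of the main branch."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, main_feat, aux_feat):
        # Gated residual fusion: an assumed, simple way of encoding the
        # auxiliary knowledge into the main branch's convolutional features.
        return main_feat + self.gate(aux_feat) * aux_feat


class AKG(nn.Module):
    """Action Knowledge Graph (sketch): one round of message passing over
    the three high-level context nodes (action, human, scene)."""
    def __init__(self, dim, num_nodes=3):
        super().__init__()
        self.adj = nn.Parameter(torch.eye(num_nodes))  # learned node relations
        self.proj = nn.Linear(dim, dim)

    def forward(self, nodes):                   # nodes: (B, 3, dim)
        adj = torch.softmax(self.adj, dim=-1)
        fused = adj @ self.proj(nodes)          # fuse high-level context information
        return fused.mean(dim=1)


class KINetSketch(nn.Module):
    """Three-branch network: main action branch plus human-parsing and
    scene-recognition auxiliary branches (teacher supervision not shown)."""
    def __init__(self, dim=256, num_classes=400):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, dim, 7, stride=2, padding=3), nn.ReLU())
        self.action_branch = nn.Conv2d(dim, dim, 3, padding=1)
        self.human_branch = nn.Conv2d(dim, dim, 3, padding=1)   # distilled from a human-parsing teacher
        self.scene_branch = nn.Conv2d(dim, dim, 3, padding=1)   # distilled from a scene-recognition teacher
        self.cbi_human, self.cbi_scene = CBI(dim), CBI(dim)
        self.akg = AKG(dim)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, x):                       # x: (B, 3, H, W), a single frame for brevity
        shared = self.stem(x)
        a = self.action_branch(shared)
        h = self.human_branch(shared)
        s = self.scene_branch(shared)
        a = self.cbi_scene(self.cbi_human(a, h), s)             # level 1: CBI
        pool = lambda t: F.adaptive_avg_pool2d(t, 1).flatten(1)
        context = self.akg(torch.stack([pool(a), pool(h), pool(s)], dim=1))  # level 2: AKG
        return self.fc(context)


if __name__ == "__main__":
    print(KINetSketch()(torch.randn(2, 3, 112, 112)).shape)     # torch.Size([2, 400])
```
In the paper, the two teacher networks supervise the human and scene branches and all three tasks are trained jointly end to end; that supervision and the temporal (video) dimension are omitted here for brevity.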
Related papers
- Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability to localize the active objects by learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z)
- A Hierarchical Graph-based Approach for Recognition and Description Generation of Bimanual Actions in Videos [3.7486111821201287]
This study describes a novel method integrating graph-based modeling with layered hierarchical attention mechanisms.
The complexity of our approach is empirically tested using several 2D and 3D datasets.
arXiv Detail & Related papers (2023-10-01T13:45:48Z)
- Conditioning Covert Geo-Location (CGL) Detection on Semantic Class Information [5.660207256468971]
The task of identifying potential hideouts, termed Covert Geo-Location (CGL) detection, was proposed by Saha et al.
No attempts were made to utilize semantic class information, which is crucial for detecting such obscured locations.
In this paper, we propose a multitask-learning-based approach to achieve two goals: i) extraction of features carrying semantic class information; ii) robust training of the common encoder, exploiting large standard annotated datasets as the training set for the auxiliary task (semantic segmentation).
arXiv Detail & Related papers (2022-11-27T07:21:59Z)
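To make the multitask setup above concrete, here is a minimal PyTorch sketch of a common encoder shared by a CGL-detection head and an auxiliary semantic-segmentation head. The encoder layers, head shapes, and loss weighting are assumptions for illustration; the paper's actual architecture is not given in this summary.
```python
# Hypothetical multitask sketch: shared encoder, primary CGL head,
# auxiliary semantic-segmentation head (all layer sizes are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedEncoderMultiTask(nn.Module):
    def __init__(self, num_seg_classes=21):
        super().__init__()
        self.encoder = nn.Sequential(                       # the "common encoder"
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.cgl_head = nn.Conv2d(128, 1, 1)                # per-pixel hideout score
        self.seg_head = nn.Conv2d(128, num_seg_classes, 1)  # auxiliary segmentation

    def forward(self, x):
        feats = self.encoder(x)
        return self.cgl_head(feats), self.seg_head(feats)


def joint_loss(cgl_logits, cgl_target, seg_logits, seg_target, aux_weight=0.5):
    # Weighted sum of the two task losses; the 0.5 weight is an assumption.
    cgl = F.binary_cross_entropy_with_logits(cgl_logits, cgl_target)
    seg = F.cross_entropy(seg_logits, seg_target)
    return cgl + aux_weight * seg
```
The auxiliary segmentation loss can be driven by large standard annotated datasets, which is what makes the shared encoder's training robust.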
- Impact of a DCT-driven Loss in Attention-based Knowledge-Distillation for Scene Recognition [64.29650787243443]
We propose and analyse the use of a 2D frequency transform of the activation maps before transferring them.
This strategy enhances knowledge transferability in tasks such as scene recognition.
We publicly release the training and evaluation framework used along this paper at http://www.vpu.eps.uam.es/publications/DCTBasedKDForSceneRecognition.
arXiv Detail & Related papers (2022-05-04T11:05:18Z)
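The core idea summarized above is to compare teacher and student activation maps in the 2D frequency domain rather than directly in pixel space. The sketch below builds an explicit DCT-II basis and uses a plain MSE on the transformed, channel-pooled activation maps; this particular formulation is an assumption for illustration and is not taken from the framework released at the link above.
```python
# Hypothetical sketch: knowledge-distillation loss on DCT-transformed activation maps.
import math
import torch
import torch.nn.functional as F


def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = torch.arange(n, dtype=torch.float32)
    basis = torch.cos(math.pi / n * (k[None, :] + 0.5) * k[:, None])
    basis[0] *= 1.0 / math.sqrt(2.0)
    return basis * math.sqrt(2.0 / n)


def dct_2d(x):
    """Apply the 2D DCT over the last two (spatial) dimensions of x."""
    d_h = dct_matrix(x.shape[-2]).to(x)
    d_w = dct_matrix(x.shape[-1]).to(x)
    return d_h @ x @ d_w.t()


def dct_kd_loss(student_feat, teacher_feat):
    """MSE between channel-pooled activation maps of student and teacher,
    compared in the frequency domain (a sketch, not the paper's exact loss)."""
    s = dct_2d(student_feat.mean(dim=1))    # (B, C, H, W) -> (B, H, W) -> DCT coefficients
    t = dct_2d(teacher_feat.mean(dim=1))
    return F.mse_loss(s, t)
```
During distillation, a term like dct_kd_loss(student_layer_out, teacher_layer_out) would simply be added to the usual task loss.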
- Joint-bone Fusion Graph Convolutional Network for Semi-supervised Skeleton Action Recognition [65.78703941973183]
We propose a novel correlation-driven joint-bone fusion graph convolutional network (CD-JBF-GCN) as an encoder and use a pose prediction head as a decoder.
Specifically, the CD-JBF-GCN can explore the motion transmission between the joint stream and the bone stream.
The pose-prediction-based auto-encoder in the self-supervised training stage allows the network to learn motion representations from unlabeled data.
arXiv Detail & Related papers (2022-02-08T16:03:15Z)
- Knowledge Graph Augmented Network Towards Multiview Representation Learning for Aspect-based Sentiment Analysis [96.53859361560505]
We propose a knowledge graph augmented network (KGAN) to incorporate external knowledge together with explicit syntactic and contextual information.
KGAN captures the sentiment feature representations from multiple perspectives, i.e., context-, syntax- and knowledge-based.
Experiments on three popular ABSA benchmarks demonstrate the effectiveness and robustness of our KGAN.
arXiv Detail & Related papers (2022-01-13T08:25:53Z)
- Joint Learning On The Hierarchy Representation for Fine-Grained Human Action Recognition [13.088129408377918]
Fine-grained human action recognition is a core research topic in computer vision.
We propose a novel multi-task network which exploits the FineGym hierarchy representation to achieve effective joint learning and prediction.
Our results on the FineGym dataset achieve a new state-of-the-art performance, with 91.80% Top-1 accuracy and 88.46% mean accuracy for element actions.
arXiv Detail & Related papers (2021-10-12T09:37:51Z)
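As a rough illustration of a multi-task head that exploits a FineGym-style hierarchy, the sketch below lets a coarse event prediction condition the fine-grained element prediction. The conditioning mechanism, feature dimension, and class counts are assumptions, not the architecture reported in the paper.
```python
# Hypothetical hierarchy-aware multi-task head (all sizes are illustrative only).
import torch
import torch.nn as nn


class HierarchyHeads(nn.Module):
    def __init__(self, feat_dim=512, num_events=10, num_elements=99):
        super().__init__()
        self.event_head = nn.Linear(feat_dim, num_events)
        # The fine-grained head also sees the coarse (event) prediction.
        self.element_head = nn.Linear(feat_dim + num_events, num_elements)

    def forward(self, clip_feat):               # clip_feat: (B, feat_dim) from a video backbone
        event_logits = self.event_head(clip_feat)
        element_logits = self.element_head(
            torch.cat([clip_feat, event_logits.softmax(dim=-1)], dim=-1))
        return event_logits, element_logits     # trained with a joint cross-entropy loss
```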
- Hierarchical Self-supervised Augmented Knowledge Distillation [1.9355744690301404]
We propose an alternative self-supervised augmented task to guide the network to learn the joint distribution of the original recognition task and self-supervised auxiliary task.
This is demonstrated to provide richer knowledge that improves representation power without losing the normal classification capability.
Our method significantly surpasses the previous SOTA SSKD with an average improvement of 2.56% on CIFAR-100 and an improvement of 0.77% on ImageNet.
arXiv Detail & Related papers (2021-07-29T02:57:21Z)
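A common way to realize a "joint distribution of the original task and a self-supervised auxiliary task" is a single classifier over the product label space, e.g. class x rotation. The sketch below shows that generic construction only; it omits the hierarchical, feature-level distillation that the paper's title implies, and the class and rotation counts are assumptions.
```python
# Hypothetical sketch of a self-supervised augmented (joint) classification task.
import torch
import torch.nn as nn

NUM_CLASSES, NUM_ROTATIONS = 100, 4             # e.g. CIFAR-100 with 4 rotations (assumed)


class JointHead(nn.Module):
    """One linear head over the joint label space (class, rotation)."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.fc = nn.Linear(feat_dim, NUM_CLASSES * NUM_ROTATIONS)

    def forward(self, feats):
        return self.fc(feats)                   # logits over C x M joint labels


def rotate_batch(x):
    """Stack the batch under 0/90/180/270-degree rotations, with rotation ids."""
    rotated = torch.cat([torch.rot90(x, k, dims=(-2, -1)) for k in range(NUM_ROTATIONS)], dim=0)
    rot_ids = torch.arange(NUM_ROTATIONS).repeat_interleave(x.size(0))
    return rotated, rot_ids


def joint_targets(labels, rot_ids):
    # Map (class y, rotation r) to the single joint index y * M + r.
    return labels.repeat(NUM_ROTATIONS) * NUM_ROTATIONS + rot_ids
```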
- All About Knowledge Graphs for Actions [82.39684757372075]
We work toward a better understanding of how knowledge graphs (KGs) can be utilized for zero-shot and few-shot action recognition.
We study three different construction mechanisms for KGs: action embeddings, action-object embeddings, visual embeddings.
We present extensive analysis of the impact of different KGs on different experimental setups.
arXiv Detail & Related papers (2020-08-28T01:44:01Z)
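To make the idea of a constructed knowledge graph concrete, the sketch below builds a k-NN graph over action/object embeddings and uses one graph-propagation step to generate classifier weights, which is a common way KGs are exploited for zero-shot action recognition. The k-NN construction, the single propagation step, and the embedding dimensions are generic assumptions, not the specific mechanisms studied in the paper.
```python
# Hypothetical sketch: build a knowledge graph from embeddings and use it
# to generate classifiers for unseen actions (all choices are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F


def build_action_kg(node_embeddings, k=5):
    """k-NN adjacency from cosine similarity of node (action / object) embeddings."""
    normed = F.normalize(node_embeddings, dim=-1)
    sim = normed @ normed.t()
    topk = sim.topk(k, dim=-1).indices
    adj = torch.zeros_like(sim).scatter_(1, topk, 1.0)
    adj = (adj + adj.t()).clamp(max=1.0)                # symmetrize
    deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
    return adj / deg                                     # row-normalized adjacency


class KGClassifierGenerator(nn.Module):
    """One graph-propagation step that turns node embeddings into classifier
    weights, so unseen (zero-shot) actions inherit weights from neighbours."""
    def __init__(self, emb_dim=300, feat_dim=512):
        super().__init__()
        self.proj = nn.Linear(emb_dim, feat_dim)

    def forward(self, adj, node_embeddings, video_feats):
        weights = adj @ self.proj(node_embeddings)       # (num_nodes, feat_dim)
        return video_feats @ weights.t()                 # per-action logits
```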