Free-Form Composition Networks for Egocentric Action Recognition
- URL: http://arxiv.org/abs/2307.06527v2
- Date: Sat, 14 Oct 2023 06:22:30 GMT
- Title: Free-Form Composition Networks for Egocentric Action Recognition
- Authors: Haoran Wang, Qinghua Cheng, Baosheng Yu, Yibing Zhan, Dapeng Tao,
Liang Ding, and Haibin Ling
- Abstract summary: We propose a free-form composition network (FFCN) that can simultaneously learn disentangled verb, preposition, and noun representations.
The proposed FFCN can directly generate new training data samples for rare classes, hence significantly improving action recognition performance.
- Score: 97.02439848145359
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Egocentric action recognition is gaining significant attention in the field
of human action recognition. In this paper, we address the data scarcity issue in
egocentric action recognition from a compositional generalization perspective.
To tackle this problem, we propose a free-form composition network (FFCN) that
can simultaneously learn disentangled verb, preposition, and noun
representations, and then use them to compose new samples in the feature space
for rare classes of action videos. First, we use a graph to capture the
spatial-temporal relations among different hand/object instances in each action
video. We thus decompose each action into a set of verb and preposition
spatial-temporal representations using the edge features in the graph. The
temporal decomposition extracts verb and preposition representations from
different video frames, while the spatial decomposition adaptively learns verb
and preposition representations from action-related instances in each frame.
With these spatial-temporal representations of verbs and prepositions, we can
compose new samples for those rare classes in a free-form manner, which is not
restricted to a rigid form of a verb and a noun. The proposed FFCN can directly
generate new training data samples for rare classes, hence significantly
improving action recognition performance. We evaluated our method on three
popular egocentric action recognition datasets, Something-Something V2, H2O,
and EPIC-KITCHENS-100, and the experimental results demonstrate the
effectiveness of the proposed method for handling data scarcity problems,
including long-tailed and few-shot egocentric action recognition.
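The composition step can be pictured with a small feature-space sketch: disentangled verb, preposition, and noun features, possibly drawn from different source videos, are fused into a synthetic sample for a rare class. Everything below (the module name, dimensions, and fusion head) is an illustrative assumption rather than the authors' implementation.

```python
# Minimal sketch of free-form composition in feature space (illustrative only).
import torch
import torch.nn as nn


class FreeFormComposer(nn.Module):
    """Fuses arbitrary verb/preposition/noun features into one action feature."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Simple fusion head; the paper's actual composition module may differ.
        self.fuse = nn.Sequential(
            nn.Linear(3 * feat_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, verb_feat, prep_feat, noun_feat):
        # Each input has shape (batch, feat_dim) and may come from a different
        # source video, which is what makes the composition "free-form" rather
        # than a rigid verb-plus-noun pairing.
        return self.fuse(torch.cat([verb_feat, prep_feat, noun_feat], dim=-1))


if __name__ == "__main__":
    composer = FreeFormComposer(feat_dim=256)
    # Borrow verb/preposition features from frequent classes and a noun feature
    # from a rare class to synthesize extra training samples for that class.
    verb, prep, noun = (torch.randn(4, 256) for _ in range(3))
    synthetic = composer(verb, prep, noun)
    print(synthetic.shape)  # torch.Size([4, 256])
```

Synthetic features produced this way would then be mixed into the training batches for rare classes alongside real samples, which is how the abstract's "directly generate new training data samples for rare classes" can be read.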
Related papers
- Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction.
To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism.
Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z)
- Early Action Recognition with Action Prototypes [62.826125870298306]
We propose a novel model that learns a prototypical representation of the full action for each class.
We decompose the video into short clips, where a visual encoder extracts features from each clip independently.
Later, a decoder aggregates the features from all clips in an online fashion for the final class prediction; a rough sketch of this clip-wise pipeline follows the citation below.
arXiv Detail & Related papers (2023-12-11T18:31:13Z)
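As a rough sketch of the clip-wise pipeline summarized above, the following assumes per-clip features are already extracted; the GRU-based online decoder and all dimensions are illustrative stand-ins, not the paper's architecture.

```python
# Illustrative clip-wise encoder + online decoder (not the paper's code).
import torch
import torch.nn as nn


class OnlineClipAggregator(nn.Module):
    def __init__(self, clip_feat_dim: int = 512, num_classes: int = 10):
        super().__init__()
        self.encoder = nn.Linear(clip_feat_dim, 256)        # stand-in for a visual encoder
        self.decoder = nn.GRU(256, 256, batch_first=True)   # aggregates clips sequentially
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, clip_feats):
        # clip_feats: (batch, num_clips, clip_feat_dim); each clip is encoded independently.
        feats = self.encoder(clip_feats)
        # The recurrent decoder consumes clips one by one, so a prediction is
        # available after every observed clip, which is what enables early recognition.
        hidden, _ = self.decoder(feats)
        return self.classifier(hidden)  # (batch, num_clips, num_classes)


if __name__ == "__main__":
    model = OnlineClipAggregator()
    logits = model(torch.randn(2, 8, 512))  # 8 short clips per video
    print(logits.shape)  # torch.Size([2, 8, 10])
```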
- Learning Action-Effect Dynamics from Pairs of Scene-graphs [50.72283841720014]
We propose a novel method that leverages scene-graph representations of images to reason about the effects of actions described in natural language.
Our proposed approach is effective in terms of performance, data efficiency, and generalization capability compared to existing models.
arXiv Detail & Related papers (2022-12-07T03:36:37Z)
- Disentangled Action Recognition with Knowledge Bases [77.77482846456478]
We aim to improve the generalization ability of the compositional action recognition model to novel verbs or novel nouns.
Previous work utilizes verb-noun compositional action nodes in the knowledge graph, making it inefficient to scale.
We propose our approach: Disentangled Action Recognition with Knowledge-bases (DARK), which leverages the inherent compositionality of actions.
arXiv Detail & Related papers (2022-07-04T20:19:13Z)
- Towards Tokenized Human Dynamics Representation [41.75534387530019]
We study how to segment and cluster videos into recurring temporal patterns in a self-supervised way.
We evaluate the frame-wise representation learning step with Kendall's Tau, and the lexicon building step with normalized mutual information and language entropy; a brief illustration of these metrics follows the citation below.
On the AIST++ and PKU-MMD datasets, actons bring significant performance improvements compared to several baselines.
arXiv Detail & Related papers (2021-11-22T18:59:58Z)
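The evaluation metrics mentioned above can be illustrated with off-the-shelf implementations; the toy label sequences below are placeholders, and language entropy is omitted for brevity.

```python
# Toy illustration of the evaluation metrics (placeholder data, not the paper's).
from scipy.stats import kendalltau
from sklearn.metrics import normalized_mutual_info_score

# Frame ordering recovered from the learned representation vs. ground-truth order.
pred_order = [0, 1, 3, 2, 4]
true_order = [0, 1, 2, 3, 4]
tau, _ = kendalltau(pred_order, true_order)

# Discovered acton clusters vs. ground-truth segment labels (lexicon-building step).
cluster_ids = [0, 0, 1, 1, 2, 2]
true_labels = [0, 0, 1, 2, 2, 2]
nmi = normalized_mutual_info_score(true_labels, cluster_ids)

print(f"Kendall's Tau: {tau:.3f}, NMI: {nmi:.3f}")
```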
- Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that the fixed-size temporal kernels currently used in 3D convolutional neural networks (3D CNNs) can be improved to better deal with temporal variations in the input.
We study how to better distinguish between classes of actions by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z)
- Egocentric Action Recognition by Video Attention and Temporal Context [83.57475598382146]
We present the submission of Samsung AI Centre Cambridge to the CVPR 2020 EPIC-Kitchens Action Recognition Challenge.
In this challenge, action recognition is posed as the problem of simultaneously predicting a single verb and noun class label given an input trimmed video clip.
Our solution achieves strong performance on the challenge metrics without using object-specific reasoning or extra training data.
arXiv Detail & Related papers (2020-07-03T18:00:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.