Disentangled Action Recognition with Knowledge Bases
- URL: http://arxiv.org/abs/2207.01708v1
- Date: Mon, 4 Jul 2022 20:19:13 GMT
- Title: Disentangled Action Recognition with Knowledge Bases
- Authors: Zhekun Luo, Shalini Ghosh, Devin Guillory, Keizo Kato, Trevor Darrell, Huijuan Xu
- Abstract summary: We aim to improve the generalization ability of the compositional action recognition model to novel verbs or novel nouns.
Previous work utilizes verb-noun compositional action nodes in the knowledge graph, making it inefficient to scale.
We propose our approach: Disentangled Action Recognition with Knowledge-bases (DARK), which leverages the inherent compositionality of actions.
- Score: 77.77482846456478
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Actions in videos usually involve the interaction of humans with objects.
Action labels are typically composed of various combinations of verbs and
nouns, but we may not have training data for all possible combinations. In this
paper, we aim to improve the generalization ability of the compositional action
recognition model to novel verbs or novel nouns that are unseen during training
time, by leveraging the power of knowledge graphs. Previous work utilizes
verb-noun compositional action nodes in the knowledge graph, making it
inefficient to scale since the number of compositional action nodes grows
quadratically with respect to the number of verbs and nouns. To address this
issue, we propose our approach: Disentangled Action Recognition with
Knowledge-bases (DARK), which leverages the inherent compositionality of
actions. DARK trains a factorized model by first extracting disentangled
feature representations for verbs and nouns, and then predicting classification
weights using relations in external knowledge graphs. The type constraint
between verb and noun is extracted from external knowledge bases and finally
applied when composing actions. DARK has better scalability in the number of
objects and verbs, and achieves state-of-the-art performance on the Charades
dataset. We further propose a new benchmark split based on the Epic-kitchen
dataset, which is an order of magnitude larger in the number of classes and
samples, and evaluate various models on this split.
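The factorized composition described in the abstract can be illustrated with a small sketch. This is not the DARK implementation: the function name `compose_action_scores`, the toy vocabularies, the logits, and the hand-written `type_mask` are all hypothetical, and the actual model predicts classifier weights from knowledge graphs rather than taking logits as input. The sketch only shows how separate verb and noun predictions might be combined under a verb-noun type constraint, and why a factorized graph needs V + N nodes instead of V × N.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def compose_action_scores(verb_logits, noun_logits, type_mask):
    """Combine verb and noun predictions into verb-noun action scores.

    type_mask[i, j] is 1 if verb i may take noun j, else 0 (an assumed,
    hand-written stand-in for constraints mined from a knowledge base).
    """
    verb_probs = softmax(np.asarray(verb_logits, dtype=float))
    noun_probs = softmax(np.asarray(noun_logits, dtype=float))
    joint = np.outer(verb_probs, noun_probs)  # independent verb x noun scores
    joint = joint * type_mask                 # zero out implausible pairs
    total = joint.sum()
    return joint / total if total > 0 else joint


# Hypothetical toy vocabularies and constraint table.
verbs = ["open", "drink", "wash"]
nouns = ["door", "water", "cup"]
type_mask = np.array([
    [1, 0, 0],  # open:  door
    [0, 1, 0],  # drink: water
    [1, 0, 1],  # wash:  door, cup
])

scores = compose_action_scores([2.0, 1.0, 0.1], [0.5, 2.0, 0.3], type_mask)
i, j = np.unravel_index(np.argmax(scores), scores.shape)
best = (verbs[i], nouns[j])

# Graph size: one node per verb and per noun versus one node per pair.
factorized_nodes = len(verbs) + len(nouns)
compositional_nodes = len(verbs) * len(nouns)
print(best, factorized_nodes, compositional_nodes)
```

In this toy input the highest unconstrained product is "open water", since "open" is the top verb and "water" the top noun; the mask removes that pair, so the selected action becomes "drink water".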
Related papers
- Controlling Topic-Focus Articulation in Meaning-to-Text Generation using Graph Neural Networks [8.334427140256606]
We try three different methods for topic-focus articulation (TFA) employing graph neural models for a meaning-to-text generation task.
We propose a novel node-encoding strategy for graph neural models which, instead of aggregating adjacent node information as in traditional encoding, learns node representations using depth-first search.
arXiv Detail & Related papers (2023-10-03T13:51:01Z)
- Free-Form Composition Networks for Egocentric Action Recognition [97.02439848145359]
We propose a free-form composition network (FFCN) that can simultaneously learn disentangled verb, preposition, and noun representations.
The proposed FFCN can directly generate new training data samples for rare classes and hence significantly improves action recognition performance.
arXiv Detail & Related papers (2023-07-13T02:22:09Z)
- Conversational Semantic Parsing using Dynamic Context Graphs [68.72121830563906]
We consider the task of conversational semantic parsing over general purpose knowledge graphs (KGs) with millions of entities, and thousands of relation-types.
We focus on models which are capable of interactively mapping user utterances into executable logical forms.
arXiv Detail & Related papers (2023-05-04T16:04:41Z)
- Learning Action-Effect Dynamics from Pairs of Scene-graphs [50.72283841720014]
We propose a novel method that leverages scene-graph representation of images to reason about the effects of actions described in natural language.
Our proposed approach is effective in terms of performance, data efficiency, and generalization capability compared to existing models.
arXiv Detail & Related papers (2022-12-07T03:36:37Z)
- Representing Videos as Discriminative Sub-graphs for Action Recognition [165.54738402505194]
We introduce a new design of sub-graphs to represent and encode the discriminative patterns of each action in the videos.
We present the MUlti-scale Sub-graph LEarning (MUSLE) framework, which novelly builds space-time graphs and clusters the graphs into compact sub-graphs on each scale.
arXiv Detail & Related papers (2022-01-11T16:15:25Z)
- NodePiece: Compositional and Parameter-Efficient Representations of Large Knowledge Graphs [15.289356276538662]
We propose NodePiece, an anchor-based approach to learn a fixed-size entity vocabulary.
In NodePiece, a vocabulary of subword/sub-entity units is constructed from anchor nodes in a graph with known relation types.
Experiments show that NodePiece performs competitively in node classification, link prediction, and relation prediction tasks.
arXiv Detail & Related papers (2021-06-23T03:51:03Z)
- Learning Graph Embeddings for Compositional Zero-shot Learning [73.80007492964951]
In compositional zero-shot learning, the goal is to recognize unseen compositions of observed visual primitives (states and objects).
We propose a novel graph formulation called Compositional Graph Embedding (CGE) that learns image features and latent representations of visual primitives in an end-to-end manner.
By learning a joint compatibility that encodes semantics between concepts, our model allows for generalization to unseen compositions without relying on an external knowledge base like WordNet.
arXiv Detail & Related papers (2021-02-03T10:11:03Z)