Modelling Spatio-Temporal Interactions for Compositional Action
Recognition
- URL: http://arxiv.org/abs/2305.02673v1
- Date: Thu, 4 May 2023 09:37:45 GMT
- Title: Modelling Spatio-Temporal Interactions for Compositional Action
Recognition
- Authors: Ramanathan Rajendiran, Debaditya Roy, Basura Fernando
- Abstract summary: Humans have the natural ability to recognize actions even if the objects involved in the action or the background are changed.
We show the effectiveness of our interaction-centric approach on the compositional Something-Else dataset.
Our approach of explicit human-object-stuff interaction modeling is effective even for standard action recognition datasets.
- Score: 21.8767024220287
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans have the natural ability to recognize actions even if the objects
involved in the action or the background are changed. Humans can abstract the
action away from the appearance of the objects and their context, an ability
referred to as the compositionality of actions. Compositional action recognition
deals with imparting human-like compositional generalization abilities to
action-recognition models. In this regard, extracting the interactions between
humans and objects forms the basis of compositional understanding. These
interactions are not affected by the appearance biases of the objects or the
context. However, the context provides additional cues about the interactions
between things and stuff, so we need to infuse context into the human-object
interactions for compositional action recognition. To this end, we first design
a spatio-temporal interaction encoder that captures the human-object (things)
interactions. The encoder learns spatio-temporal interaction tokens
disentangled from the background context. The interaction tokens are then
infused with contextual information from the video tokens to model the
interactions between things and stuff. The final context-infused
spatio-temporal interaction tokens are used for compositional action
recognition. We show the effectiveness of our interaction-centric approach on
the compositional Something-Else dataset, where we obtain a new state-of-the-art
result of 83.8% top-1 accuracy, outperforming recent prominent object-centric
methods by a significant margin. Our approach of explicit human-object-stuff
interaction modelling is effective even on standard action recognition datasets
such as Something-Something-V2 and Epic-Kitchens-100, where we obtain
performance comparable to or better than the state-of-the-art.
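The abstract describes a two-step design: an encoder produces human-object interaction tokens disentangled from the background, and those tokens are then infused with context from video tokens. The sketch below is a minimal, hypothetical rendering of that idea in PyTorch, assuming context infusion is a cross-attention step where interaction tokens attend to video patch tokens. The module names, dimensions, and the use of stock transformer layers are illustrative assumptions, not the authors' published architecture (174 classes matches the Something-Something label set).

```python
# Minimal sketch, not the paper's implementation: interaction tokens
# (queries) gather "stuff" context from video tokens (keys/values).
import torch
import torch.nn as nn


class ContextInfusedInteractionHead(nn.Module):
    def __init__(self, dim=256, num_heads=4, num_classes=174):
        super().__init__()
        # Stand-in interaction encoder: a small transformer over
        # per-frame human/object box features, producing
        # spatio-temporal interaction tokens.
        self.interaction_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                       batch_first=True),
            num_layers=2,
        )
        # Assumed context infusion: cross-attention from interaction
        # tokens to video tokens, modeling things-stuff interactions.
        self.context_infusion = nn.MultiheadAttention(
            embed_dim=dim, num_heads=num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, box_tokens, video_tokens):
        # box_tokens:   (B, T*N, dim) human/object features, context-free
        # video_tokens: (B, L, dim)   patch tokens from a video backbone
        inter = self.interaction_encoder(box_tokens)
        ctx, _ = self.context_infusion(query=inter, key=video_tokens,
                                       value=video_tokens)
        fused = self.norm(inter + ctx)             # context-infused tokens
        return self.classifier(fused.mean(dim=1))  # pooled action logits


# Toy usage: 2 videos, 8 frames x 4 boxes, 196 video patch tokens.
logits = ContextInfusedInteractionHead()(
    torch.randn(2, 32, 256), torch.randn(2, 196, 256))
print(logits.shape)  # torch.Size([2, 174])
```

One design note on this reading of the abstract: using interaction tokens only as queries keeps them the primary representation while background context contributes solely through keys and values, mirroring the stated goal of infusing context into the interactions rather than mixing the two.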
Related papers
- Visual-Geometric Collaborative Guidance for Affordance Learning [63.038406948791454]
We propose a visual-geometric collaboratively guided affordance learning network that incorporates visual and geometric cues.
Our method outperforms representative models in both objective metrics and visual quality.
arXiv Detail & Related papers (2024-10-15T07:35:51Z) - LEMON: Learning 3D Human-Object Interaction Relation from 2D Images [56.6123961391372]
Learning 3D human-object interaction relation is pivotal to embodied AI and interaction modeling.
Most existing methods approach the goal by learning to predict isolated interaction elements.
We present LEMON, a unified model that mines interaction intentions of the counterparts and employs curvatures to guide the extraction of geometric correlations.
arXiv Detail & Related papers (2023-12-14T14:10:57Z) - Disentangled Interaction Representation for One-Stage Human-Object
Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z) - InterTracker: Discovering and Tracking General Objects Interacting with
Hands in the Wild [40.489171608114574]
Existing methods rely on frame-based detectors to locate interacting objects.
We propose to leverage hand-object interaction to track interactive objects.
Our proposed method outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-08-06T09:09:17Z) - Full-Body Articulated Human-Object Interaction [61.01135739641217]
CHAIRS is a large-scale motion-captured f-AHOI dataset consisting of 16.2 hours of versatile interactions.
CHAIRS provides 3D meshes of both humans and articulated objects during the entire interactive process.
By learning the geometrical relationships in HOI, we devise the first model that leverages human pose estimation.
arXiv Detail & Related papers (2022-12-20T19:50:54Z) - Compositional Human-Scene Interaction Synthesis with Semantic Control [16.93177243590465]
We aim to synthesize humans interacting with a given 3D scene controlled by high-level semantic specifications.
We design a novel transformer-based generative model, in which the articulated 3D human body surface points and 3D objects are jointly encoded.
Inspired by the compositional nature of interactions, namely that humans can simultaneously interact with multiple objects, we define interaction semantics as the composition of varying numbers of atomic action-object pairs.
arXiv Detail & Related papers (2022-07-26T11:37:44Z) - Skeleton-Based Mutually Assisted Interacted Object Localization and
Human Action Recognition [111.87412719773889]
We propose a joint learning framework for "interacted object localization" and "human action recognition" based on skeleton data.
Our method achieves the best or competitive performance compared with state-of-the-art methods for human action recognition.
arXiv Detail & Related papers (2021-10-28T10:09:34Z) - Human Interaction Recognition Framework based on Interacting Body Part
Attention [24.913372626903648]
We propose a novel framework that simultaneously considers both implicit and explicit representations of human interactions.
The proposed method captures the subtle difference between different interactions using interacting body part attention.
We validate the effectiveness of the proposed method using four widely used public datasets.
arXiv Detail & Related papers (2021-01-22T06:52:42Z) - Human and Machine Action Prediction Independent of Object Information [1.0806206850043696]
We study the role of inter-object relations that change during an action.
Our approach predicts actions after observing, on average, less than 64% of the action's duration.
arXiv Detail & Related papers (2020-04-22T12:13:25Z) - Learning Human-Object Interaction Detection using Interaction Points [140.0200950601552]
We propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs.
Our network predicts interaction points, which directly localize and classify the interaction.
Experiments are performed on two popular benchmarks: V-COCO and HICO-DET.
arXiv Detail & Related papers (2020-03-31T08:42:06Z)