Learning Asynchronous and Sparse Human-Object Interaction in Videos
- URL: http://arxiv.org/abs/2103.02758v1
- Date: Wed, 3 Mar 2021 23:43:55 GMT
- Title: Learning Asynchronous and Sparse Human-Object Interaction in Videos
- Authors: Romero Morais, Vuong Le, Svetha Venkatesh, Truyen Tran
- Abstract summary: Asynchronous-Sparse Interaction Graph Networks (ASSIGN) is able to automatically detect the structure of interaction events associated with entities in a video scene.
ASSIGN is tested on human-object interaction recognition and shows superior performance in segmenting and labeling of human sub-activities and object affordances from raw videos.
- Score: 56.73059840294019
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human activities can be learned from video. With effective modeling it is
possible to discover not only the action labels but also the temporal
structures of the activities such as the progression of the sub-activities.
Automatically recognizing such structure from raw video signal is a new
capability that promises authentic modeling and successful recognition of
human-object interactions. Toward this goal, we introduce Asynchronous-Sparse
Interaction Graph Networks (ASSIGN), a recurrent graph network that is able to
automatically detect the structure of interaction events associated with
entities in a video scene. ASSIGN pioneers the learning of autonomous behavior
of video entities, including their dynamic structure and their interactions
with coexisting neighbors. Entities' lives in our model are asynchronous to
those of others, and are therefore more flexible in adapting to complex
scenarios. Their interactions are sparse in time, and hence more faithful to
the true underlying nature and more robust in inference and learning. ASSIGN is
tested on
human-object interaction recognition and shows superior performance in
segmenting and labeling of human sub-activities and object affordances from raw
videos. The model's native ability to discover temporal structures also
eliminates the dependence on external segmentation that was previously
mandatory for this task.
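To make the mechanism in the abstract concrete, below is a minimal, hypothetical sketch of the asynchronous, sparse update pattern it describes; the class, gate layers, and hard 0.5 thresholds are illustrative assumptions, not ASSIGN's actual implementation.

```python
import torch
import torch.nn as nn

class AsyncSparseEntity(nn.Module):
    """One video entity with its own recurrent state that updates only
    when a learned gate signals an event boundary (hypothetical sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.cell = nn.GRUCell(dim, dim)
        self.event_gate = nn.Linear(2 * dim, 1)     # fires at event boundaries
        self.interact_gate = nn.Linear(2 * dim, 1)  # admits neighbor messages

    def forward(self, state, obs, neighbor_states):
        # Sparse interaction: keep only the messages whose gate opens now.
        msgs = [n for n in neighbor_states
                if torch.sigmoid(self.interact_gate(
                    torch.cat([state, n], dim=-1))) > 0.5]
        msg = torch.stack(msgs).mean(0) if msgs else torch.zeros_like(state)
        # Asynchronous life: update only at a detected event boundary;
        # otherwise the old state is carried forward unchanged.
        if torch.sigmoid(self.event_gate(torch.cat([state, obs], dim=-1))) > 0.5:
            state = self.cell(obs + msg, state)
        return state
```

The hard thresholds stand in for whatever differentiable gating the model actually learns; the point is only that an entity's state persists unchanged between its own events, and that neighbor messages enter only when an interaction gate opens.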
Related papers
- Visual-Geometric Collaborative Guidance for Affordance Learning [63.038406948791454]
We propose a visual-geometric collaborative guided affordance learning network that incorporates visual and geometric cues.
Our method outperforms representative models in both objective metrics and visual quality.
arXiv Detail & Related papers (2024-10-15T07:35:51Z)
- Spatial Parsing and Dynamic Temporal Pooling networks for Human-Object Interaction detection [30.896749712316222]
This paper introduces the Spatial Parsing and Dynamic Temporal Pooling (SPDTP) network, which takes the entire video as a spatio-temporal graph with human and object nodes as input.
We achieve state-of-the-art performance on the CAD-120 and Something-Else datasets.
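As a rough illustration of "the entire video as a spatio-temporal graph with human and object nodes," here is a minimal construction sketch; the detection input format and the two edge types are assumptions for illustration, not SPDTP's interface.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    frame: int
    kind: str          # "human" or "object"
    feature: list = field(default_factory=list)  # appearance feature placeholder

def build_video_graph(detections):
    """detections: per-frame lists of (kind, feature) tuples (assumed format)."""
    nodes, edges = [], []
    for t, frame_dets in enumerate(detections):
        frame_nodes = [Node(t, kind, feat) for kind, feat in frame_dets]
        # Spatial edges: human-object pairs within the same frame.
        for i, a in enumerate(frame_nodes):
            for b in frame_nodes[i + 1:]:
                if {a.kind, b.kind} == {"human", "object"}:
                    edges.append((a, b, "spatial"))
        # Temporal edges: same-kind nodes across consecutive frames.
        prev = [n for n in nodes if n.frame == t - 1]
        for a in prev:
            for b in frame_nodes:
                if a.kind == b.kind:
                    edges.append((a, b, "temporal"))
        nodes.extend(frame_nodes)
    return nodes, edges
```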
arXiv Detail & Related papers (2022-06-07T07:26:06Z)
- Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions [81.88294320397826]
In this weakly supervised setting, a system does not know which human-object interactions are present in a video, nor the actual locations of the human and the object.
We introduce a dataset comprising over 6.5k videos containing human-object interactions, curated from sentence captions.
We demonstrate improved performance over weakly supervised baselines adapted to our annotations on our video dataset.
arXiv Detail & Related papers (2021-10-07T15:30:18Z)
- Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that the current fixed-sized spatio-temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We also study how to better handle variations between classes of actions by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
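A hedged sketch of the general idea in this entry: replacing a single fixed-size temporal kernel with parallel branches of different temporal extents, so the block can respond to both fast and slow motions. The block layout and sizes are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MultiTemporalKernelBlock(nn.Module):
    """Applies several 3D convolutions with different temporal sizes
    and fuses them (illustrative sketch, odd sizes preserve T, H, W)."""
    def __init__(self, channels, temporal_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv3d(channels, channels, kernel_size=(t, 3, 3),
                      padding=(t // 2, 1, 1))
            for t in temporal_sizes
        )
        self.fuse = nn.Conv3d(channels * len(temporal_sizes), channels, 1)

    def forward(self, x):  # x: (batch, C, T, H, W)
        out = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.fuse(out)
```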
arXiv Detail & Related papers (2021-10-05T15:39:11Z)
- Spatio-Temporal Interaction Graph Parsing Networks for Human-Object Interaction Recognition [55.7731053128204]
In a given video-based Human-Object Interaction scene, modeling the spatio-temporal relationship between humans and objects is an important cue for understanding the contextual information presented in the video.
With effective spatio-temporal relationship modeling, it is possible not only to uncover contextual information in each frame but also to directly capture inter-time dependencies.
Full use of appearance features, spatial locations, and semantic information is also key to improving video-based Human-Object Interaction recognition performance.
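For intuition, a generic message-passing step over such a spatio-temporal graph might look like the sketch below; this is a standard GNN-style update, not this paper's parsing network.

```python
import torch
import torch.nn as nn

class STGraphLayer(nn.Module):
    """One round of message passing: each node aggregates its neighbors'
    features over spatial (same-frame) and temporal (cross-frame) edges."""
    def __init__(self, dim):
        super().__init__()
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, feats, adjacency):
        # feats: (num_nodes, dim); adjacency: (num_nodes, num_nodes) 0/1 mask.
        deg = adjacency.sum(-1, keepdim=True).clamp(min=1)
        neighbor_mean = adjacency @ feats / deg
        return torch.relu(self.update(torch.cat([feats, neighbor_mean], -1)))
```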
arXiv Detail & Related papers (2021-08-19T11:57:27Z)
- Human-like Relational Models for Activity Recognition in Video [8.87742125296885]
Video activity recognition by deep neural networks is impressive for many classes, yet such networks can struggle to learn critical relationships effectively.
We propose a more human-like approach to activity recognition, which interprets a video in sequential temporal phases.
We apply the method to a challenging subset of the Something-Something dataset and achieve more robust performance than neural network baselines on challenging activities.
arXiv Detail & Related papers (2021-07-12T11:13:17Z)
- Coarse Temporal Attention Network (CTA-Net) for Driver's Activity Recognition [14.07119502083967]
Driver's activities are difficult to distinguish since they are executed by the same subject with similar body-part movements, resulting in only subtle changes.
Our model is named Coarse Temporal Attention Network (CTA-Net), in which coarse temporal branches are introduced in a trainable glimpse.
The model then uses an innovative attention mechanism to generate high-level action specific contextual information for activity recognition.
arXiv Detail & Related papers (2021-01-17T10:15:37Z)
- Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
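A minimal sketch of this coarse-to-fine cascade, assuming the per-stage localization and recognition networks are given as callables; the function and its signatures are hypothetical.

```python
def cascaded_hoi(image_feats, proposals, localizers, recognizers):
    """localizers/recognizers: one (localize, recognize) pair per stage.
    Each stage refines the previous stage's HOI proposals and then
    re-scores the interactions on the refined boxes."""
    scores = None
    for localize, recognize in zip(localizers, recognizers):
        proposals = localize(image_feats, proposals)  # progressively refined
        scores = recognize(image_feats, proposals)    # interaction labels
    return proposals, scores
```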
arXiv Detail & Related papers (2020-03-09T17:05:04Z)