LIGHTEN: Learning Interactions with Graph and Hierarchical TEmporal
Networks for HOI in videos
- URL: http://arxiv.org/abs/2012.09402v1
- Date: Thu, 17 Dec 2020 05:44:07 GMT
- Title: LIGHTEN: Learning Interactions with Graph and Hierarchical TEmporal
Networks for HOI in videos
- Authors: Sai Praneeth Reddy Sunkesula, Rishabh Dabral, Ganesh Ramakrishnan
- Abstract summary: Analyzing the interactions between humans and objects from a video includes identification of relationships between humans and the objects present in the video.
We present a hierarchical approach, LIGHTEN, that learns visual features to effectively capture spatio-temporal cues at multiple granularities in a video.
We achieve state-of-the-art results on the human-object interaction detection (88.9%) and anticipation (92.6%) tasks of CAD-120, and competitive results on image-based HOI detection on V-COCO.
- Score: 13.25502885135043
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Analyzing the interactions between humans and objects from a video includes
identification of the relationships between humans and the objects present in
the video. It can be thought of as a specialized version of Visual Relationship
Detection, wherein one of the objects must be a human. While traditional
methods formulate the problem as inference on a sequence of video segments, we
present a hierarchical approach, LIGHTEN, to learn visual features to
effectively capture spatio-temporal cues at multiple granularities in a video.
Unlike current approaches, LIGHTEN avoids using ground truth data like depth
maps or 3D human pose, thus increasing generalization across non-RGBD datasets
as well. Furthermore, we achieve the same using only the visual features,
instead of the commonly used hand-crafted spatial features. We achieve
state-of-the-art results on the human-object interaction detection (88.9%) and
anticipation (92.6%) tasks of CAD-120, and competitive results on image-based
HOI detection on the V-COCO dataset, setting a new benchmark for visual-feature
based approaches. Code for LIGHTEN is available at
https://github.com/praneeth11009/LIGHTEN-Learning-Interactions-with-Graphs-and-Hierarchical-TEmporal-Networks-for-HOI
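As an illustrative sketch only (not the paper's actual architecture or code), the hierarchy the abstract describes, per-frame graph reasoning over human and object nodes followed by temporal aggregation, could be caricatured in NumPy, with mean-of-neighbors message passing and a tanh residual update standing in for the learned graph network, and mean pooling standing in for the learned temporal network:

```python
import numpy as np

def frame_graph_step(nodes, adj):
    """One round of message passing over a human/object graph for
    a single frame.

    nodes: (N, D) array of per-entity visual features.
    adj:   (N, N) adjacency (1 where two entities may interact).
    """
    deg = adj.sum(axis=1, keepdims=True) + 1e-8
    messages = adj @ nodes / deg          # mean of neighbor features
    return np.tanh(nodes + messages)      # residual update

def temporal_pool(per_frame_nodes):
    """Aggregate per-frame node features over time (mean pooling
    stands in for the learned temporal network)."""
    return np.mean(per_frame_nodes, axis=0)

# Toy example: 3 frames, 1 human + 2 objects, 4-dim features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(3, 3, 4))       # (T, N, D)
adj = np.ones((3, 3)) - np.eye(3)         # fully connected, no self-loops
updated = np.stack([frame_graph_step(f, adj) for f in frames])
video_repr = temporal_pool(updated)       # (N, D) video-level node features
print(video_repr.shape)                   # (3, 4)
```

In the actual model both the per-frame update and the temporal aggregation are learned; the point of the sketch is only the two-level structure (spatial graph per frame, then pooling across frames).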
Related papers
- VrdONE: One-stage Video Visual Relation Detection [30.983521962897477]
Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos.
Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying which relations are present and another for determining their temporal boundaries.
We propose VrdONE, a streamlined yet efficacious one-stage model for VidVRD.
arXiv Detail & Related papers (2024-08-18T08:38:20Z)
- Spatial Parsing and Dynamic Temporal Pooling networks for Human-Object Interaction detection [30.896749712316222]
This paper introduces the Spatial Parsing and Dynamic Temporal Pooling (SPDTP) network, which takes the entire video as a spatio-temporal graph with human and object nodes as input.
We achieve state-of-the-art performance on the CAD-120 and Something-Else datasets.
arXiv Detail & Related papers (2022-06-07T07:26:06Z)
- HighlightMe: Detecting Highlights from Human-Centric Videos [52.84233165201391]
We present a domain- and user-preference-agnostic approach to detect highlightable excerpts from human-centric videos.
We use an autoencoder network equipped with spatial-temporal graph convolutions to detect human activities and interactions.
We observe a 4-12% improvement in the mean average precision of matching the human-annotated highlights over state-of-the-art methods.
arXiv Detail & Related papers (2021-10-05T01:18:15Z)
- Spatio-Temporal Interaction Graph Parsing Networks for Human-Object Interaction Recognition [55.7731053128204]
In a given video-based Human-Object Interaction scene, modeling the spatio-temporal relationship between humans and objects is an important cue for understanding the contextual information presented in the video.
With effective spatio-temporal relationship modeling, it is possible not only to uncover contextual information in each frame but also to directly capture inter-time dependencies.
Making full use of appearance features, spatial locations and semantic information is also key to improving video-based Human-Object Interaction recognition performance.
arXiv Detail & Related papers (2021-08-19T11:57:27Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
- ConsNet: Learning Consistency Graph for Zero-Shot Human-Object Interaction Detection [101.56529337489417]
We consider the problem of Human-Object Interaction (HOI) Detection, which aims to locate and recognize HOI instances in the form of <human, action, object> triplets in images.
We argue that multi-level consistencies among objects, actions and interactions are strong cues for generating semantic representations of rare or previously unseen HOIs.
Our model takes visual features of candidate human-object pairs and word embeddings of HOI labels as inputs, maps them into visual-semantic joint embedding space and obtains detection results by measuring their similarities.
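As a hedged sketch of the joint-embedding idea above (the random projection matrices stand in for ConsNet's learned mappings; none of the names below come from the paper), scoring a candidate human-object pair against HOI label embeddings by cosine similarity might look like:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

rng = np.random.default_rng(1)
# Hypothetical projections into a shared 8-dim joint space; in the
# paper these would be learned networks, here just random linear maps.
W_vis = rng.normal(size=(8, 16))       # visual features -> joint space
W_txt = rng.normal(size=(8, 16))       # word embeddings -> joint space

visual_feat = rng.normal(size=16)      # feature of a candidate human-object pair
label_embs = rng.normal(size=(3, 16))  # word embeddings of 3 HOI labels

v = W_vis @ visual_feat
scores = [cosine(v, W_txt @ t) for t in label_embs]
best = int(np.argmax(scores))          # index of the predicted HOI label
```

The detection result is simply the label whose projected embedding lies closest to the projected visual feature, which is what lets the approach generalize to labels never seen at training time.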
arXiv Detail & Related papers (2020-08-14T09:11:18Z)
- A Graph-based Interactive Reasoning for Human-Object Interaction Detection [71.50535113279551]
We present a novel graph-based interactive reasoning model called Interactive Graph (abbr. in-Graph) to infer HOIs.
We construct a new framework to assemble in-Graph models for detecting HOIs, namely in-GraphNet.
Our framework is end-to-end trainable and free from costly annotations like human pose.
arXiv Detail & Related papers (2020-07-14T09:29:03Z)
- Learning Human-Object Interaction Detection using Interaction Points [140.0200950601552]
We propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs.
Our network predicts interaction points, which directly localize and classify the interaction.
Experiments are performed on two popular benchmarks: V-COCO and HICO-DET.
arXiv Detail & Related papers (2020-03-31T08:42:06Z)
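The interaction-point idea from the last entry can be caricatured geometrically: absent the paper's learned heatmaps, the simplest stand-in for an interaction point is the midpoint between the human and object box centers. This is purely an illustration of the concept, not the network's actual prediction:

```python
def box_center(box):
    """Center of an axis-aligned box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def interaction_point(human_box, object_box):
    """Midpoint between the human and object centers -- a geometric
    stand-in for a learned interaction-point heatmap."""
    hx, hy = box_center(human_box)
    ox, oy = box_center(object_box)
    return ((hx + ox) / 2, (hy + oy) / 2)

# Human at (0,0)-(2,2), object at (4,0)-(6,2):
print(interaction_point((0, 0, 2, 2), (4, 0, 6, 2)))  # (3.0, 1.0)
```

In the actual method the network predicts such points directly and pairs them with detected boxes; the appeal is that a single point both localizes and classifies the interaction.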
This list is automatically generated from the titles and abstracts of the papers in this site.