VPN: Learning Video-Pose Embedding for Activities of Daily Living
- URL: http://arxiv.org/abs/2007.03056v1
- Date: Mon, 6 Jul 2020 20:39:08 GMT
- Title: VPN: Learning Video-Pose Embedding for Activities of Daily Living
- Authors: Srijan Das, Saurav Sharma, Rui Dai, Francois Bremond, Monique Thonnat
- Abstract summary: Recent spatio-temporal 3D ConvNets are too rigid to capture subtle visual patterns across an action.
We propose a novel Video-Pose Network: VPN.
Experiments show that VPN outperforms the state-of-the-art results for action classification on a large scale human activity dataset.
- Score: 6.719751155411075
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we focus on the spatio-temporal aspect of recognizing
Activities of Daily Living (ADL). ADL have two specific properties (i) subtle
spatio-temporal patterns and (ii) similar visual patterns varying with time.
Therefore, ADL may look very similar and often require looking at their
fine-grained details to distinguish them. Because the recent spatio-temporal 3D
ConvNets are too rigid to capture the subtle visual patterns across an action,
we propose a novel Video-Pose Network: VPN. The 2 key components of this VPN
are a spatial embedding and an attention network. The spatial embedding
projects the 3D poses and RGB cues in a common semantic space. This enables the
action recognition framework to learn better spatio-temporal features
exploiting both modalities. In order to discriminate similar actions, the
attention network provides two functionalities - (i) an end-to-end learnable
pose backbone exploiting the topology of human body, and (ii) a coupler to
provide joint spatio-temporal attention weights across a video. Experiments
show that VPN outperforms the state-of-the-art results for action
classification on a large scale human activity dataset: NTU-RGB+D 120, its
subset NTU-RGB+D 60, a real-world challenging human activity dataset: Toyota
Smarthome and a small scale human-object interaction dataset Northwestern UCLA.
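To make the two components concrete, here is a minimal PyTorch-style sketch of a spatial embedding and a pose-driven attention coupler. The layer sizes, the mean-pooled pose summary, and the dot-product coupler are illustrative assumptions, not the paper's exact architecture; the end-to-end learnable pose backbone, which exploits the topology of the human body, is abstracted here as precomputed per-joint features.

```python
# Minimal sketch of the two VPN components described in the abstract.
# Shapes and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class SpatialEmbedding(nn.Module):
    """Projects RGB feature maps and 3D pose features into a common semantic space."""
    def __init__(self, rgb_dim=1024, pose_dim=256, embed_dim=512):
        super().__init__()
        self.rgb_proj = nn.Linear(rgb_dim, embed_dim)
        self.pose_proj = nn.Linear(pose_dim, embed_dim)

    def forward(self, rgb_feat, pose_feat):
        # rgb_feat:  (B, T, H*W, rgb_dim)  spatio-temporal visual features
        # pose_feat: (B, T, J, pose_dim)   per-joint pose features
        return self.rgb_proj(rgb_feat), self.pose_proj(pose_feat)

class AttentionCoupler(nn.Module):
    """Pose-driven coupler yielding joint spatio-temporal attention over RGB features."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.coupler = nn.Linear(embed_dim, embed_dim)

    def forward(self, rgb_emb, pose_emb):
        # Summarize the pose stream, then score every spatio-temporal RGB location against it.
        pose_summary = self.coupler(pose_emb.mean(dim=2))              # (B, T, D)
        scores = torch.einsum('btd,btnd->btn', pose_summary, rgb_emb)  # (B, T, H*W)
        attn = torch.softmax(scores, dim=-1).unsqueeze(-1)             # (B, T, H*W, 1)
        return (attn * rgb_emb).sum(dim=2)                             # attended features (B, T, D)
```

A classification head would then pool the attended features over time; in the paper, the pose backbone itself is trained end-to-end together with the rest of the network.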
Related papers
- Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment [55.11291053011696]
This work presents a framework for dealing with 3D scene understanding when the labeled scenes are quite limited.
To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy.
In the limited reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark.
arXiv Detail & Related papers (2023-12-01T15:47:04Z)
- Improving Video Violence Recognition with Human Interaction Learning on 3D Skeleton Point Clouds [88.87985219999764]
We develop a method for video violence recognition from a new perspective of skeleton points.
We first formulate 3D skeleton point clouds from human sequences extracted from videos.
We then perform interaction learning on these 3D skeleton point clouds.
arXiv Detail & Related papers (2023-08-26T12:55:18Z)
- Grounding 3D Object Affordance from 2D Interactions in Images [128.6316708679246]
Grounding 3D object affordance seeks to locate objects' "action possibilities" regions in the 3D space.
Humans possess the ability to perceive object affordances in the physical world through demonstration images or videos.
We devise an Interaction-driven 3D Affordance Grounding Network (IAG), which aligns the region features of objects from different sources.
arXiv Detail & Related papers (2023-03-18T15:37:35Z)
- VPN++: Rethinking Video-Pose embeddings for understanding Activities of Daily Living [8.765045867163648]
We propose an extension of a pose-driven attention mechanism, the Video-Pose Network (VPN).
We show that VPN++ is not only effective but also provides a high speed-up and high resilience to noisy poses.
arXiv Detail & Related papers (2021-05-17T20:19:47Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and the physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
- Decoupled Spatial Temporal Graphs for Generic Visual Grounding [120.66884671951237]
This work investigates a more general setting, generic visual grounding, aiming to mine all the objects satisfying the given expression.
We propose a simple yet effective approach, named DSTG, which decomposes the spatial and temporal representations to collect all-sided cues for precise grounding.
We further elaborate a new video dataset, GVG, that consists of challenging referring cases with far-ranging videos.
arXiv Detail & Related papers (2021-03-18T11:56:29Z)
- Coarse Temporal Attention Network (CTA-Net) for Driver's Activity Recognition [14.07119502083967]
A driver's activities are executed by the same subject with similar body-part movements, so different activities differ only by subtle changes.
Our model is named Coarse Temporal Attention Network (CTA-Net), in which coarse temporal branches are introduced in a trainable glimpse.
The model then uses an innovative attention mechanism to generate high-level action specific contextual information for activity recognition.
arXiv Detail & Related papers (2021-01-17T10:15:37Z)
- Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics [74.6968179473212]
This paper proposes a novel pretext task to address the self-supervised learning problem.
We compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion.
A neural network is built and trained to yield these statistical summaries given the video frames as inputs (see the sketch after this list).
arXiv Detail & Related papers (2020-08-31T08:31:56Z)
- Depth Based Semantic Scene Completion with Position Importance Aware Loss [52.06051681324545]
PALNet is a novel hybrid network for semantic scene completion.
It extracts both 2D and 3D features from multiple stages using fine-grained depth information.
It is beneficial for recovering key details like the boundaries of objects and the corners of the scene.
arXiv Detail & Related papers (2020-01-29T07:05:52Z)
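As a concrete illustration of the pretext labels mentioned in the Spatio-temporal Statistics entry above, below is a minimal NumPy sketch that computes one such summary: the grid cell containing the largest accumulated motion in a clip. The 4x4 partition and the absolute frame-difference motion proxy are simplifying assumptions; the paper derives richer statistics (e.g. dominant motion direction) and trains a video network to regress them from the raw frames.

```python
# Minimal sketch of one pretext label: the grid cell with the largest motion.
# The grid size and the frame-difference motion proxy are assumptions.
import numpy as np

def largest_motion_cell(frames: np.ndarray, grid: int = 4) -> int:
    """frames: (T, H, W) grayscale clip. Returns the row-major index of the cell
    (in a grid x grid partition) with the largest accumulated motion."""
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).sum(axis=0)  # (H, W)
    h, w = diffs.shape
    cell_energy = np.zeros((grid, grid), dtype=np.float32)
    for i in range(grid):
        for j in range(grid):
            block = diffs[i * h // grid:(i + 1) * h // grid,
                          j * w // grid:(j + 1) * w // grid]
            cell_energy[i, j] = block.sum()
    return int(cell_energy.argmax())  # pretext label for self-supervised training

# Example: a synthetic clip whose motion is confined to the top-left region.
clip = np.zeros((8, 64, 64), dtype=np.float32)
clip[1::2, :16, :16] = 1.0
assert largest_motion_cell(clip) == 0
```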
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.