VPN: Learning Video-Pose Embedding for Activities of Daily Living
- URL: http://arxiv.org/abs/2007.03056v1
- Date: Mon, 6 Jul 2020 20:39:08 GMT
- Title: VPN: Learning Video-Pose Embedding for Activities of Daily Living
- Authors: Srijan Das, Saurav Sharma, Rui Dai, Francois Bremond, Monique Thonnat
- Abstract summary: Recent spatio-temporal 3D ConvNets are too rigid to capture subtle visual patterns across an action.
We propose a novel Video-Pose Network: VPN.
Experiments show that VPN outperforms the state-of-the-art results for action classification on a large scale human activity dataset.
- Score: 6.719751155411075
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we focus on the spatio-temporal aspect of recognizing
Activities of Daily Living (ADL). ADL have two specific properties (i) subtle
spatio-temporal patterns and (ii) similar visual patterns varying with time.
Therefore, ADL may look very similar and often require looking at their
fine-grained details to distinguish them. Because the recent spatio-temporal 3D
ConvNets are too rigid to capture the subtle visual patterns across an action,
we propose a novel Video-Pose Network: VPN. The 2 key components of this VPN
are a spatial embedding and an attention network. The spatial embedding
projects the 3D poses and RGB cues in a common semantic space. This enables the
action recognition framework to learn better spatio-temporal features
exploiting both modalities. In order to discriminate similar actions, the
attention network provides two functionalities - (i) an end-to-end learnable
pose backbone exploiting the topology of human body, and (ii) a coupler to
provide joint spatio-temporal attention weights across a video. Experiments
show that VPN outperforms the state-of-the-art results for action
classification on a large scale human activity dataset: NTU-RGB+D 120, its
subset NTU-RGB+D 60, a real-world challenging human activity dataset: Toyota
Smarthome and a small scale human-object interaction dataset Northwestern UCLA.
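To make the two components concrete, here is a minimal PyTorch-style sketch of a spatial embedding and a pose-driven attention coupler. The layer sizes, the mean-pooled pose summary, and the dot-product coupler are illustrative assumptions, not the paper's exact architecture; the end-to-end learnable pose backbone, which exploits the topology of the human body, is abstracted here as precomputed per-joint features.

```python
# Minimal sketch of the two VPN components described in the abstract.
# Shapes and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class SpatialEmbedding(nn.Module):
    """Projects RGB feature maps and 3D pose features into a common semantic space."""
    def __init__(self, rgb_dim=1024, pose_dim=256, embed_dim=512):
        super().__init__()
        self.rgb_proj = nn.Linear(rgb_dim, embed_dim)
        self.pose_proj = nn.Linear(pose_dim, embed_dim)

    def forward(self, rgb_feat, pose_feat):
        # rgb_feat:  (B, T, H*W, rgb_dim)  spatio-temporal visual features
        # pose_feat: (B, T, J, pose_dim)   per-joint pose features
        return self.rgb_proj(rgb_feat), self.pose_proj(pose_feat)

class AttentionCoupler(nn.Module):
    """Pose-driven coupler yielding joint spatio-temporal attention over RGB features."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.coupler = nn.Linear(embed_dim, embed_dim)

    def forward(self, rgb_emb, pose_emb):
        # Summarize the pose stream, then score every spatio-temporal RGB location against it.
        pose_summary = self.coupler(pose_emb.mean(dim=2))              # (B, T, D)
        scores = torch.einsum('btd,btnd->btn', pose_summary, rgb_emb)  # (B, T, H*W)
        attn = torch.softmax(scores, dim=-1).unsqueeze(-1)             # (B, T, H*W, 1)
        return (attn * rgb_emb).sum(dim=2)                             # attended features (B, T, D)
```

A classification head would then pool the attended features over time; in the paper, the pose backbone itself is trained end-to-end together with the rest of the network.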
Related papers
- Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment [55.11291053011696]
This work presents a framework for dealing with 3D scene understanding when the labeled scenes are quite limited.
To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy.
In the limited reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark.
arXiv Detail & Related papers (2023-12-01T15:47:04Z)
- Improving Video Violence Recognition with Human Interaction Learning on 3D Skeleton Point Clouds [88.87985219999764]
We develop a method for video violence recognition from a new perspective of skeleton points.
We first formulate 3D skeleton point clouds from human sequences extracted from videos.
We then perform interaction learning on these 3D skeleton point clouds.
arXiv Detail & Related papers (2023-08-26T12:55:18Z)
- Grounding 3D Object Affordance from 2D Interactions in Images [128.6316708679246]
Grounding 3D object affordance seeks to locate objects' "action possibilities" regions in the 3D space.
Humans possess the ability to perceive object affordances in the physical world through demonstration images or videos.
We devise an Interaction-driven 3D Affordance Grounding Network (IAG), which aligns the region features of objects from different sources.
arXiv Detail & Related papers (2023-03-18T15:37:35Z)
- VPN++: Rethinking Video-Pose embeddings for understanding Activities of Daily Living [8.765045867163648]
We propose an extension of a pose-driven attention mechanism, the Video-Pose Network (VPN).
We show that VPN++ is not only effective but also provides a high speed-up and high resilience to noisy poses.
arXiv Detail & Related papers (2021-05-17T20:19:47Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and the physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
- Decoupled Spatial Temporal Graphs for Generic Visual Grounding [120.66884671951237]
This work investigates a more general setting, generic visual grounding, aiming to mine all the objects satisfying the given expression.
We propose a simple yet effective approach, named DSTG, which decomposes the spatial and temporal representations to collect all-sided cues for precise grounding.
We further elaborate a new video dataset, GVG, that consists of challenging referring cases with far-ranging videos.
arXiv Detail & Related papers (2021-03-18T11:56:29Z)
- Coarse Temporal Attention Network (CTA-Net) for Driver's Activity Recognition [14.07119502083967]
A driver's activities are executed by the same subject with similar body-part movements, so different activities differ only by subtle changes.
Our model is named Coarse Temporal Attention Network (CTA-Net), in which coarse temporal branches are introduced in a trainable glimpse.
The model then uses an innovative attention mechanism to generate high-level action specific contextual information for activity recognition.
arXiv Detail & Related papers (2021-01-17T10:15:37Z)
- Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics [74.6968179473212]
This paper proposes a novel pretext task to address the self-supervised learning problem.
We compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion.
A neural network is built and trained to yield these statistical summaries given the video frames as inputs (see the sketch after this list).
arXiv Detail & Related papers (2020-08-31T08:31:56Z)
- Depth Based Semantic Scene Completion with Position Importance Aware Loss [52.06051681324545]
PALNet is a novel hybrid network for semantic scene completion.
It extracts both 2D and 3D features from multiple stages using fine-grained depth information.
It is beneficial for recovering key details like the boundaries of objects and the corners of the scene.
arXiv Detail & Related papers (2020-01-29T07:05:52Z)
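As a concrete illustration of the pretext labels mentioned in the Spatio-temporal Statistics entry above, below is a minimal NumPy sketch that computes one such summary: the grid cell containing the largest accumulated motion in a clip. The 4x4 partition and the absolute frame-difference motion proxy are simplifying assumptions; the paper derives richer statistics (e.g. dominant motion direction) and trains a video network to regress them from the raw frames.

```python
# Minimal sketch of one pretext label: the grid cell with the largest motion.
# The grid size and the frame-difference motion proxy are assumptions.
import numpy as np

def largest_motion_cell(frames: np.ndarray, grid: int = 4) -> int:
    """frames: (T, H, W) grayscale clip. Returns the row-major index of the cell
    (in a grid x grid partition) with the largest accumulated motion."""
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).sum(axis=0)  # (H, W)
    h, w = diffs.shape
    cell_energy = np.zeros((grid, grid), dtype=np.float32)
    for i in range(grid):
        for j in range(grid):
            block = diffs[i * h // grid:(i + 1) * h // grid,
                          j * w // grid:(j + 1) * w // grid]
            cell_energy[i, j] = block.sum()
    return int(cell_energy.argmax())  # pretext label for self-supervised training

# Example: a synthetic clip whose motion is confined to the top-left region.
clip = np.zeros((8, 64, 64), dtype=np.float32)
clip[1::2, :16, :16] = 1.0
assert largest_motion_cell(clip) == 0
```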
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.