VPN++: Rethinking Video-Pose embeddings for understanding Activities of
Daily Living
- URL: http://arxiv.org/abs/2105.08141v1
- Date: Mon, 17 May 2021 20:19:47 GMT
- Title: VPN++: Rethinking Video-Pose embeddings for understanding Activities of
Daily Living
- Authors: Srijan Das, Rui Dai, Di Yang, Francois Bremond
- Abstract summary: We propose VPN++, an extension of a pose-driven attention mechanism, the Video-Pose Network (VPN).
We show that VPN++ is not only effective but also provides a substantial speed-up and high resilience to noisy poses.
- Score: 8.765045867163648
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many attempts have been made to combine RGB and 3D poses for the
recognition of Activities of Daily Living (ADL). ADL may look very similar and
often necessitate modeling fine-grained details to distinguish them. Because
recent 3D ConvNets are too rigid to capture the subtle visual patterns across an
action, this research direction is dominated by methods combining RGB and 3D
poses. But the cost of computing 3D poses from an RGB stream is high in the
absence of appropriate sensors, which limits the use of such approaches in
real-world applications requiring low latency. So, how can 3D poses best be
exploited for recognizing ADL? To this end, we propose an extension of a
pose-driven attention mechanism, the Video-Pose Network (VPN), exploring two
distinct directions: one transfers pose knowledge into RGB through
feature-level distillation, and the other mimics pose-driven attention through
attention-level distillation. These two approaches are then integrated into a
single model, which we call VPN++. We show that VPN++ is not only effective but
also provides a substantial speed-up and high resilience to noisy poses. VPN++,
with or without 3D poses, outperforms the representative baselines on 4 public
datasets. Code is available at
https://github.com/srijandas07/vpnplusplus.
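To make the two distillation directions above concrete, here is a minimal PyTorch-style sketch of how a feature-level and an attention-level distillation loss could be combined when training an RGB student against a pose-driven teacher. The function name, tensor shapes, and loss weights are illustrative assumptions, not the released implementation; see the repository linked above for the authors' code.

```python
import torch
import torch.nn.functional as F

def vpn_pp_style_distillation(rgb_feat, rgb_attn, pose_feat, pose_attn,
                              w_feat=1.0, w_attn=1.0):
    """Illustrative sketch of VPN++-style distillation (not the official code).

    rgb_feat  : (B, D)       clip features from the RGB student branch
    rgb_attn  : (B, T, H, W) attention maps predicted by the RGB student
    pose_feat : (B, D)       features from the pose-driven teacher branch
    pose_attn : (B, T, H, W) pose-driven attention maps from the teacher
    """
    # Feature-level distillation: pull RGB features towards pose features.
    feat_loss = F.mse_loss(rgb_feat, pose_feat.detach())

    # Attention-level distillation: the RGB branch learns to mimic the
    # pose-driven attention maps, so no 3D pose is needed at test time.
    attn_loss = F.kl_div(
        F.log_softmax(rgb_attn.flatten(1), dim=1),
        F.softmax(pose_attn.detach().flatten(1), dim=1),
        reduction="batchmean",
    )
    return w_feat * feat_loss + w_attn * attn_loss
```

At inference only the RGB student would be kept, which is consistent with the speed-up and the robustness to noisy poses reported in the abstract.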
Related papers
- Tackling View-Dependent Semantics in 3D Language Gaussian Splatting [80.88015191411714]
LaGa establishes cross-view semantic connections by decomposing the 3D scene into objects.
It constructs view-aggregated semantic representations by clustering semantic descriptors and reweighting them based on multi-view semantics.
Under the same settings, LaGa achieves a significant improvement of +18.7% mIoU over the previous SOTA on the LERF-OVS dataset.
arXiv Detail & Related papers (2025-05-30T16:06:32Z)
- VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding [57.04804711488706]
3D visual grounding is crucial for robots, requiring integration of natural language and 3D scene understanding.
We present VLM-Grounder, a novel framework using vision-language models (VLMs) for zero-shot 3D visual grounding based solely on 2D images.
arXiv Detail & Related papers (2024-10-17T17:59:55Z)
- Just Add $\pi$! Pose Induced Video Transformers for Understanding Activities of Daily Living [9.370655190768163]
We introduce $\pi$-ViT, a novel approach that augments the RGB representations learned by video transformers with 2D and 3D pose information.
$\pi$-ViT achieves state-of-the-art performance on three prominent ADL datasets.
arXiv Detail & Related papers (2023-11-30T18:59:56Z)
- Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing an object in a 3D scene that is referred to by a natural-language description.
We propose a dense 3D grounding network, featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z)
- Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding [23.672405624011873]
We propose a module to consolidate the 3D visual stream with 2D clues synthesized from point clouds.
We empirically show their aptitude to boost the quality of the learned visual representations.
Our proposed module, dubbed as Look Around and Refer (LAR), significantly outperforms the state-of-the-art 3D visual grounding techniques on three benchmarks.
arXiv Detail & Related papers (2022-11-25T17:12:08Z)
- TANDEM3D: Active Tactile Exploration for 3D Object Recognition [16.548376556543015]
We propose TANDEM3D, a method that applies a co-training framework for 3D object recognition with tactile signals.
TANDEM3D is based on a novel encoder that builds 3D object representation from contact positions and normals using PointNet++.
Our method is trained entirely in simulation and validated with real-world experiments.
arXiv Detail & Related papers (2022-09-19T05:54:26Z)
- Multi-View Transformer for 3D Visual Grounding [64.30493173825234]
We propose a Multi-View Transformer (MVT) for 3D visual grounding.
We project the 3D scene into a multi-view space, in which the position information of the 3D scene under different views is modeled simultaneously and aggregated.
arXiv Detail & Related papers (2022-04-05T12:59:43Z)
- Tracking People with 3D Representations [78.97070307547283]
We present a novel approach for tracking multiple people in video.
Unlike past approaches which employ 2D representations, we employ 3D representations of people, located in three-dimensional space.
We find that 3D representations are more effective than 2D representations for tracking in these settings.
arXiv Detail & Related papers (2021-11-15T16:15:21Z)
- VPN: Learning Video-Pose Embedding for Activities of Daily Living [6.719751155411075]
Recent 3D ConvNets are too rigid to capture subtle visual patterns across an action.
We propose a novel Video-Pose Network: VPN (a generic sketch of such a pose-driven attention block appears after this list).
Experiments show that VPN outperforms the state of the art for action classification on a large-scale human activity dataset.
arXiv Detail & Related papers (2020-07-06T20:39:08Z)
- VoxelPose: Towards Multi-Camera 3D Human Pose Estimation in Wild Environment [80.77351380961264]
We present an approach to estimate 3D poses of multiple people from multiple camera views.
We present an end-to-end solution which operates in 3D space and therefore avoids making incorrect decisions in 2D space.
We propose Pose Regression Network (PRN) to estimate a detailed 3D pose for each proposal.
arXiv Detail & Related papers (2020-04-13T23:50:01Z)
- ImVoteNet: Boosting 3D Object Detection in Point Clouds with Image Votes [93.82668222075128]
We propose a 3D detection architecture called ImVoteNet for RGB-D scenes.
ImVoteNet is based on fusing 2D votes in images and 3D votes in point clouds.
We validate our model on the challenging SUN RGB-D dataset.
arXiv Detail & Related papers (2020-01-29T05:09:28Z)
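As referenced in the VPN entry above, the core idea of a pose-driven attention mechanism is to let an embedding of the 3D skeleton decide which parts of the RGB spatio-temporal feature map to emphasize. The sketch below is a generic illustration under that assumption; the layer names, dimensions, and the exact form of the attention are hypothetical and do not reproduce the paper's architecture.

```python
import torch
import torch.nn as nn

class PoseDrivenAttention(nn.Module):
    """Generic pose-driven spatial attention over RGB features (illustrative only)."""

    def __init__(self, pose_dim=256, rgb_channels=512):
        super().__init__()
        # Project the pose embedding into the RGB channel space to form a query.
        self.to_query = nn.Linear(pose_dim, rgb_channels)

    def forward(self, rgb_feat, pose_emb):
        # rgb_feat : (B, C, T, H, W)  spatio-temporal features from a 3D ConvNet
        # pose_emb : (B, P)           embedding of the 3D skeleton sequence
        B, C, T, H, W = rgb_feat.shape
        query = self.to_query(pose_emb)                    # (B, C)
        keys = rgb_feat.flatten(2)                         # (B, C, T*H*W)
        logits = torch.einsum("bc,bcn->bn", query, keys)   # pose/location compatibility
        attn = torch.softmax(logits / C ** 0.5, dim=1)     # (B, T*H*W)
        attn = attn.view(B, 1, T, H, W)
        return rgb_feat * attn                             # pose-weighted RGB features
```

In VPN++, pose-driven attention of this kind is what the attention-level distillation asks the RGB-only branch to mimic.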