Just Add $\pi$! Pose Induced Video Transformers for Understanding
Activities of Daily Living
- URL: http://arxiv.org/abs/2311.18840v1
- Date: Thu, 30 Nov 2023 18:59:56 GMT
- Title: Just Add $\pi$! Pose Induced Video Transformers for Understanding
Activities of Daily Living
- Authors: Dominick Reilly, Srijan Das
- Abstract summary: We introduce PI-ViT, a novel approach that augments the RGB representations learned by video transformers with 2D and 3D pose information.
$\pi$-ViT achieves state-of-the-art performance on three prominent ADL datasets.
- Score: 9.370655190768163
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video transformers have become the de facto standard for human action
recognition, yet their exclusive reliance on the RGB modality still limits
their adoption in certain domains. One such domain is Activities of Daily
Living (ADL), where RGB alone is not sufficient to distinguish between visually
similar actions, or actions observed from multiple viewpoints. To facilitate
the adoption of video transformers for ADL, we hypothesize that the
augmentation of RGB with human pose information, known for its sensitivity to
fine-grained motion and multiple viewpoints, is essential. Consequently, we
introduce the first Pose Induced Video Transformer: PI-ViT (or $\pi$-ViT), a
novel approach that augments the RGB representations learned by video
transformers with 2D and 3D pose information. The key elements of $\pi$-ViT are
two plug-in modules, 2D Skeleton Induction Module and 3D Skeleton Induction
Module, that are responsible for inducing 2D and 3D pose information into the
RGB representations. These modules operate by performing pose-aware auxiliary
tasks, a design choice that allows $\pi$-ViT to discard the modules during
inference. Notably, $\pi$-ViT achieves the state-of-the-art performance on
three prominent ADL datasets, encompassing both real-world and large-scale
RGB-D datasets, without requiring poses or additional computational overhead at
inference.
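
The abstract describes the core mechanism: the 2D and 3D Skeleton Induction Modules are plug-in heads that perform pose-aware auxiliary tasks during training only, so the model runs on RGB alone at inference. Below is a minimal PyTorch sketch of that plug-in pattern; the module internals, feature shapes, pose target dimensions, and the MSE auxiliary losses are illustrative assumptions, not the paper's actual design.

```python
# Minimal sketch (not the authors' code) of the plug-in auxiliary-module pattern
# described in the abstract: pose-aware heads attach to intermediate RGB features
# during training and are simply dropped at inference. Names, shapes, and losses
# below are assumptions for illustration.
import torch
import torch.nn as nn

class SkeletonInductionHead(nn.Module):
    """Hypothetical induction module: maps pooled RGB tokens to a pose-aware target."""
    def __init__(self, dim, target_dim):
        super().__init__()
        self.proj = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, target_dim))

    def forward(self, rgb_tokens):                   # rgb_tokens: (B, N, dim)
        return self.proj(rgb_tokens.mean(dim=1))     # pooled prediction: (B, target_dim)

class PoseInducedVideoTransformer(nn.Module):
    def __init__(self, backbone, dim=768, num_classes=60,
                 pose2d_dim=34, pose3d_dim=51, training_mode=True):
        super().__init__()
        self.backbone = backbone                     # any RGB video transformer returning tokens
        self.classifier = nn.Linear(dim, num_classes)
        # Plug-in modules exist only for training-time auxiliary tasks.
        self.sim_2d = SkeletonInductionHead(dim, pose2d_dim) if training_mode else None
        self.sim_3d = SkeletonInductionHead(dim, pose3d_dim) if training_mode else None

    def forward(self, video, pose2d=None, pose3d=None):
        tokens = self.backbone(video)                # (B, N, dim) RGB representations
        logits = self.classifier(tokens.mean(dim=1))
        if self.sim_2d is None:                      # inference: RGB only, no poses needed
            return logits
        # Training: pose-aware auxiliary losses induce pose information into the
        # RGB tokens; the exact task formulation in the paper may differ.
        aux = nn.functional.mse_loss(self.sim_2d(tokens), pose2d) \
            + nn.functional.mse_loss(self.sim_3d(tokens), pose3d)
        return logits, aux

# Example usage with a dummy backbone (assumption: backbone returns (B, N, dim) tokens):
# backbone = lambda v: torch.randn(v.shape[0], 196, 768)
# model = PoseInducedVideoTransformer(backbone)
# logits, aux_loss = model(torch.randn(2, 3, 16, 224, 224),
#                          pose2d=torch.randn(2, 34), pose3d=torch.randn(2, 51))
```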
Related papers
- Just Dance with $\pi$! A Poly-modal Inductor for Weakly-supervised Video Anomaly Detection [12.492419773705898]
PI-VAD is a novel approach that augments RGB representations with five additional modalities.
PI-VAD achieves state-of-the-art accuracy on three prominent VAD scenarios.
arXiv Detail & Related papers (2025-05-19T13:51:57Z) - PanoSLAM: Panoptic 3D Scene Reconstruction via Gaussian SLAM [105.01907579424362]
PanoSLAM is the first SLAM system to integrate geometric reconstruction, 3D semantic segmentation, and 3D instance segmentation within a unified framework.
For the first time, it achieves panoptic 3D reconstruction of open-world environments directly from the RGB-D video.
arXiv Detail & Related papers (2024-12-31T08:58:10Z) - Towards Human-Level 3D Relative Pose Estimation: Generalizable, Training-Free, with Single Reference [62.99706119370521]
Humans can easily deduce the relative pose of an unseen object, without labels or training, given only a single query-reference image pair.
We propose a novel, generalizable 3D relative pose estimation method that combines (i) a 2.5D shape from an RGB-D reference, (ii) an off-the-shelf differentiable renderer, and (iii) semantic cues from a pretrained model such as DINOv2.
arXiv Detail & Related papers (2024-06-26T16:01:10Z) - ViDSOD-100: A New Dataset and a Baseline Model for RGB-D Video Salient Object Detection [51.16181295385818]
We first collect an annotated RGB-D video salient object detection dataset (ViDSOD-100), which contains 100 videos with a total of 9,362 frames.
All frames in each video are manually annotated with high-quality saliency annotations.
We propose a new baseline model, named attentive triple-fusion network (ATF-Net), for RGB-D salient object detection.
arXiv Detail & Related papers (2024-06-18T12:09:43Z) - EvPlug: Learn a Plug-and-Play Module for Event and Image Fusion [55.367269556557645]
EvPlug learns a plug-and-play event and image fusion module from the supervision of the existing RGB-based model.
We demonstrate the superiority of EvPlug in several vision tasks such as object detection, semantic segmentation, and 3D hand pose estimation.
arXiv Detail & Related papers (2023-12-28T10:05:13Z) - Salient Object Detection in RGB-D Videos [11.805682025734551]
This paper makes two primary contributions: the dataset and the model.
We construct the RDVS dataset, a new RGB-D VSOD dataset with realistic depth.
We introduce DCTNet+, a three-stream network tailored for RGB-D VSOD.
arXiv Detail & Related papers (2023-10-24T03:18:07Z) - DFormer: Rethinking RGBD Representation Learning for Semantic
Segmentation [76.81628995237058]
DFormer is a novel framework to learn transferable representations for RGB-D segmentation tasks.
It pretrains the backbone using image-depth pairs from ImageNet-1K.
DFormer achieves new state-of-the-art performance on two popular RGB-D tasks.
arXiv Detail & Related papers (2023-09-18T11:09:11Z) - A Strong Transfer Baseline for RGB-D Fusion in Vision Transformers [0.0]
We propose a recipe for transferring pretrained ViTs in RGB-D domains for single-view 3D object recognition.
We show that our adapted ViTs score up to 95.1% top-1 accuracy on the Washington RGB-D benchmark, achieving new state-of-the-art results.
arXiv Detail & Related papers (2022-10-03T12:08:09Z) - Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient
Object Detection [67.33924278729903]
In this work, we propose Dual Swin-Transformer based Mutual Interactive Network.
We adopt Swin-Transformer as the feature extractor for both the RGB and depth modalities to model long-range dependencies in visual inputs.
Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet method.
arXiv Detail & Related papers (2022-06-07T08:35:41Z) - MFEViT: A Robust Lightweight Transformer-based Network for Multimodal
2D+3D Facial Expression Recognition [1.7448845398590227]
Vision transformer (ViT) has been widely applied in many areas due to its self-attention mechanism.
We propose a robust lightweight pure transformer-based network for multimodal 2D+3D FER, namely MFEViT.
Our MFEViT outperforms state-of-the-art approaches with an accuracy of 90.83% on BU-3DFE and 90.28% on Bosphorus.
arXiv Detail & Related papers (2021-09-20T17:19:39Z) - RGB2Hands: Real-Time Tracking of 3D Hand Interactions from Monocular RGB
Video [76.86512780916827]
We present the first real-time method for motion capture of skeletal pose and 3D surface geometry of hands from a single RGB camera.
In order to address the inherent depth ambiguities in RGB data, we propose a novel multi-task CNN.
We experimentally verify the individual components of our RGB two-hand tracking and 3D reconstruction pipeline.
arXiv Detail & Related papers (2021-06-22T12:53:56Z) - VPN++: Rethinking Video-Pose embeddings for understanding Activities of
Daily Living [8.765045867163648]
We propose an extension of a pose-driven attention mechanism: the Video-Pose Network (VPN).
We show that VPN++ is not only effective but also provides a high speedup and high resilience to noisy poses.
arXiv Detail & Related papers (2021-05-17T20:19:47Z) - Infrared and 3D skeleton feature fusion for RGB-D action recognition [0.30458514384586394]
We propose a modular network combining skeleton and infrared data.
A 2D convolutional network (CNN) is used as a pose module to extract features from skeleton data.
A 3D CNN is used as an infrared module to extract visual cues from videos.
arXiv Detail & Related papers (2020-02-28T17:42:53Z)