Seeing the Pose in the Pixels: Learning Pose-Aware Representations in Vision Transformers
- URL: http://arxiv.org/abs/2306.09331v1
- Date: Thu, 15 Jun 2023 17:58:39 GMT
- Title: Seeing the Pose in the Pixels: Learning Pose-Aware Representations in Vision Transformers
- Authors: Dominick Reilly and Aman Chadha and Srijan Das
- Abstract summary: We introduce two strategies for learning pose-aware representations in Vision Transformers (ViTs).
The first method, called Pose-aware Attention Block (PAAB), is a plug-and-play ViT block that performs localized attention on pose regions within videos.
The second method, dubbed Pose-Aware Auxiliary Task (PAAT), presents an auxiliary pose prediction task optimized jointly with the primary ViT task.
- Score: 1.8047694351309207
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human perception of surroundings is often guided by the various poses present
within the environment. Many computer vision tasks, such as human action
recognition and robot imitation learning, rely on pose-based entities like
human skeletons or robotic arms. However, conventional Vision Transformer (ViT)
models uniformly process all patches, neglecting valuable pose priors in input
videos. We argue that incorporating poses into RGB data is advantageous for
learning fine-grained and viewpoint-agnostic representations. Consequently, we
introduce two strategies for learning pose-aware representations in ViTs. The
first method, called Pose-aware Attention Block (PAAB), is a plug-and-play ViT
block that performs localized attention on pose regions within videos. The
second method, dubbed Pose-Aware Auxiliary Task (PAAT), presents an auxiliary
pose prediction task optimized jointly with the primary ViT task. Although
their functionalities differ, both methods succeed in learning pose-aware
representations, enhancing performance in multiple diverse downstream tasks.
Our experiments, conducted across seven datasets, reveal the efficacy of both
pose-aware methods on three video analysis tasks, with PAAT holding a slight
edge over PAAB. Both PAAT and PAAB surpass their respective backbone
Transformers by up to 9.8% in real-world action recognition and 21.8% in
multi-view robotic video alignment. Code is available at
https://github.com/dominickrei/PoseAwareVT.
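To make the two strategies concrete, below is a minimal, hypothetical PyTorch sketch of the ideas described above: a PAAB-like block that restricts self-attention to the patch tokens overlapping pose regions, and a PAAT-like auxiliary head whose pose-prediction loss is optimized jointly with the primary task loss. All class names, shapes, and the loss weighting are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.
```python
# Hypothetical sketch (not the authors' code): a PAAB-like localized-attention block
# and a PAAT-like auxiliary pose-prediction loss, written in PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalizedPoseAttention(nn.Module):
    """Attention restricted to patch tokens that overlap pose regions (PAAB-like)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor, pose_mask: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) patch tokens; pose_mask: (B, N) bool, True where a patch
        # covers a pose region. key_padding_mask is True for positions to IGNORE,
        # so keys outside pose regions are excluded from attention.
        out, _ = self.attn(tokens, tokens, tokens, key_padding_mask=~pose_mask)
        return tokens + out  # residual connection


class PosePredictionHead(nn.Module):
    """Auxiliary head regressing 2D keypoints from a pooled ViT feature (PAAT-like)."""

    def __init__(self, dim: int, num_joints: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, num_joints * 2)
        )

    def forward(self, feature: torch.Tensor) -> torch.Tensor:
        # feature: (B, D), e.g. the [CLS] token; returns flattened (x, y) per joint.
        return self.mlp(feature)


def joint_loss(primary_logits, labels, pred_pose, gt_pose, aux_weight: float = 0.5):
    # Primary objective (e.g. action classification) plus a weighted auxiliary pose loss.
    return F.cross_entropy(primary_logits, labels) + aux_weight * F.mse_loss(pred_pose, gt_pose)
```
In a training loop, `joint_loss` would replace the usual primary-task loss, which is the sense in which the auxiliary pose task is optimized jointly with the primary ViT task.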
Related papers
- UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing [79.68232381605661]
We present UniPose, a framework to comprehend, generate, and edit human poses across various modalities.
Specifically, we apply a pose tokenizer to convert 3D poses into discrete pose tokens, enabling seamless integration into the LLM within a unified vocabulary.
Benefiting from a unified learning strategy, UniPose effectively transfers knowledge across different pose-relevant tasks, adapts to unseen tasks, and exhibits extended capabilities.
arXiv Detail & Related papers (2024-11-25T08:06:30Z)
- DVANet: Disentangling View and Action Features for Multi-View Action Recognition [56.283944756315066]
We present a novel approach to multi-view action recognition where we guide learned action representations to be separated from view-relevant information in a video.
Our model and training method significantly outperform all other uni-modal models on four multi-view action recognition datasets.
arXiv Detail & Related papers (2023-12-10T01:19:48Z)
- ViLP: Knowledge Exploration using Vision, Language, and Pose Embeddings for Video Action Recognition [4.36572039512405]
We present the first pose-augmented vision-language model (VLM) for video action recognition.
Notably, our scheme achieves an accuracy of 92.81% and 73.02% on two popular human video action recognition benchmark datasets.
arXiv Detail & Related papers (2023-08-07T20:50:54Z)
- RUST: Latent Neural Scene Representations from Unposed Imagery [21.433079925439234]
Inferring structure of 3D scenes from 2D observations is a fundamental challenge in computer vision.
Recently popularized approaches based on neural scene representations have achieved tremendous impact.
RUST (Really Unposed Scene representation Transformer) is a pose-free approach to novel view synthesis, trained on RGB images alone.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- Learning Object Manipulation Skills from Video via Approximate Differentiable Physics [27.923004421974156]
We teach robots to perform simple object manipulation tasks by watching a single video demonstration.
A differentiable scene ensures perceptual fidelity between the 3D scene and the 2D video.
We evaluate our approach on a 3D reconstruction task that consists of 54 video demonstrations.
arXiv Detail & Related papers (2022-08-03T10:21:47Z) - VirtualPose: Learning Generalizable 3D Human Pose Models from Virtual
Data [69.64723752430244]
We introduce VirtualPose, a two-stage learning framework to exploit the hidden "free lunch" specific to this task.
The first stage transforms images to abstract geometry representations (AGR), and then the second maps them to 3D poses.
It addresses the generalization issue from two aspects: (1) the first stage can be trained on diverse 2D datasets to reduce the risk of over-fitting to limited appearance; (2) the second stage can be trained on diverse AGR synthesized from a large number of virtual cameras and poses.
arXiv Detail & Related papers (2022-07-20T14:47:28Z) - Patch-level Representation Learning for Self-supervised Vision
Transformers [68.8862419248863]
Vision Transformers (ViTs) have gained much attention recently as a better architectural choice, often outperforming convolutional networks for various visual tasks.
Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations.
We demonstrate that SelfPatch can significantly improve the performance of existing SSL methods for various visual tasks.
arXiv Detail & Related papers (2022-06-16T08:01:19Z)
- Unsupervised 3D Human Pose Representation with Viewpoint and Pose Disentanglement [63.853412753242615]
Learning a good 3D human pose representation is important for human pose related tasks.
We propose a novel Siamese denoising autoencoder to learn a 3D pose representation.
Our approach achieves state-of-the-art performance on two inherently different tasks.
arXiv Detail & Related papers (2020-07-14T14:25:22Z)
- IntegralAction: Pose-driven Feature Integration for Robust Human Action Recognition in Videos [94.06960017351574]
We learn pose-driven feature integration that dynamically combines appearance and pose streams by observing pose features on the fly.
We show that the proposed IntegralAction achieves highly robust performance across in-context and out-of-context action video datasets.
arXiv Detail & Related papers (2020-07-13T11:24:48Z)
- Active Perception and Representation for Robotic Manipulation [0.8315801422499861]
We present a framework that leverages the benefits of active perception to accomplish manipulation tasks.
Our agent uses viewpoint changes to localize objects, to learn state representations in a self-supervised manner, and to perform goal-directed actions.
Compared to vanilla deep Q-learning algorithms, our model is at least four times more sample-efficient.
arXiv Detail & Related papers (2020-03-15T01:43:51Z)