Seeing the Pose in the Pixels: Learning Pose-Aware Representations in Vision Transformers
- URL: http://arxiv.org/abs/2306.09331v1
- Date: Thu, 15 Jun 2023 17:58:39 GMT
- Title: Seeing the Pose in the Pixels: Learning Pose-Aware Representations in Vision Transformers
- Authors: Dominick Reilly and Aman Chadha and Srijan Das
- Abstract summary: We introduce two strategies for learning pose-aware representations in Vision Transformers (ViTs).
The first method, called Pose-aware Attention Block (PAAB), is a plug-and-play ViT block that performs localized attention on pose regions within videos.
The second method, dubbed Pose-Aware Auxiliary Task (PAAT), presents an auxiliary pose prediction task optimized jointly with the primary ViT task.
- Score: 1.8047694351309207
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human perception of surroundings is often guided by the various poses present
within the environment. Many computer vision tasks, such as human action
recognition and robot imitation learning, rely on pose-based entities like
human skeletons or robotic arms. However, conventional Vision Transformer (ViT)
models uniformly process all patches, neglecting valuable pose priors in input
videos. We argue that incorporating poses into RGB data is advantageous for
learning fine-grained and viewpoint-agnostic representations. Consequently, we
introduce two strategies for learning pose-aware representations in ViTs. The
first method, called Pose-aware Attention Block (PAAB), is a plug-and-play ViT
block that performs localized attention on pose regions within videos. The
second method, dubbed Pose-Aware Auxiliary Task (PAAT), presents an auxiliary
pose prediction task optimized jointly with the primary ViT task. Although
their functionalities differ, both methods succeed in learning pose-aware
representations, enhancing performance in multiple diverse downstream tasks.
Our experiments, conducted across seven datasets, reveal the efficacy of both
pose-aware methods on three video analysis tasks, with PAAT holding a slight
edge over PAAB. Both PAAT and PAAB surpass their respective backbone
Transformers by up to 9.8% in real-world action recognition and 21.8% in
multi-view robotic video alignment. Code is available at
https://github.com/dominickrei/PoseAwareVT.
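To make the two strategies concrete, below is a minimal, hypothetical PyTorch sketch of the ideas described above: a PAAB-like block that restricts self-attention to the patch tokens overlapping pose regions, and a PAAT-like auxiliary head whose pose-prediction loss is optimized jointly with the primary task loss. All class names, shapes, and the loss weighting are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.
```python
# Hypothetical sketch (not the authors' code): a PAAB-like localized-attention block
# and a PAAT-like auxiliary pose-prediction loss, written in PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalizedPoseAttention(nn.Module):
    """Attention restricted to patch tokens that overlap pose regions (PAAB-like)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor, pose_mask: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) patch tokens; pose_mask: (B, N) bool, True where a patch
        # covers a pose region. key_padding_mask is True for positions to IGNORE,
        # so keys outside pose regions are excluded from attention.
        out, _ = self.attn(tokens, tokens, tokens, key_padding_mask=~pose_mask)
        return tokens + out  # residual connection


class PosePredictionHead(nn.Module):
    """Auxiliary head regressing 2D keypoints from a pooled ViT feature (PAAT-like)."""

    def __init__(self, dim: int, num_joints: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, num_joints * 2)
        )

    def forward(self, feature: torch.Tensor) -> torch.Tensor:
        # feature: (B, D), e.g. the [CLS] token; returns flattened (x, y) per joint.
        return self.mlp(feature)


def joint_loss(primary_logits, labels, pred_pose, gt_pose, aux_weight: float = 0.5):
    # Primary objective (e.g. action classification) plus a weighted auxiliary pose loss.
    return F.cross_entropy(primary_logits, labels) + aux_weight * F.mse_loss(pred_pose, gt_pose)
```
In a training loop, `joint_loss` would replace the usual primary-task loss, which is the sense in which the auxiliary pose task is optimized jointly with the primary ViT task.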
Related papers
- UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing [79.68232381605661]
We present UniPose, a framework to comprehend, generate, and edit human poses across various modalities.
Specifically, we apply a pose tokenizer to convert 3D poses into discrete pose tokens, enabling seamless integration into the LLM within a unified vocabulary.
Benefiting from a unified learning strategy, UniPose effectively transfers knowledge across different pose-relevant tasks, adapts to unseen tasks, and exhibits extended capabilities.
arXiv Detail & Related papers (2024-11-25T08:06:30Z)
- DVANet: Disentangling View and Action Features for Multi-View Action Recognition [56.283944756315066]
We present a novel approach to multi-view action recognition where we guide learned action representations to be separated from view-relevant information in a video.
Our model and training method significantly outperform all other uni-modal models on four multi-view action recognition datasets.
arXiv Detail & Related papers (2023-12-10T01:19:48Z)
- ViLP: Knowledge Exploration using Vision, Language, and Pose Embeddings for Video Action Recognition [4.36572039512405]
We present the first pose-augmented vision-language model (VLM) for video action recognition.
Notably, our scheme achieves an accuracy of 92.81% and 73.02% on two popular human video action recognition benchmark datasets.
arXiv Detail & Related papers (2023-08-07T20:50:54Z)
- RUST: Latent Neural Scene Representations from Unposed Imagery [21.433079925439234]
Inferring structure of 3D scenes from 2D observations is a fundamental challenge in computer vision.
Recently popularized approaches based on neural scene representations have achieved tremendous impact.
RUST (Really Unposed Scene representation Transformer) is a pose-free approach to novel view synthesis, trained on RGB images alone.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- Learning Object Manipulation Skills from Video via Approximate Differentiable Physics [27.923004421974156]
We teach robots to perform simple object manipulation tasks by watching a single video demonstration.
A differentiable scene ensures perceptual fidelity between the 3D scene and the 2D video.
We evaluate our approach on a 3D reconstruction task that consists of 54 video demonstrations.
arXiv Detail & Related papers (2022-08-03T10:21:47Z) - VirtualPose: Learning Generalizable 3D Human Pose Models from Virtual
Data [69.64723752430244]
We introduce VirtualPose, a two-stage learning framework to exploit the hidden "free lunch" specific to this task.
The first stage transforms images to abstract geometry representations (AGR), and then the second maps them to 3D poses.
It addresses the generalization issue from two aspects: (1) the first stage can be trained on diverse 2D datasets to reduce the risk of over-fitting to limited appearance; (2) the second stage can be trained on diverse AGR synthesized from a large number of virtual cameras and poses.
arXiv Detail & Related papers (2022-07-20T14:47:28Z) - Patch-level Representation Learning for Self-supervised Vision
Transformers [68.8862419248863]
Vision Transformers (ViTs) have gained much attention recently as a better architectural choice, often outperforming convolutional networks for various visual tasks.
Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations.
We demonstrate that SelfPatch can significantly improve the performance of existing SSL methods for various visual tasks.
arXiv Detail & Related papers (2022-06-16T08:01:19Z)
- Unsupervised 3D Human Pose Representation with Viewpoint and Pose Disentanglement [63.853412753242615]
Learning a good 3D human pose representation is important for human pose related tasks.
We propose a novel Siamese denoising autoencoder to learn a 3D pose representation.
Our approach achieves state-of-the-art performance on two inherently different tasks.
arXiv Detail & Related papers (2020-07-14T14:25:22Z)
- IntegralAction: Pose-driven Feature Integration for Robust Human Action Recognition in Videos [94.06960017351574]
We learn pose-driven feature integration that dynamically combines appearance and pose streams by observing pose features on the fly.
We show that the proposed IntegralAction achieves highly robust performance across in-context and out-of-context action video datasets.
arXiv Detail & Related papers (2020-07-13T11:24:48Z)
- Active Perception and Representation for Robotic Manipulation [0.8315801422499861]
We present a framework that leverages the benefits of active perception to accomplish manipulation tasks.
Our agent uses viewpoint changes to localize objects, to learn state representations in a self-supervised manner, and to perform goal-directed actions.
Compared to vanilla deep Q-learning algorithms, our model is at least four times more sample-efficient.
arXiv Detail & Related papers (2020-03-15T01:43:51Z)