MotionAGFormer: Enhancing 3D Human Pose Estimation with a
Transformer-GCNFormer Network
- URL: http://arxiv.org/abs/2310.16288v1
- Date: Wed, 25 Oct 2023 01:46:35 GMT
- Title: MotionAGFormer: Enhancing 3D Human Pose Estimation with a
Transformer-GCNFormer Network
- Authors: Soroush Mehraban, Vida Adeli, Babak Taati
- Abstract summary: We present a novel Attention-GCNFormer block that splits the channels between two parallel transformer and GCNFormer streams.
Our proposed GCNFormer module exploits the local relationship between adjacent joints, outputting a new representation that is complementary to the transformer output.
We evaluate our model on two popular benchmark datasets: Human3.6M and MPI-INF-3DHP.
- Score: 2.7268855969580166
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent transformer-based approaches have demonstrated excellent performance
in 3D human pose estimation. However, they take a holistic view and, by encoding
global relationships between all the joints, do not capture local
dependencies precisely. In this paper, we present a novel Attention-GCNFormer
(AGFormer) block that splits the channels between two parallel
transformer and GCNFormer streams. Our proposed GCNFormer module exploits the
local relationship between adjacent joints, outputting a new representation
that is complementary to the transformer output. By fusing these two
representations adaptively, AGFormer is better able to learn the underlying 3D
structure. By stacking multiple AGFormer blocks, we
propose MotionAGFormer in four different variants, which can be chosen based on
the speed-accuracy trade-off. We evaluate our model on two popular benchmark
datasets: Human3.6M and MPI-INF-3DHP. MotionAGFormer-B achieves
state-of-the-art results, with P1 errors of 38.4mm and 16.2mm, respectively.
Remarkably, it uses a quarter of the parameters and is three times more
computationally efficient than the previous leading model on the Human3.6M dataset.
Code and models are available at https://github.com/TaatiTeam/MotionAGFormer.
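As a rough illustration of the block described in the abstract, here is a minimal PyTorch-style sketch of a dual-stream layer that splits channels between a self-attention stream and a graph stream and fuses them adaptively. It is not the authors' implementation; the channel split, the learned adjacency, and the fusion gate are illustrative assumptions (see the official repository above for the real code).

    import torch
    import torch.nn as nn

    class AGFormerBlockSketch(nn.Module):
        """Illustrative dual-stream block: a transformer stream models global
        joint relations, a graph stream models local (adjacent-joint) relations,
        and the two are fused adaptively. Details are assumptions, not the
        paper's exact design."""
        def __init__(self, dim, num_joints=17, num_heads=4):
            super().__init__()
            half = dim // 2  # each stream gets half of the channels
            self.attn = nn.MultiheadAttention(half, num_heads, batch_first=True)
            self.norm_attn = nn.LayerNorm(half)
            self.adj = nn.Parameter(torch.eye(num_joints))  # learnable joint adjacency
            self.gcn_proj = nn.Linear(half, half)
            self.norm_gcn = nn.LayerNorm(half)
            self.fuse = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())  # adaptive gate
            self.out = nn.Linear(dim, dim)

        def forward(self, x):
            # x: (batch, num_joints, dim) pose tokens for one frame
            x_t, x_g = x.chunk(2, dim=-1)
            # Global stream: every joint attends to every other joint.
            a, _ = self.attn(x_t, x_t, x_t)
            x_t = self.norm_attn(x_t + a)
            # Local stream: propagate features along the (learned) adjacency.
            g = torch.softmax(self.adj, dim=-1) @ self.gcn_proj(x_g)
            x_g = self.norm_gcn(x_g + g)
            # Adaptive fusion of the two complementary representations.
            cat = torch.cat([x_t, x_g], dim=-1)
            return self.out(self.fuse(cat) * cat)

Stacking several such blocks over per-frame joint tokens, with temporal mixing between them, is the high-level recipe the abstract describes.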
Related papers
- S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial in enhancing holistic cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z) - Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation [73.31524865643709]
We present a plug-and-play pruning-and-recovering framework, called Hourglass Tokenizer (HoT), for efficient transformer-based 3D pose estimation from videos.
Our HoT begins with pruning pose tokens of redundant frames and ends with recovering full-length tokens, resulting in a few pose tokens in the intermediate transformer blocks.
Our method can achieve both high efficiency and estimation accuracy compared to the original VPT models.
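A rough sketch of that prune-and-recover idea follows, assuming a toy importance score (token norm) and a nearest-frame recovery rule; HoT's actual token selection and recovery modules differ.

    import torch

    def prune_and_recover_sketch(tokens, keep_ratio=0.25):
        """tokens: (batch, frames, dim) pose tokens.
        Keep a few representative frame tokens for the middle of the network,
        then recover a full-length sequence at the end. The scoring and the
        nearest-frame recovery used here are stand-ins for illustration."""
        b, t, d = tokens.shape
        k = max(1, int(t * keep_ratio))
        scores = tokens.norm(dim=-1)                        # (b, t) frame importance
        keep_idx, _ = scores.topk(k, dim=1)[1].sort(dim=1)  # indices of kept frames
        kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
        # ... the intermediate transformer blocks would run on the k kept tokens ...
        # Recovery: give every original frame the token of its nearest kept frame.
        frames = torch.arange(t, device=tokens.device).view(1, t, 1)
        nearest = (frames - keep_idx.unsqueeze(1)).abs().argmin(dim=-1)   # (b, t)
        return torch.gather(kept, 1, nearest.unsqueeze(-1).expand(-1, -1, d))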
arXiv Detail & Related papers (2023-11-20T18:59:51Z) - MatFormer: Nested Transformer for Elastic Inference [94.1789252941718]
MatFormer is a nested Transformer architecture designed to offer elasticity in a variety of deployment constraints.
We show that a 2.6B decoder-only MatFormer language model (MatLM) allows us to extract smaller models spanning from 1.5B to 2.6B.
We also observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval.
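One way to picture the nesting is as sub-models that reuse a leading slice of the full model's weights, so a single set of parameters covers several compute budgets. A hedged sketch for a feed-forward layer follows; the slicing scheme is an assumption for illustration, not MatFormer's exact formulation.

    import torch.nn as nn
    import torch.nn.functional as F

    class NestedFFNSketch(nn.Module):
        """Elastic feed-forward layer: smaller sub-models use only the first
        `width` hidden units of the shared weight matrices."""
        def __init__(self, dim=512, hidden=2048, widths=(512, 1024, 2048)):
            super().__init__()
            self.up = nn.Linear(dim, hidden)
            self.down = nn.Linear(hidden, dim)
            self.widths = widths  # selectable hidden widths, smallest to largest

        def forward(self, x, width=None):
            m = width or self.widths[-1]
            h = F.relu(F.linear(x, self.up.weight[:m], self.up.bias[:m]))
            return F.linear(h, self.down.weight[:, :m], self.down.bias)

At deployment time one would pick `width` to match the latency budget, while all widths share parameters through the common slices.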
arXiv Detail & Related papers (2023-10-11T17:57:14Z) - AMPose: Alternately Mixed Global-Local Attention Model for 3D Human Pose
Estimation [2.9823712604345993]
We propose a novel method that combines the global and physically connected relations among joints for 3D human pose estimation.
In the AMPose, the Transformer encoder is applied to connect each joint with all the other joints, while GCNs are applied to capture information on physically connected relations.
Our model also shows better generalization ability by testing on the MPI-INF-3DHP dataset.
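A minimal sketch of that alternating global/local pattern, assuming a fixed skeleton adjacency matrix and a plain residual graph layer (both are illustrative, not AMPose's exact blocks):

    import torch
    import torch.nn as nn

    class AlternatingGlobalLocalSketch(nn.Module):
        """Alternates a transformer layer (every joint attends to every other
        joint) with a graph layer restricted to physically connected joints."""
        def __init__(self, dim, adjacency, depth=4, heads=4):
            super().__init__()
            self.register_buffer('adj', adjacency.float())  # (J, J) skeleton adjacency
            self.attn_layers = nn.ModuleList(
                [nn.TransformerEncoderLayer(dim, heads, dim * 2, batch_first=True)
                 for _ in range(depth)])
            self.gcn_layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])

        def forward(self, x):
            # x: (batch, num_joints, dim)
            norm_adj = self.adj / self.adj.sum(-1, keepdim=True).clamp(min=1)
            for attn, gcn in zip(self.attn_layers, self.gcn_layers):
                x = attn(x)                            # global: all-joint attention
                x = x + torch.relu(norm_adj @ gcn(x))  # local: physically connected joints
            return x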
arXiv Detail & Related papers (2022-10-09T10:10:13Z) - K-Order Graph-oriented Transformer with GraAttention for 3D Pose and
Shape Estimation [20.711789781518753]
We propose a novel attention-based 2D-to-3D pose estimation network for graph-structured data, named KOG-Transformer.
We also propose a 3D pose-to-shape estimation network for hand data, named GASE-Net.
arXiv Detail & Related papers (2022-08-24T06:54:03Z) - Jointformer: Single-Frame Lifting Transformer with Error Prediction and
Refinement for 3D Human Pose Estimation [11.592567773739407]
3D human pose estimation technologies have the potential to greatly increase the availability of human movement data.
The best-performing models for single-image 2D-3D lifting use graph convolutional networks (GCNs) that typically require some manual input to define the relationships between different body joints.
We propose a novel transformer-based approach that uses the more generalised self-attention mechanism to learn these relationships.
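The contrast with GCN-based lifting can be sketched as follows: joint coordinates become tokens and self-attention learns the inter-joint relationships from data rather than from a hand-defined adjacency. Dimensions and depth here are illustrative, and Jointformer's error-prediction and refinement stages are omitted.

    import torch
    import torch.nn as nn

    class JointSelfAttentionLiftingSketch(nn.Module):
        """Single-frame 2D-to-3D lifting with self-attention over joint tokens;
        no skeleton adjacency is supplied, the relations are learned."""
        def __init__(self, num_joints=17, dim=64, depth=3, heads=4):
            super().__init__()
            self.embed = nn.Linear(2, dim)                         # (x, y) -> token
            self.joint_pos = nn.Parameter(torch.zeros(1, num_joints, dim))
            layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, depth)
            self.head = nn.Linear(dim, 3)                          # per-joint 3D output

        def forward(self, joints_2d):
            # joints_2d: (batch, num_joints, 2)
            tokens = self.embed(joints_2d) + self.joint_pos
            return self.head(self.encoder(tokens))                 # (batch, num_joints, 3)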
arXiv Detail & Related papers (2022-08-07T12:07:19Z) - CrossFormer: Cross Spatio-Temporal Transformer for 3D Human Pose
Estimation [24.08170512746056]
3D human pose estimation can be handled by encoding the geometric dependencies between the body parts and enforcing the kinematic constraints.
Recent Transformer has been adopted to encode the long-range dependencies between the joints in the spatial and temporal domains.
We propose a novel pose estimation Transformer featuring rich representations of body joints critical for capturing subtle changes across frames.
arXiv Detail & Related papers (2022-03-24T23:40:11Z) - MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose
Estimation in Video [75.23812405203778]
Recent solutions have been introduced to estimate 3D human pose from a 2D keypoint sequence by considering body joints among all frames globally to learn spatio-temporal correlation.
We propose MixSTE, which has a temporal transformer block to separately model the temporal motion of each joint and a spatial transformer block to learn inter-joint spatial correlation.
In addition, the network output is extended from the central frame to the entire frames of the input video, improving the coherence between the input and output sequences.
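A compact sketch of that seq2seq spatio-temporal alternation on a (batch, frames, joints, channels) tensor; depths and sizes are illustrative rather than MixSTE's configuration.

    import torch.nn as nn

    class SpatioTemporalAlternationSketch(nn.Module):
        """Alternates spatial blocks (across joints within a frame) with temporal
        blocks (across frames for each joint), producing outputs for every frame."""
        def __init__(self, dim=128, depth=4, heads=8):
            super().__init__()
            def layer():
                return nn.TransformerEncoderLayer(dim, heads, dim * 2, batch_first=True)
            self.spatial = nn.ModuleList([layer() for _ in range(depth)])
            self.temporal = nn.ModuleList([layer() for _ in range(depth)])

        def forward(self, x):
            # x: (batch, frames, joints, dim)
            b, t, j, c = x.shape
            for sp, tp in zip(self.spatial, self.temporal):
                x = sp(x.reshape(b * t, j, c)).reshape(b, t, j, c)        # inter-joint
                x = x.permute(0, 2, 1, 3).reshape(b * j, t, c)
                x = tp(x).reshape(b, j, t, c).permute(0, 2, 1, 3)         # per-joint motion
            return x  # predictions for all frames, not just the central one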
arXiv Detail & Related papers (2022-03-02T04:20:59Z) - Mesh Graphormer [17.75480888764098]
We present a graph-convolution-reinforced transformer, named Mesh Graphormer, for 3D human pose and mesh reconstruction from a single image.
arXiv Detail & Related papers (2021-04-01T06:16:36Z) - Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with
Transformers [115.90778814368703]
Our objective is language-based search of large-scale image and video datasets.
For this task, the approach that consists of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive as retrieval scales.
An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings.
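The efficiency argument can be made concrete with a small sketch: dual-encoder retrieval reduces to a single matrix product over precomputed embeddings, whereas cross-attention requires one joint forward pass per (query, image) pair. `joint_model` below is an assumed callable, not an API from the paper.

    import torch

    def dual_encoder_scores(text_emb, image_emb):
        """text_emb: (Q, d), image_emb: (N, d), both L2-normalised.
        Image embeddings can be precomputed and indexed; scoring is one matmul."""
        return text_emb @ image_emb.T                       # (Q, N) similarities

    def cross_attention_scores(texts, images, joint_model):
        """joint_model(text, image) -> scalar relevance (assumed interface).
        More accurate, but needs Q * N forward passes at query time."""
        return torch.stack([
            torch.stack([joint_model(t, v) for v in images]) for t in texts
        ])                                                  # (Q, N)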
arXiv Detail & Related papers (2021-03-30T17:57:08Z) - 3D Human Pose Estimation with Spatial and Temporal Transformers [59.433208652418976]
We present PoseFormer, a purely transformer-based approach for 3D human pose estimation in videos.
Inspired by recent developments in vision transformers, we design a spatial-temporal transformer structure.
We quantitatively and qualitatively evaluate our method on two popular and standard benchmark datasets.
arXiv Detail & Related papers (2021-03-18T18:14:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.