STAF: 3D Human Mesh Recovery from Video with Spatio-Temporal Alignment Fusion
- URL: http://arxiv.org/abs/2401.01730v1
- Date: Wed, 3 Jan 2024 13:07:14 GMT
- Title: STAF: 3D Human Mesh Recovery from Video with Spatio-Temporal Alignment Fusion
- Authors: Wei Yao, Hongwen Zhang, Yunlian Sun, and Jinhui Tang
- Abstract summary: Existing models usually ignore spatial and temporal information, which can lead to mesh-image misalignment and temporal discontinuity.
As a video-based model, STAF leverages coherence cues from human motion through an attention-based Temporal Coherence Fusion Module.
In addition, an Average Pooling Module (APM) allows the model to attend to the entire input sequence rather than just the target frame.
- Score: 35.42718669331158
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recovery of 3D human mesh from monocular images has advanced
significantly in recent years. However, existing models usually ignore spatial
and temporal information, which can lead to mesh-image misalignment and
temporal discontinuity. For this reason, we propose a novel Spatio-Temporal
Alignment Fusion (STAF) model. As a video-based model, it leverages coherence
cues from human motion through an attention-based Temporal Coherence Fusion
Module (TCFM). For spatial mesh-alignment evidence, we extract fine-grained
local information by projecting the predicted mesh onto the feature maps.
Based on these spatial features, we further introduce a multi-stage adjacent
Spatial Alignment Fusion Module (SAFM) to enhance the feature representation
of the target frame. In addition, we propose an Average Pooling Module (APM)
that allows the model to attend to the entire input sequence rather than just
the target frame, which markedly improves the smoothness of recovery results
on video. Extensive experiments on 3DPW, MPII3D, and H36M demonstrate the
superiority of STAF, which achieves a state-of-the-art trade-off between
precision and smoothness. Our code and more video results are available on the
project page: https://yw0208.github.io/staf/
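
To make the two sequence-level components concrete, below is a minimal PyTorch sketch of an attention-based temporal fusion step and a whole-sequence average-pooling step, in the roles the abstract assigns to TCFM and APM. Module names, tensor shapes, and the residual design are illustrative assumptions, not the authors' implementation (which is linked from the project page).

```python
# Minimal sketch (not the authors' code): temporal self-attention over
# per-frame features (TCFM-like) plus sequence average pooling (APM-like).
import torch
import torch.nn as nn

class TemporalCoherenceFusion(nn.Module):
    """Self-attention across the T frames of a feature sequence."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, dim) per-frame features
        fused, _ = self.attn(feats, feats, feats)
        return self.norm(feats + fused)  # residual keeps per-frame identity

class AveragePoolingModule(nn.Module):
    """Pools the whole sequence so the target frame sees global context."""
    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return feats.mean(dim=1)  # (B, T, dim) -> (B, dim)

B, T, D = 2, 9, 256
feats = torch.randn(B, T, D)
seq = TemporalCoherenceFusion(D)(feats)  # temporally fused features
target = seq[:, T // 2]                  # mid-sequence target frame
context = AveragePoolingModule()(seq)    # whole-sequence summary
print(target.shape, context.shape)       # (2, 256) and (2, 256)
```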
Related papers
- Enhancing 3D Human Pose Estimation Amidst Severe Occlusion with Dual Transformer Fusion [13.938406073551844]
This paper introduces a Dual Transformer Fusion (DTF) algorithm, a novel approach for obtaining holistic 3D pose estimates.
To enable precise 3D human pose estimation, the approach leverages the DTF architecture, which first generates a pair of intermediate views.
Our approach outperforms existing state-of-the-art methods on both datasets, yielding substantial improvements.
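
As a rough illustration of the dual-view idea, the sketch below encodes a hypothetical pair of intermediate views with transformer encoders and fuses them by cross-attention into per-joint 3D estimates. All names, dimensions, and the fusion rule are assumptions, not the DTF paper's actual design.

```python
# Hypothetical dual-view fusion: encode two views, fuse by cross-attention.
import torch
import torch.nn as nn

class DualViewFusion(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.enc_a = nn.TransformerEncoder(layer, num_layers=2)
        self.enc_b = nn.TransformerEncoder(layer, num_layers=2)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 3)  # (x, y, z) per joint

    def forward(self, view_a, view_b):
        # view_a, view_b: (B, J, dim) per-joint tokens from two views
        a, b = self.enc_a(view_a), self.enc_b(view_b)
        fused, _ = self.cross(a, b, b)  # view A queries view B
        return self.head(fused)         # (B, J, 3) 3D joints

out = DualViewFusion()(torch.randn(2, 17, 128), torch.randn(2, 17, 128))
print(out.shape)  # torch.Size([2, 17, 3])
```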
arXiv Detail & Related papers (2024-10-06T18:15:27Z)
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modeling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
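
The "evenly distributed" clusters above point to Sinkhorn-Knopp normalization. Below is a generic Sinkhorn routine, in the style popularized by self-supervised clustering methods, that turns feature-to-prototype similarities into a balanced soft assignment; it is an illustration of the technique, not SIGMA's training code, and `eps` and the iteration count are assumed values.

```python
# Sinkhorn-Knopp: balance an assignment matrix so clusters are used evenly.
import torch

@torch.no_grad()
def sinkhorn(scores: torch.Tensor, eps: float = 0.05, iters: int = 3):
    # scores: (N, K) similarity of N tube features to K cluster prototypes
    q = torch.exp(scores / eps)
    q /= q.sum()
    N, K = q.shape
    for _ in range(iters):
        q /= q.sum(dim=0, keepdim=True); q /= K  # balance cluster columns
        q /= q.sum(dim=1, keepdim=True); q /= N  # normalize per feature
    return q * N  # each row sums to 1: soft assignment per feature

feats = torch.nn.functional.normalize(torch.randn(512, 64), dim=1)
protos = torch.nn.functional.normalize(torch.randn(64, 64), dim=1)
assign = sinkhorn(feats @ protos.t())
print(assign.sum(dim=1)[:3])    # ~1 per feature
print(assign.sum(dim=0).std())  # small: near-even cluster usage
```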
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
- LEF: Late-to-Early Temporal Fusion for LiDAR 3D Object Detection [40.267769862404684]
We propose a late-to-early recurrent feature fusion scheme for 3D object detection using temporal LiDAR point clouds.
Our main motivation is fusing object-aware latent embeddings into the early stages of a 3D object detector.
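
A toy recurrent loop can make the late-to-early pattern concrete: embeddings produced late in frame t-1 are concatenated into the early-stage features of frame t. Layer shapes and the concatenation-based fusion are assumptions for illustration, not the LEF architecture.

```python
# Toy late-to-early recurrent fusion over a short frame sequence.
import torch
import torch.nn as nn

class LateToEarlyFusion(nn.Module):
    def __init__(self, early_ch=64, late_ch=32):
        super().__init__()
        self.early = nn.Conv2d(16, early_ch, 3, padding=1)      # early stage
        self.fuse = nn.Conv2d(early_ch + late_ch, early_ch, 1)  # mix in history
        self.late = nn.Conv2d(early_ch, late_ch, 3, padding=1)  # late embeddings

    def forward(self, frame, prev_late):
        # frame: (B, 16, H, W); prev_late: (B, late_ch, H, W) from frame t-1
        x = self.early(frame)
        x = self.fuse(torch.cat([x, prev_late], dim=1))  # early-stage fusion
        late = self.late(x)                              # carried to frame t+1
        return x, late

B, H, W = 1, 32, 32
model = LateToEarlyFusion()
prev_late = torch.zeros(B, 32, H, W)  # no history at t = 0
for t in range(3):                    # recurrent over a 3-frame clip
    feats, prev_late = model(torch.randn(B, 16, H, W), prev_late)
print(feats.shape, prev_late.shape)
```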
arXiv Detail & Related papers (2023-09-28T21:58:25Z)
- DSFNet: Dual Space Fusion Network for Occlusion-Robust 3D Dense Face Alignment [34.223372986832544]
State-of-the-art 3DMM-based methods directly regress the model's coefficients.
We propose a fusion network that combines the advantages of both the image and model space predictions.
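
The sketch below shows one way a dual-space fusion could look: a branch regressing 3DMM coefficients (model space), a branch predicting a dense image-space map, and a small head combining both. All layer sizes and the fusion head are illustrative assumptions, not DSFNet's architecture.

```python
# Toy dual-space fusion: model-space coefficients + image-space dense map.
import torch
import torch.nn as nn

class DualSpaceFusion(nn.Module):
    def __init__(self, n_coeff=62):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
                                      nn.AdaptiveAvgPool2d(8))
        self.model_head = nn.Linear(32 * 8 * 8, n_coeff)     # model space
        self.image_head = nn.Conv2d(32, 3, 1)                # dense map
        self.fuse = nn.Linear(n_coeff + 3 * 8 * 8, n_coeff)  # combine spaces

    def forward(self, img):
        f = self.backbone(img)                 # (B, 32, 8, 8)
        coeff = self.model_head(f.flatten(1))  # robust, low-detail branch
        dense = self.image_head(f).flatten(1)  # detailed, occlusion-sensitive
        return self.fuse(torch.cat([coeff, dense], dim=1))

out = DualSpaceFusion()(torch.randn(2, 3, 64, 64))
print(out.shape)  # torch.Size([2, 62])
```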
arXiv Detail & Related papers (2023-05-19T08:43:37Z)
- Stand-Alone Inter-Frame Attention in Video Models [164.06137994796487]
We present a new recipe for inter-frame attention blocks, namely Stand-alone Inter-Frame Attention (SIFA).
SIFA remoulds the deformable design via re-scaling the offset predictions by the difference between two frames.
We further plug the SIFA block into ConvNets and Vision Transformers, respectively, to devise SIFA-Net and SIFA-Transformer.
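
A simplified toy version of the mechanism summarized above: sampling offsets into the neighboring frame are predicted from the difference between the two frames, the deformed neighbor is gathered with grid sampling, and a per-pixel affinity weights the fusion. The real SIFA block has re-scaling and multi-head details this sketch omits; everything here is an assumption for illustration.

```python
# Toy frame-difference-conditioned deformable sampling with attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySIFA(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.offset = nn.Conv2d(ch, 2, 3, padding=1)  # (dx, dy) per pixel

    def forward(self, cur, nxt):
        # cur, nxt: (B, C, H, W) features of two consecutive frames
        B, C, H, W = cur.shape
        off = torch.tanh(self.offset(cur - nxt))  # offsets from frame difference
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                                torch.linspace(-1, 1, W), indexing="ij")
        base = torch.stack([xs, ys], dim=-1).expand(B, H, W, 2)
        grid = base + off.permute(0, 2, 3, 1) * (2.0 / max(H, W))
        warped = F.grid_sample(nxt, grid, align_corners=True)
        attn = torch.sigmoid((cur * warped).sum(1, keepdim=True))  # affinity
        return cur + attn * warped

out = ToySIFA()(torch.randn(1, 32, 16, 16), torch.randn(1, 32, 16, 16))
print(out.shape)  # torch.Size([1, 32, 16, 16])
```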
arXiv Detail & Related papers (2022-06-14T15:51:28Z)
- Learning Dynamic View Synthesis With Few RGBD Cameras [60.36357774688289]
We propose to utilize RGBD cameras to synthesize free-viewpoint videos of dynamic indoor scenes.
We generate point clouds from RGBD frames and then render them into free-viewpoint videos via neural feature rendering.
We introduce a simple Regional Depth-Inpainting module that adaptively inpaints missing depth values to render complete novel views.
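
The first step of this pipeline, unprojecting an RGBD frame into a colored point cloud, is standard pinhole back-projection; a minimal sketch follows. The intrinsics are placeholder values, and the neural rendering and depth-inpainting stages are not shown.

```python
# Pinhole back-projection of an RGBD frame to a colored point cloud.
import torch

def rgbd_to_points(rgb, depth, fx, fy, cx, cy):
    # rgb: (3, H, W) in [0, 1]; depth: (H, W) in meters
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    z = depth
    x = (u - cx) * z / fx  # pinhole back-projection
    y = (v - cy) * z / fy
    pts = torch.stack([x, y, z], dim=-1).reshape(-1, 3)
    cols = rgb.permute(1, 2, 0).reshape(-1, 3)
    valid = pts[:, 2] > 0  # drop missing-depth pixels
    return pts[valid], cols[valid]

H, W = 48, 64
pts, cols = rgbd_to_points(torch.rand(3, H, W), torch.rand(H, W) * 3.0,
                           fx=60.0, fy=60.0, cx=W / 2, cy=H / 2)
print(pts.shape, cols.shape)  # equal numbers of points and colors
```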
arXiv Detail & Related papers (2022-04-22T03:17:35Z)
- Capturing Humans in Motion: Temporal-Attentive 3D Human Pose and Shape Estimation from Monocular Video [24.217269857183233]
We propose a motion pose and shape network (MPS-Net) to capture humans in motion and estimate 3D human pose and shape from video.
Specifically, we first propose a motion continuity attention (MoCA) module that leverages visual cues observed from human motion to adaptively recalibrate the range that needs attention in the sequence.
By coupling the MoCA and hierarchical attentive feature integration (HAFI) modules, the proposed MPS-Net excels at estimating 3D human pose and shape from video.
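
A loose sketch of the MoCA idea as summarized above: a motion cue per frame modulates how temporal self-attention is distributed. Here frame-to-frame feature change simply biases the attention logits; the real module recalibrates the attention range in a more elaborate way, and all shapes and the biasing rule are assumptions.

```python
# Toy motion-gated temporal attention over per-frame features.
import torch
import torch.nn as nn

class MotionGatedAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.dim = dim

    def forward(self, feats):
        # feats: (B, T, dim) per-frame features
        motion = (feats[:, 1:] - feats[:, :-1]).norm(dim=-1)  # (B, T-1)
        motion = torch.cat([motion[:, :1], motion], dim=1)    # pad frame 0
        q, k, v = self.qkv(feats).chunk(3, dim=-1)
        logits = q @ k.transpose(1, 2) / self.dim ** 0.5      # (B, T, T)
        logits = logits + motion.unsqueeze(1)  # high-motion frames get focus
        return torch.softmax(logits, dim=-1) @ v

out = MotionGatedAttention(64)(torch.randn(2, 9, 64))
print(out.shape)  # torch.Size([2, 9, 64])
```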
arXiv Detail & Related papers (2022-03-16T11:00:24Z)
- DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection [83.18142309597984]
Lidars and cameras are critical sensors that provide complementary information for 3D detection in autonomous driving.
We develop DeepFusion, a family of generic multi-modal 3D detection models that are more accurate than previous methods.
arXiv Detail & Related papers (2022-03-15T18:46:06Z)
- Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to semantically localize a target segment in an untrimmed video according to a given sentence query.
Most previous works learn frame-level features of each whole frame in the video and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
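
A schematic sketch of the three-stream idea: motion, appearance, and 3D-aware features are projected to a shared space, fused, and scored against a sentence embedding per frame. All dimensions, the fusion layer, and the dot-product scoring rule are assumptions, not MA3SRN's design.

```python
# Toy three-stream fusion scored against a sentence embedding.
import torch
import torch.nn as nn

class ThreeStreamGrounding(nn.Module):
    def __init__(self, d_motion=128, d_app=256, d_3d=64, dim=128):
        super().__init__()
        self.proj = nn.ModuleList(
            [nn.Linear(d, dim) for d in (d_motion, d_app, d_3d)])
        self.fuse = nn.Linear(3 * dim, dim)
        self.text = nn.Linear(300, dim)  # e.g. from averaged word vectors

    def forward(self, motion, app, feat3d, query):
        # each stream: (B, T, d_*); query: (B, 300)
        streams = [p(x) for p, x in zip(self.proj, (motion, app, feat3d))]
        vid = self.fuse(torch.cat(streams, dim=-1))  # (B, T, dim)
        q = self.text(query).unsqueeze(1)            # (B, 1, dim)
        return (vid * q).sum(-1)  # (B, T) per-frame relevance to the query

B, T = 2, 20
scores = ThreeStreamGrounding()(torch.randn(B, T, 128), torch.randn(B, T, 256),
                                torch.randn(B, T, 64), torch.randn(B, 300))
print(scores.shape)  # torch.Size([2, 20])
```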
arXiv Detail & Related papers (2022-03-06T13:57:09Z)
- Spatio-Temporal Self-Attention Network for Video Saliency Prediction [13.873682190242365]
3D convolutional neural networks have achieved promising results for video tasks in computer vision.
We propose a novel Spatio-Temporal Self-Attention 3D Network (STSANet) for video saliency prediction.
arXiv Detail & Related papers (2021-08-24T12:52:47Z)
- Self-Attentive 3D Human Pose and Shape Estimation from Videos [82.63503361008607]
We present a video-based learning algorithm for 3D human pose and shape estimation.
We exploit temporal information in videos and propose a self-attention module.
We evaluate our method on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets.
arXiv Detail & Related papers (2021-03-26T00:02:19Z)