Related papers: Spectral Compression Transformer with Line Pose Graph for Monocular 3D Human Pose Estimation

URL: http://arxiv.org/abs/2505.21309v1
Date: Tue, 27 May 2025 15:08:03 GMT
Title: Spectral Compression Transformer with Line Pose Graph for Monocular 3D Human Pose Estimation
Authors: Zenghao Zheng, Lianping Yang, Hegui Zhu, Mingrui Ye
Abstract summary: We introduce the Spectral Compression Transformer (SCT) to reduce sequence length and accelerate computation. The LPG generates skeletal position information that complements the input 2D joint positions. Our model achieves state-of-the-art performance with improved computational efficiency.
Score: 1.8999296421549172
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transformer-based 3D human pose estimation methods suffer from high computational costs due to the quadratic complexity of self-attention with respect to sequence length. Additionally, pose sequences often contain significant redundancy between frames. However, recent methods typically fail to improve model capacity while effectively eliminating sequence redundancy. In this work, we introduce the Spectral Compression Transformer (SCT) to reduce sequence length and accelerate computation. The SCT encoder treats hidden features between blocks as Temporal Feature Signals (TFS) and applies the Discrete Cosine Transform, a Fourier transform-based technique, to determine the spectral components to be retained. By filtering out certain high-frequency noise components, SCT compresses the sequence length and reduces redundancy. To further enrich the input sequence with prior structural information, we propose the Line Pose Graph (LPG) based on line graph theory. The LPG generates skeletal position information that complements the input 2D joint positions, thereby improving the model's performance. Finally, we design a dual-stream network architecture to effectively model spatial joint relationships and the compressed motion trajectory within the pose sequence. Extensive experiments on two benchmark datasets (i.e., Human3.6M and MPI-INF-3DHP) demonstrate that our model achieves state-of-the-art performance with improved computational efficiency. For example, on the Human3.6M dataset, our method achieves an MPJPE of 37.7mm while maintaining a low computational cost. Furthermore, we perform ablation studies on each module to assess its effectiveness. The code and models will be released.
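
The spectral compression idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract does not say how SCT decides which spectral components to retain, so this sketch simply assumes the lowest `keep` DCT coefficients along the time axis are kept. The function names (`dct_matrix`, `compress`, `reconstruct`) and the toy signal are hypothetical.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis as an (n, n) matrix; row k is frequency k."""
    k = np.arange(n)[:, None]   # frequency index
    t = np.arange(n)[None, :]   # time (frame) index
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    basis[0] /= np.sqrt(2.0)    # rescale the DC row so the matrix is orthogonal
    return basis


def compress(features: np.ndarray, keep: int) -> np.ndarray:
    """Shorten a (T, D) sequence to (keep, D).

    Assumption: the lowest `keep` frequencies are retained; the paper's
    actual selection rule is not given in the abstract.
    """
    basis = dct_matrix(features.shape[0])
    return basis[:keep] @ features


def reconstruct(coeffs: np.ndarray, length: int) -> np.ndarray:
    """Approximate the original (length, D) sequence from retained coefficients."""
    basis = dct_matrix(length)
    return basis[: coeffs.shape[0]].T @ coeffs


if __name__ == "__main__":
    frames, dim, keep = 81, 4, 9
    t = np.linspace(0.0, 1.0, frames)[:, None]
    # Slowly varying toy signal standing in for temporal hidden features.
    signal = np.cos(np.pi * t * np.arange(1, dim + 1))
    short = compress(signal, keep)       # sequence length 81 -> 9
    approx = reconstruct(short, frames)  # back to 81 frames
    print(short.shape, float(np.abs(approx - signal).max()))
```

Because self-attention cost grows quadratically with sequence length, attention over the `keep` retained coefficients is cheaper than attention over all `frames` time steps, which is the efficiency argument the abstract makes.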

Related papers

Multi-Modal Graph Convolutional Network with Sinusoidal Encoding for Robust Human Action Segmentation [10.122882293302787]
Temporal segmentation of human actions is critical for intelligent robots in collaborative settings. We propose a Multi-Modal Graph Convolutional Network (MMGCN) that integrates low-frame-rate (e.g., 1 fps) visual data with high-frame-rate (e.g., 30 fps) motion data. Our approach outperforms state-of-the-art methods, especially in action segmentation accuracy.
arXiv Detail & Related papers (2025-07-01T13:55:57Z)
RobustSplat: Decoupling Densification and Dynamics for Transient-Free 3DGS [79.15416002879239]
3D Gaussian Splatting has gained significant attention for its real-time, photo-realistic rendering in novel-view synthesis and 3D modeling. Existing methods struggle with accurately modeling scenes affected by transient objects, leading to artifacts in the rendered images. We propose RobustSplat, a robust solution based on two critical designs.
arXiv Detail & Related papers (2025-06-03T11:13:48Z)
3DGS Compression with Sparsity-guided Hierarchical Transform Coding [19.575833741231953]
Sparsity-guided Hierarchical Transform Coding (SHTC) is the first end-to-end optimized transform coding framework for 3DGS compression. SHTC jointly optimizes the 3DGS, transforms, and a lightweight context model. This novel design significantly improves R-D performance with minimal additional parameters and computational overhead.
arXiv Detail & Related papers (2025-05-28T22:17:24Z)
Graph and Skipped Transformer: Exploiting Spatial and Temporal Modeling Capacities for Efficient 3D Human Pose Estimation [36.93661496405653]
We take a global approach to exploit spatio-temporal information with a concise Graph and Skipped Transformer architecture. Specifically, in the 3D pose stage, coarse-grained body parts are deployed to construct a fully data-driven adaptive model. Experiments are conducted on the Human3.6M, MPI-INF-3DHP and HumanEva benchmarks.
arXiv Detail & Related papers (2024-07-03T10:42:09Z)
Coarse-Fine Spectral-Aware Deformable Convolution For Hyperspectral Image Reconstruction [15.537910100051866]
We study the inverse problem of Coded Aperture Snapshot Spectral Imaging (CASSI). We propose the Coarse-Fine Spectral-Aware Deformable Convolution Network (CFSDCN). Our CFSDCN significantly outperforms previous state-of-the-art (SOTA) methods on both simulated and real HSI datasets.
arXiv Detail & Related papers (2024-06-18T15:15:12Z)
SMPLer: Taming Transformers for Monocular 3D Human Shape and Pose Estimation [74.07836010698801]
We propose an SMPL-based Transformer framework (SMPLer) to address this issue. SMPLer incorporates two key ingredients: a decoupled attention operation and an SMPL-based target representation. Extensive experiments demonstrate the effectiveness of SMPLer against existing 3D human shape and pose estimation methods.
arXiv Detail & Related papers (2024-04-23T17:59:59Z)
S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial in enhancing holistically cognitive intelligence in the operating room (OR). Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection. In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z)
Unfolding Framework with Prior of Convolution-Transformer Mixture and Uncertainty Estimation for Video Snapshot Compressive Imaging [7.601695814245209]
We consider the problem of video snapshot compressive imaging (SCI), where sequential high-speed frames are modulated by different masks and captured by a single measurement. By combining optimization algorithms and neural networks, deep unfolding networks (DUNs) have achieved remarkable success in solving inverse problems.
arXiv Detail & Related papers (2023-06-20T06:25:48Z)
UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks and efficiency in terms of parameters, compute cost, and inference speed. The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features. Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BraTS, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z)
MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video [75.23812405203778]
Recent solutions estimate 3D human pose from a 2D keypoint sequence by considering body joints among all frames globally to learn spatio-temporal correlation. We propose MixSTE, which has a temporal transformer block to separately model the temporal motion of each joint and a spatial transformer block to learn inter-joint spatial correlation. In addition, the network output is extended from the central frame to all frames of the input video, improving the coherence between the input and output sequences.
arXiv Detail & Related papers (2022-03-02T04:20:59Z)
PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered the solution of vision tasks with transformers; it directly translates the image feature map into the object detection result. Experiments with a recent transformer-based image recognition model also show a consistent efficiency gain.
arXiv Detail & Related papers (2021-09-15T01:10:30Z)
Lifting Transformer for 3D Human Pose Estimation in Video [27.005291611674377]
We propose a novel Transformer-based architecture, called Lifting Transformer, for 3D human pose estimation. A vanilla Transformer encoder (VTE) is adopted to model long-range dependencies of 2D pose sequences. A modified VTE, termed the strided Transformer encoder (STE), is built upon the outputs of the VTE.
arXiv Detail & Related papers (2021-03-26T07:35:08Z)

This list is automatically generated from the titles and abstracts of the papers on this site.