H$_{2}$OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers
- URL: http://arxiv.org/abs/2509.06956v1
- Date: Mon, 08 Sep 2025 17:59:59 GMT
- Title: H$_{2}$OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers
- Authors: Wenhao Li, Mengyuan Liu, Hong Liu, Pichao Wang, Shijian Lu, Nicu Sebe
- Abstract summary: We present a hierarchical plug-and-play pruning-and-recovering framework, called Hierarchical Hourglass Tokenizer (H$_{2}$OT). Our method is general-purpose: it can be easily incorporated into common VPT models on both seq2seq and seq2frame pipelines.
- Score: 124.11648300910444
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have been successfully applied in the field of video-based 3D human pose estimation. However, the high computational costs of these video pose transformers (VPTs) make them impractical on resource-constrained devices. In this paper, we present a hierarchical plug-and-play pruning-and-recovering framework, called Hierarchical Hourglass Tokenizer (H$_{2}$OT), for efficient transformer-based 3D human pose estimation from videos. H$_{2}$OT begins with progressively pruning pose tokens of redundant frames and ends with recovering full-length sequences, resulting in a few pose tokens in the intermediate transformer blocks and thus improving the model efficiency. It works with two key modules, namely, a Token Pruning Module (TPM) and a Token Recovering Module (TRM). TPM dynamically selects a few representative tokens to eliminate the redundancy of video frames, while TRM restores the detailed spatio-temporal information based on the selected tokens, thereby expanding the network output to the original full-length temporal resolution for fast inference. Our method is general-purpose: it can be easily incorporated into common VPT models on both seq2seq and seq2frame pipelines while effectively accommodating different token pruning and recovery strategies. In addition, our H$_{2}$OT reveals that maintaining the full pose sequence is unnecessary, and a few pose tokens of representative frames can achieve both high efficiency and estimation accuracy. Extensive experiments on multiple benchmark datasets demonstrate both the effectiveness and efficiency of the proposed method. Code and models are available at https://github.com/NationalGAILab/HoT.
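To make the pruning-and-recovering idea concrete, below is a minimal PyTorch-style sketch of the general pattern the abstract describes. The module and layer names (e.g. `PruneRecoverSketch`, the linear scoring layer) are hypothetical illustrations, not the authors' H$_{2}$OT implementation; see the linked repository for the real code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PruneRecoverSketch(nn.Module):
    """Hypothetical sketch of the prune-then-recover pattern: keep a few
    representative frame tokens in the intermediate blocks, then expand
    back to the full temporal length. Not the authors' implementation."""

    def __init__(self, dim, keep_frames):
        super().__init__()
        self.keep_frames = keep_frames        # K: representative frames kept
        self.score = nn.Linear(dim, 1)        # toy stand-in for the Token Pruning Module (TPM)
        self.recover = nn.Linear(dim, dim)    # toy stand-in for the Token Recovering Module (TRM)

    def forward(self, tokens, blocks):
        # tokens: (B, T, C) pose tokens for T frames; blocks: intermediate transformer blocks
        B, T, C = tokens.shape
        # Pruning: score every frame and keep the top-K representative ones.
        scores = self.score(tokens).squeeze(-1)                                  # (B, T)
        idx = scores.topk(self.keep_frames, dim=1).indices.sort(dim=1).values    # (B, K), in temporal order
        kept = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, C))      # (B, K, C)
        # The intermediate transformer blocks now process only K << T tokens.
        for blk in blocks:
            kept = blk(kept)
        # Recovery: expand back to T frames (here a nearest-neighbour copy plus a
        # linear layer; the paper's TRM restores detail from the selected tokens).
        full = F.interpolate(kept.transpose(1, 2), size=T, mode="nearest").transpose(1, 2)
        return self.recover(full)             # (B, T, C) full-length output for seq2seq use
```

A seq2frame variant would simply regress only the target frame after the recovery step; the sketch above follows the seq2seq case.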
Related papers
- Towards Scalable Modeling of Compressed Videos for Efficient Action Recognition [6.168286187549952]
We propose a hybrid end-to-end framework that factorizes learning across three key concepts to reduce inference cost by $330\times$ versus prior art. Experiments show that our method results in a lightweight architecture achieving state-of-the-art video recognition performance.
arXiv Detail & Related papers (2025-03-17T21:13:48Z) - Efficient Point Transformer with Dynamic Token Aggregating for LiDAR Point Cloud Processing [19.73918716354272]
LiDAR point cloud processing and analysis have made great progress due to the development of 3D Transformers. Existing 3D Transformer methods are usually computationally expensive and inefficient due to their huge and redundant attention maps. We propose an efficient point TransFormer with Dynamic Token Aggregating (DTA-Former) for point cloud representation and processing.
arXiv Detail & Related papers (2024-05-23T20:50:50Z) - Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation [73.31524865643709]
We present a plug-and-play pruning-and-recovering framework, called Hourglass Tokenizer (HoT), for efficient transformer-based 3D pose estimation from videos.
Our HoT begins with pruning pose tokens of redundant frames and ends with recovering full-length tokens, resulting in a few pose tokens in the intermediate transformer blocks.
Our method can achieve both high efficiency and estimation accuracy compared to the original VPT models.
arXiv Detail & Related papers (2023-11-20T18:59:51Z) - Fast Multipole Attention: A Scalable Multilevel Attention Mechanism for Text and Images [0.818198392834469]
We introduce Fast Multipole Attention (FMA), a divide-and-conquer mechanism for self-attention inspired by n-body physics. FMA reduces the time and memory complexity of self-attention from $\mathcal{O}(n^{2})$ to $\mathcal{O}(n \log n)$ while preserving full-context interactions. We have developed both 1D and 2D implementations of FMA for language and vision tasks, respectively.
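A back-of-the-envelope way to see where an $\mathcal{O}(n \log n)$ bound can come from in such a multilevel scheme (an illustrative sketch, not the authors' derivation): if each query attends to a constant number $k$ of tokens or token groups at each of roughly $\log_2 n$ resolution levels, the total cost is
$\mathrm{cost}(n) \approx \underbrace{n\,k}_{\text{per level}} \times \underbrace{\log_2 n}_{\text{levels}} = \mathcal{O}(n \log n)$, compared with $n \cdot n = \mathcal{O}(n^{2})$ for dense attention.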
arXiv Detail & Related papers (2023-10-18T13:40:41Z) - Efficient Video Action Detection with Token Dropout and Context Refinement [67.10895416008911]
We propose an end-to-end framework for efficient video action detection based on vision transformers (ViTs).
In a video clip, we maintain tokens from its keyframe while preserving tokens relevant to actor motions from other frames.
Second, we refine scene context by leveraging remaining tokens for better recognizing actor identities.
arXiv Detail & Related papers (2023-04-17T17:21:21Z) - Dual-path Adaptation from Image to Video Transformers [62.056751480114784]
We efficiently transfer the surpassing representation power of the vision foundation models, such as ViT and Swin, for video understanding with only a few trainable parameters.
We propose a novel DualPath adaptation separated into spatial and temporal adaptation paths, where a lightweight bottleneck adapter is employed in each transformer block.
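For reference, a lightweight bottleneck adapter of the kind mentioned here typically looks like the following minimal sketch (a generic adapter pattern with illustrative names and dimensions, not the DualPath authors' exact module):

```python
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic lightweight adapter: down-project, nonlinearity, up-project,
    plus a residual connection, inserted alongside a frozen transformer block."""

    def __init__(self, dim, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck_dim)   # only a few trainable parameters
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, dim)

    def forward(self, x):
        # x: (B, N, C) tokens from the (frozen) backbone block
        return x + self.up(self.act(self.down(x)))   # residual keeps the frozen path intact
```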
arXiv Detail & Related papers (2023-03-17T09:37:07Z) - Reversible Vision Transformers [74.3500977090597]
Reversible Vision Transformers are a memory efficient architecture for visual recognition.
We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants.
We find that the additional computational burden of recomputing activations is more than overcome for deeper models.
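The memory saving rests on the standard reversible-block identity (shown here as a generic sketch, with $F$ and $G$ standing for sub-blocks such as attention and MLP): intermediate activations need not be stored because they can be recomputed exactly from the outputs,
$y_1 = x_1 + F(x_2)$, $y_2 = x_2 + G(y_1)$, and conversely $x_2 = y_2 - G(y_1)$, $x_1 = y_1 - F(x_2)$.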
arXiv Detail & Related papers (2023-02-09T18:59:54Z) - Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition [34.98846882868107]
We propose a Temporal Patch Shift (TPS) method for efficient 3D self-attention modeling in transformers for video-based action recognition.
As a result, 3D self-attention can be computed at nearly the same computation and memory cost as 2D self-attention.
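As an illustration of the general patch-shift idea (a simplified sketch, not the exact TPS shift pattern from the paper; the function name and shift scheme are hypothetical):

```python
import torch

def temporal_patch_shift(x, shift_ratio=0.25):
    """Replace a fraction of each frame's patch tokens with the corresponding
    patches from neighbouring frames, so that plain per-frame (2D) self-attention
    also mixes temporal information. Simplified illustration only."""
    B, T, N, C = x.shape                      # batch, frames, patches per frame, channels
    n = int(N * shift_ratio)
    out = x.clone()
    out[:, 1:, :n] = x[:, :-1, :n]            # first n patches come from the previous frame
    out[:, :-1, n:2 * n] = x[:, 1:, n:2 * n]  # next n patches come from the following frame
    return out                                # feed to an ordinary 2D self-attention block
```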
arXiv Detail & Related papers (2022-07-27T02:47:07Z) - PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered solving vision tasks with transformers; it directly translates the image feature map into the object detection result.
The recent transformer-based image recognition model ViT also shows consistent efficiency gains.
arXiv Detail & Related papers (2021-09-15T01:10:30Z) - VideoLightFormer: Lightweight Action Recognition using Transformers [8.871042314510788]
We propose a novel, lightweight action recognition architecture, VideoLightFormer.
In a factorized fashion, we carefully extend the 2D convolutional Temporal Segment Network with transformers.
We evaluate VideoLightFormer in a high-efficiency setting on the temporally demanding EPIC-KITCHENS-100 and Something-Something-V2 datasets.
arXiv Detail & Related papers (2021-07-01T13:55:52Z)