From Frames to Sequences: Temporally Consistent Human-Centric Dense Prediction
- URL: http://arxiv.org/abs/2602.01661v2
- Date: Tue, 03 Feb 2026 06:44:02 GMT
- Title: From Frames to Sequences: Temporally Consistent Human-Centric Dense Prediction
- Authors: Xingyu Miao, Junting Dong, Qin Zhao, Yuhang Yang, Junhao Chen, Yang Long,
- Abstract summary: We develop a scalable synthetic data pipeline that generates human frames and motion-aligned sequences with pixel-accurate depth, normals, and masks. We train a unified ViT-based dense predictor that injects an explicit geometric human prior via CSE embeddings. Our two-stage training strategy, combining static pretraining with dynamic sequence supervision, enables the model first to acquire robust spatial representations and then refine temporal consistency across motion-aligned sequences.
- Score: 22.291273919939957
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we focus on the challenge of temporally consistent human-centric dense prediction across video sequences. Existing models achieve strong per-frame accuracy but often flicker under motion, occlusion, and lighting changes, and are rarely trained with paired human video supervision for multiple dense tasks. We address this gap with a scalable synthetic data pipeline that generates photorealistic human frames and motion-aligned sequences with pixel-accurate depth, normals, and masks. Unlike prior static synthetic data pipelines, our pipeline provides both frame-level labels for spatial learning and sequence-level supervision for temporal learning. Building on this, we train a unified ViT-based dense predictor that (i) injects an explicit human geometric prior via CSE embeddings and (ii) improves geometry-feature reliability with a lightweight channel reweighting module after feature fusion. Our two-stage training strategy, combining static pretraining with dynamic sequence supervision, enables the model first to acquire robust spatial representations and then refine temporal consistency across motion-aligned sequences. Extensive experiments show that we achieve state-of-the-art performance on THuman2.1 and Hi4D and generalize effectively to in-the-wild videos.
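A minimal, hypothetical PyTorch sketch of the two architectural ideas named above (injecting a CSE-based human prior into ViT features, then reweighting the fused channels with a lightweight gate before the dense heads) is given below. The module names, feature dimensions, and the squeeze-and-excitation-style gate are assumptions made for illustration, not the authors' released code; the temporal (sequence-supervision) stage is omitted.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ChannelReweight(nn.Module):
        """Lightweight channel reweighting applied after feature fusion.
        A squeeze-and-excitation-style gate is assumed here; the paper only
        states that a lightweight module reweights the fused geometry features."""
        def __init__(self, channels: int, reduction: int = 8):
            super().__init__()
            self.gate = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),                        # squeeze spatial dims
                nn.Conv2d(channels, channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1),
                nn.Sigmoid(),                                   # per-channel weights in [0, 1]
            )

        def forward(self, x):
            return x * self.gate(x)

    class HumanDensePredictor(nn.Module):
        """Unified dense predictor: image features + CSE prior -> depth, normals, mask."""
        def __init__(self, feat_dim: int = 768, cse_dim: int = 16, fused_dim: int = 256):
            super().__init__()
            # Stand-in for a pretrained ViT backbone producing patch features.
            self.backbone = nn.Conv2d(3, feat_dim, kernel_size=16, stride=16)
            self.fuse = nn.Conv2d(feat_dim + cse_dim, fused_dim, kernel_size=1)
            self.reweight = ChannelReweight(fused_dim)
            self.depth_head = nn.Conv2d(fused_dim, 1, 1)
            self.normal_head = nn.Conv2d(fused_dim, 3, 1)
            self.mask_head = nn.Conv2d(fused_dim, 1, 1)

        def forward(self, image, cse):
            feats = self.backbone(image)                        # B x feat_dim x H/16 x W/16
            cse = F.interpolate(cse, size=feats.shape[-2:], mode="bilinear", align_corners=False)
            fused = self.fuse(torch.cat([feats, cse], dim=1))   # inject explicit human prior
            fused = self.reweight(fused)                        # stabilise fused geometry features
            return self.depth_head(fused), self.normal_head(fused), self.mask_head(fused)

A call such as depth, normals, mask = model(frames, cse_maps), with frames of shape (B, 3, H, W) and per-pixel CSE maps of an assumed shape (B, 16, H, W), would produce per-frame predictions; the paper's second training stage would then add sequence-level losses over motion-aligned clips, which this sketch does not cover.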
Related papers
- PoseGaussian: Pose-Driven Novel View Synthesis for Robust 3D Human Reconstruction [13.7448608384464]
PoseGaussian is a pose-guided Gaussian Splatting framework for high-fidelity human novel view synthesis. It is specifically designed to address challenges inherent in dynamic human scenes, such as articulated motion and severe self-occlusion. We validate our approach on ZJU-MoCap, THuman2.0, and in-house datasets, demonstrating state-of-the-art performance in perceptual quality and structural accuracy.
arXiv Detail & Related papers (2026-02-05T01:34:52Z) - SV-GS: Sparse View 4D Reconstruction with Skeleton-Driven Gaussian Splatting [19.12278036176021]
We present SV-GS, a framework that simultaneously estimates a deformation model and the object's motion over time under sparse observations. Our method outperforms existing approaches under sparse observations by up to 34% in PSNR.
arXiv Detail & Related papers (2026-01-01T09:53:03Z) - Multi-Modal Graph Convolutional Network with Sinusoidal Encoding for Robust Human Action Segmentation [10.122882293302787]
Temporal segmentation of human actions is critical for intelligent robots in collaborative settings. We propose a Multi-Modal Graph Convolutional Network (MMGCN) that integrates low-frame-rate (e.g., 1 fps) visual data with high-frame-rate (e.g., 30 fps) motion data. Our approach outperforms state-of-the-art methods, especially in action segmentation accuracy.
arXiv Detail & Related papers (2025-07-01T13:55:57Z) - DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation [50.01520547454224]
Current generative models struggle to synthesize 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS). We propose DiST-4D, which disentangles the problem into two diffusion processes: DiST-T, which predicts future metric depth and multi-view RGB sequences directly from past observations, and DiST-S, which enables spatial NVS by training only on existing viewpoints while enforcing cycle consistency. Experiments demonstrate that DiST-4D achieves state-of-the-art performance in both temporal prediction and NVS tasks, while also delivering competitive performance in planning-related evaluations.
arXiv Detail & Related papers (2025-03-19T13:49:48Z) - DATAP-SfM: Dynamic-Aware Tracking Any Point for Robust Structure from Motion in the Wild [85.03973683867797]
This paper proposes a concise, elegant, and robust pipeline to estimate smooth camera trajectories and obtain dense point clouds for casual videos in the wild.
We show that the proposed method achieves state-of-the-art performance in terms of camera pose estimation even in complex and challenging dynamic scenes.
arXiv Detail & Related papers (2024-11-20T13:01:16Z) - Dynamic 3D Point Cloud Sequences as 2D Videos [81.46246338686478]
3D point cloud sequences serve as one of the most common and practical representation modalities of real-world environments.
We propose a novel generic representation called Structured Point Cloud Videos (SPCVs).
SPCVs re-organize a point cloud sequence as a 2D video with spatial smoothness and temporal consistency, where the pixel values correspond to the 3D coordinates of points (an illustrative sketch of this layout appears after this list).
arXiv Detail & Related papers (2024-03-02T08:18:57Z) - S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial in enhancing holistic cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z) - Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence [56.092059713922744]
We show that using high-level contextualized features as prediction targets can achieve superior performance.
Specifically, we propose Skeleton2vec, a simple and efficient self-supervised 3D action representation learning framework.
Our proposed Skeleton2vec outperforms previous methods and achieves state-of-the-art results.
arXiv Detail & Related papers (2024-01-01T12:08:35Z) - Multiscale Residual Learning of Graph Convolutional Sequence Chunks for Human Motion Prediction [23.212848643552395]
A new method is proposed for human motion prediction by learning temporal and spatial dependencies.
Our proposed method is able to effectively model the sequence information for motion prediction and outperform other techniques to set a new state-of-the-art.
arXiv Detail & Related papers (2023-08-31T15:23:33Z) - STDepthFormer: Predicting Spatio-temporal Depth from Video with a Self-supervised Transformer Model [0.0]
A self-supervised model that simultaneously predicts a sequence of future frames from video input with a spatio-temporal attention network is proposed.
The proposed model leverages prior scene knowledge such as object shape and texture similar to single-image depth inference methods.
It is implicitly capable of forecasting the motion of objects in the scene, rather than requiring complex models involving multi-object detection, segmentation and tracking.
arXiv Detail & Related papers (2023-03-02T12:22:51Z) - Modelling Latent Dynamics of StyleGAN using Neural ODEs [52.03496093312985]
We learn the trajectory of independently inverted latent codes from GANs.
The learned continuous trajectory allows us to perform infinite frame interpolation and consistent video manipulation.
Our method achieves state-of-the-art performance but with much less computation.
arXiv Detail & Related papers (2022-08-23T21:20:38Z) - SeCo: Exploring Sequence Supervision for Unsupervised Representation Learning [114.58986229852489]
In this paper, we explore the basic and generic supervision in the sequence from spatial, sequential and temporal perspectives.
We derive a particular form named Sequence Contrastive Learning (SeCo).
SeCo shows superior results under the linear protocol on action recognition, untrimmed activity recognition and object tracking.
arXiv Detail & Related papers (2020-08-03T15:51:35Z)
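As referenced in the Structured Point Cloud Videos (SPCVs) entry above, a point cloud sequence can be re-organized as a 2D video whose pixel values store 3D coordinates. The hypothetical NumPy snippet below illustrates only the shape of that representation, using a crude sort-based point-to-pixel assignment; the actual SPCV construction learns the assignment under spatial-smoothness and temporal-consistency objectives, which are not reproduced here.

    import numpy as np

    def points_to_spcv_frame(points: np.ndarray, height: int = 32, width: int = 64) -> np.ndarray:
        """Place N 3D points onto an H x W x 3 'image' whose pixels store coordinates.
        A simple coordinate sort stands in for the learned, consistency-aware mapping."""
        order = np.lexsort((points[:, 0], points[:, 1], points[:, 2]))  # crude structured ordering
        grid = np.zeros((height * width, 3), dtype=points.dtype)
        n = min(len(order), height * width)
        grid[:n] = points[order[:n]]
        return grid.reshape(height, width, 3)

    # A sequence of T point clouds becomes a T x H x W x 3 "video".
    sequence = [np.random.rand(2048, 3).astype(np.float32) for _ in range(4)]
    spcv_video = np.stack([points_to_spcv_frame(p) for p in sequence])
    print(spcv_video.shape)  # (4, 32, 64, 3)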