PoseCrafter: Extreme Pose Estimation with Hybrid Video Synthesis
- URL: http://arxiv.org/abs/2510.19527v1
- Date: Wed, 22 Oct 2025 12:32:37 GMT
- Title: PoseCrafter: Extreme Pose Estimation with Hybrid Video Synthesis
- Authors: Qing Mao, Tianxin Huang, Yu Zhu, Jinqiu Sun, Yanning Zhang, Gim Hee Lee
- Abstract summary: Pairwise camera pose estimation from sparsely overlapping image pairs remains a critical and unsolved challenge in 3D vision. Recent approaches attempt to address this by synthesizing intermediate frames with video interpolation and selecting key frames via a self-consistency score. We propose Hybrid Video Generation (HVG) to synthesize clearer intermediate frames by coupling a video interpolation model with a pose-conditioned novel view synthesis model. We also propose a Feature Matching Selector (FMS) based on feature correspondence to select intermediate frames appropriate for pose estimation from the synthesized results.
- Score: 82.87579563469039
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pairwise camera pose estimation from sparsely overlapping image pairs remains a critical and unsolved challenge in 3D vision. Most existing methods struggle with image pairs that have small or no overlap. Recent approaches attempt to address this by synthesizing intermediate frames using video interpolation and selecting key frames via a self-consistency score. However, the generated frames are often blurry when the inputs overlap only slightly, and the selection strategies are slow and not explicitly aligned with pose estimation. To address these issues, we propose Hybrid Video Generation (HVG), which synthesizes clearer intermediate frames by coupling a video interpolation model with a pose-conditioned novel view synthesis model, and a Feature Matching Selector (FMS), which uses feature correspondence to select, from the synthesized results, the intermediate frames best suited for pose estimation. Extensive experiments on Cambridge Landmarks, ScanNet, DL3DV-10K, and NAVI demonstrate that PoseCrafter clearly improves pose estimation over existing SOTA methods, especially on pairs with small or no overlap.
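The abstract does not detail how FMS scores candidate frames, so the following is only a minimal sketch of the idea using off-the-shelf SIFT matching from OpenCV; the min-of-both-links scoring rule, the `top_k` cutoff, and the helper names are illustrative assumptions, not the authors' implementation.

```python
import cv2

def match_count(img_a, img_b, ratio=0.75):
    """Count Lowe-ratio-filtered SIFT correspondences between two images (BGR or grayscale)."""
    sift = cv2.SIFT_create()
    _, des_a = sift.detectAndCompute(img_a, None)
    _, des_b = sift.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return 0
    matches = cv2.BFMatcher().knnMatch(des_a, des_b, k=2)
    good = [m for m in matches if len(m) == 2 and m[0].distance < ratio * m[1].distance]
    return len(good)

def select_intermediate_frames(img0, img1, synthesized_frames, top_k=3):
    """Rank synthesized intermediate frames by feature correspondences to BOTH input views.

    A frame only helps bridge the relative pose between img0 and img1 if it matches
    each endpoint, so each candidate is scored by its weaker link (illustrative choice).
    """
    scored = []
    for idx, frame in enumerate(synthesized_frames):
        score = min(match_count(frame, img0), match_count(frame, img1))
        scored.append((score, idx))
    scored.sort(reverse=True)
    return [synthesized_frames[idx] for _, idx in scored[:top_k]]
```

In a pipeline of this kind, the selected frames would then be chained through a pairwise pose estimator, accumulating relative poses from the first input, through each intermediate frame, to the second input.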
Related papers
- End-to-End Multi-Person Pose Estimation with Pose-Aware Video Transformer [7.19764062839405]
We present a fully end-to-end framework for multi-person 2D pose estimation in videos. A key challenge is to associate individuals across frames under complex and overlapping temporal trajectories. We introduce a novel Pose-Aware Video Transformer Network (PAVE-Net), which features a spatial encoder to model intra-frame relations and a temporal pose decoder to associate poses across frames.
arXiv Detail & Related papers (2025-11-17T10:19:35Z) - An End-to-End Framework for Video Multi-Person Pose Estimation [3.090225730976977]
We propose VEPE (Video End-to-End Pose Estimation), a simple and flexible framework for end-to-end pose estimation in video.<n>We show that our approach outperforms two-stage models while improving inference speed by 300%.
arXiv Detail & Related papers (2025-09-01T03:34:57Z) - A new dataset and comparison for multi-camera frame synthesis [0.0]
We develop a novel multi-camera dataset using a custom-built dense linear camera array. We evaluate classical and deep learning frame interpolators against a view synthesis method for the task of view in-betweening.
arXiv Detail & Related papers (2025-08-12T16:37:30Z) - GaVS: 3D-Grounded Video Stabilization via Temporally-Consistent Local Reconstruction and Rendering [54.489285024494855]
Video stabilization is pivotal for video processing, as it removes unwanted shakiness while preserving the original user motion intent. Existing approaches, depending on the domain in which they operate, suffer from several issues that degrade the user experience. We introduce GaVS, a novel 3D-grounded approach that reformulates video stabilization as a temporally-consistent 'local reconstruction and rendering' paradigm.
arXiv Detail & Related papers (2025-06-30T15:24:27Z) - Hybrid bundle-adjusting 3D Gaussians for view consistent rendering with pose optimization [2.8990883469500286]
We introduce a hybrid bundle-adjusting 3D Gaussians model that enables view-consistent rendering with pose optimization.
This model jointly extracts image-based and neural 3D representations to simultaneously generate view-consistent images and camera poses within forward-facing scenes.
arXiv Detail & Related papers (2024-10-17T07:13:00Z) - MultiDiff: Consistent Novel View Synthesis from a Single Image [60.04215655745264]
MultiDiff is a novel approach for consistent novel view synthesis of scenes from a single RGB image.
Our results demonstrate that MultiDiff outperforms state-of-the-art methods on the challenging, real-world datasets RealEstate10K and ScanNet.
arXiv Detail & Related papers (2024-06-26T17:53:51Z) - Video Interpolation with Diffusion Models [54.06746595879689]
We present VIDIM, a generative model for video, which creates short videos given a start and end frame.
VIDIM uses cascaded diffusion models to first generate the target video at low resolution, and then generate the high-resolution video conditioned on the low-resolution generated video.
arXiv Detail & Related papers (2024-04-01T15:59:32Z) - Enhanced Stable View Synthesis [86.69338893753886]
We introduce an approach to enhance the novel view synthesis from images taken from a freely moving camera.
The introduced approach focuses on outdoor scenes where recovering accurate geometric scaffold and camera pose is challenging.
arXiv Detail & Related papers (2023-03-30T01:53:14Z) - Deep Dual Consecutive Network for Human Pose Estimation [44.41818683253614]
We propose a novel multi-frame human pose estimation framework, leveraging abundant temporal cues between video frames to facilitate keypoint detection.
Our method ranks No.1 in the Multi-frame Person Pose Estimation Challenge on the large-scale benchmark datasets PoseTrack 2017 and PoseTrack 2018.
arXiv Detail & Related papers (2021-03-12T13:11:27Z) - ARVo: Learning All-Range Volumetric Correspondence for Video Deblurring [92.40655035360729]
Video deblurring models exploit consecutive frames to remove blurs from camera shakes and object motions.
We propose a novel implicit method to learn spatial correspondence among blurry frames in the feature space.
Our proposed method is evaluated on the widely-adopted DVD dataset, along with a newly collected High-Frame-Rate (1000 fps) dataset for Video Deblurring.
arXiv Detail & Related papers (2021-03-07T04:33:13Z)