DiffProxy: Multi-View Human Mesh Recovery via Diffusion-Generated Dense Proxies
- URL: http://arxiv.org/abs/2601.02267v1
- Date: Mon, 05 Jan 2026 16:51:45 GMT
- Title: DiffProxy: Multi-View Human Mesh Recovery via Diffusion-Generated Dense Proxies
- Authors: Renke Wang, Zhenyu Zhang, Ying Tai, Jian Yang
- Abstract summary: DiffProxy is a novel framework that generates multi-view consistent human proxies for mesh recovery. It achieves state-of-the-art performance across five real-world benchmarks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human mesh recovery from multi-view images faces a fundamental challenge: real-world datasets contain imperfect ground-truth annotations that bias model training, while synthetic data with precise supervision suffers from a domain gap. In this paper, we propose DiffProxy, a novel framework that generates multi-view consistent human proxies for mesh recovery. Central to DiffProxy is leveraging diffusion-based generative priors to bridge synthetic training and real-world generalization. Its key innovations include: (1) a multi-conditional mechanism for generating multi-view consistent, pixel-aligned human proxies; (2) a hand refinement module that incorporates flexible visual prompts to enhance local details; and (3) an uncertainty-aware test-time scaling method that increases robustness to challenging cases during optimization. These designs ensure that the mesh recovery process effectively benefits from the precise synthetic ground truth and the generative advantages of the diffusion-based pipeline. Trained entirely on synthetic data, DiffProxy achieves state-of-the-art performance across five real-world benchmarks, demonstrating strong zero-shot generalization, particularly in challenging scenarios with occlusions and partial views. Project page: https://wrk226.github.io/DiffProxy.html
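To make the third innovation concrete, here is a minimal illustrative sketch of one common way an uncertainty-aware test-time scaling scheme can work: draw several proxy predictions, treat their per-pixel variance as an uncertainty estimate, and down-weight uncertain regions during mesh fitting. This is an assumption-laden sketch, not the authors' implementation; `sample_proxies` and `weighted_fitting_loss` are hypothetical placeholders standing in for diffusion sampling and the real optimization objective.

```python
# Illustrative sketch (NOT DiffProxy's actual code): uncertainty-aware
# test-time scaling. We draw several proxy maps, estimate per-pixel
# uncertainty as the variance across draws, and down-weight uncertain
# pixels in a robust fitting loss. All names below are placeholders.
import numpy as np

def sample_proxies(rng, n_samples=8, shape=(64, 64, 3)):
    # Stand-in for repeated diffusion sampling of a dense proxy map.
    return rng.normal(loc=0.5, scale=0.1, size=(n_samples,) + shape)

def uncertainty_weights(samples, eps=1e-6):
    # Per-pixel variance across samples -> confidence weights in (0, 1].
    var = samples.var(axis=0).mean(axis=-1)          # (H, W)
    return 1.0 / (1.0 + var / (var.mean() + eps))

def weighted_fitting_loss(pred, target, weights):
    # Robust L1 residual, down-weighted where the proxy is uncertain.
    residual = np.abs(pred - target).mean(axis=-1)   # (H, W)
    return float((weights * residual).mean())

rng = np.random.default_rng(0)
samples = sample_proxies(rng)
proxy = samples.mean(axis=0)                         # aggregated proxy
w = uncertainty_weights(samples)
# Hypothetical "current mesh rendering" as a perturbed copy of the proxy.
loss = weighted_fitting_loss(proxy, proxy * 0.9, w)
```

The design intuition is that regions where repeated generative draws disagree (e.g. occluded limbs or truncated views) should contribute less to the optimization, which matches the abstract's claim of increased robustness on occlusions and partial views.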
Related papers
- Toward Generalizable Deblurring: Leveraging Massive Blur Priors with Linear Attention for Real-World Scenarios
GLOWDeblur is a Generalizable reaL-wOrld lightWeight Deblur model that combines a convolution-based pre-reconstruction and domain alignment module with a lightweight diffusion backbone. We propose Blur Pattern Pretraining (BPP), which acquires blur priors from simulation datasets and transfers them through joint fine-tuning on real data. We further introduce Motion and Semantic Guidance (MoSeG) to strengthen blur priors under severe degradation.
arXiv Detail & Related papers (2026-01-10T11:01:31Z)
- UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass
UniSH is a unified, feed-forward framework for joint metric-scale 3D scene and human reconstruction. Our framework bridges strong, disparate priors from scene reconstruction and HMR. Our model achieves state-of-the-art performance on human-centric scene reconstruction.
arXiv Detail & Related papers (2026-01-03T16:06:27Z)
- Patch-Discontinuity Mining for Generalized Deepfake Detection
Deepfake detection methods often rely on handcrafted forensic cues and complex architectures. We propose GenDF, a framework that transfers a powerful vision model to the deepfake detection task with a compact and neat network design. Experiments demonstrate that GenDF achieves state-of-the-art generalization performance in cross-domain and cross-manipulation settings.
arXiv Detail & Related papers (2025-12-26T13:18:14Z)
- InvFussion: Bridging Supervised and Zero-shot Diffusion for Inverse Problems
This work introduces a framework that combines the strong performance of supervised approaches and the flexibility of zero-shot methods. A novel architectural design seamlessly integrates the degradation operator directly into the denoiser. Experimental results on the FFHQ and ImageNet datasets demonstrate state-of-the-art posterior-sampling performance.
arXiv Detail & Related papers (2025-04-02T12:40:57Z)
- DiffusionFake: Enhancing Generalization in Deepfake Detection via Guided Stable Diffusion
Deepfake technology has made face swapping highly realistic, raising concerns about the malicious use of fabricated facial content.
Existing methods often struggle to generalize to unseen domains due to the diverse nature of facial manipulations.
We introduce DiffusionFake, a novel framework that reverses the generative process of face forgeries to enhance the generalization of detection models.
arXiv Detail & Related papers (2024-10-06T06:22:43Z)
- Face Forgery Detection with Elaborate Backbone
Face Forgery Detection aims to determine whether a digital face is real or fake.
Previous FFD models directly employ existing backbones to represent and extract forgery cues.
We propose leveraging the ViT network with self-supervised learning on real-face datasets to pre-train a backbone.
We then build a competitive backbone fine-tuning framework that strengthens the backbone's ability to extract diverse forgery cues.
arXiv Detail & Related papers (2024-09-25T13:57:16Z)
- MFCLIP: Multi-modal Fine-grained CLIP for Generalizable Diffusion Face Forgery Detection
The rapid development of photo-realistic face generation methods has raised significant concerns in society and academia. Although existing approaches mainly capture face forgery patterns using the image modality, other modalities like fine-grained noises and texts are not fully explored. We propose a novel multi-modal fine-grained CLIP (MFCLIP) model, which mines comprehensive and fine-grained forgery traces across image-noise modalities.
arXiv Detail & Related papers (2024-09-15T13:08:59Z)
- Diffusion Features to Bridge Domain Gap for Semantic Segmentation
This paper investigates the approach that leverages the sampling and fusion techniques to harness the features of diffusion models efficiently.
By leveraging the strength of text-to-image generation capability, we introduce a new training framework designed to implicitly learn posterior knowledge from it.
arXiv Detail & Related papers (2024-06-02T15:33:46Z)
- FaceCat: Enhancing Face Recognition Security with a Unified Diffusion Model
Face anti-spoofing (FAS) and adversarial detection (FAD) have been regarded as critical technologies to ensure the safety of face recognition systems.
This paper aims to achieve this goal by breaking through two primary obstacles: 1) the suboptimal face feature representation and 2) the scarcity of training data.
arXiv Detail & Related papers (2024-04-14T09:01:26Z)
- Single Image Reflection Separation via Component Synergy
The reflection superposition phenomenon is complex and widely distributed in the real world.
We propose a more general form of the superposition model by introducing a learnable residue term.
In order to fully capitalize on its advantages, we further design the network structure elaborately.
arXiv Detail & Related papers (2023-08-19T14:25:27Z)
- Towards General Visual-Linguistic Face Forgery Detection
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.