Efficient Encoder-Free Pose Conditioning and Pose Control for Virtual Try-On
- URL: http://arxiv.org/abs/2509.20343v1
- Date: Wed, 24 Sep 2025 17:35:23 GMT
- Title: Efficient Encoder-Free Pose Conditioning and Pose Control for Virtual Try-On
- Authors: Qi Li, Shuwen Qiu, Julien Han, Xingzi Xu, Mehmet Saygin Seyfioglu, Kee Kiat Koo, Karim Bouyarmane
- Abstract summary: We build upon a baseline VTON model that concatenates the reference image condition without an external encoder, control network, or complex attention layers. We investigate methods to incorporate pose control into this pure concatenation paradigm by spatially concatenating pose data. Experiments reveal that pose stitching with pose maps yields the best results, enhancing both pose preservation and output realism.
- Score: 11.550777201655393
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As online shopping continues to grow, the demand for Virtual Try-On (VTON) technology has surged, allowing customers to visualize products on themselves by overlaying product images onto their own photos. An essential yet challenging condition for effective VTON is pose control, which ensures accurate alignment of products with the user's body while supporting diverse orientations for a more immersive experience. However, incorporating pose conditions into VTON models presents several challenges, including selecting the optimal pose representation, integrating poses without additional parameters, and balancing pose preservation with flexible pose control. In this work, we build upon a baseline VTON model that concatenates the reference image condition without an external encoder, control network, or complex attention layers. We investigate methods to incorporate pose control into this pure concatenation paradigm by spatially concatenating pose data, comparing performance using pose maps and skeletons, without adding any parameters or modules to the baseline model. Our experiments reveal that pose stitching with pose maps yields the best results, enhancing both pose preservation and output realism. Additionally, we introduce a mixed-mask training strategy using fine-grained and bounding box masks, allowing the model to support flexible product integration across varied poses and conditions.
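The two ideas in the abstract are simple enough to sketch directly. Below is a minimal PyTorch sketch, assuming all conditions are latents from the same frozen VAE; the function names, the width-axis stitching layout, and the 50/50 mask split are illustrative assumptions, not the authors' released implementation.

```python
import torch

def build_concat_input(person_lat, garment_lat, pose_map_lat):
    """Pure-concatenation conditioning (assumed layout).

    All three tensors are (B, C, H, W) latents from one frozen VAE, so no
    external encoder, ControlNet, or extra attention module is needed: the
    garment and pose map are stitched beside the person latent along the
    width axis, and the denoiser sees them through its ordinary
    self-attention over the wider canvas.
    """
    return torch.cat([person_lat, garment_lat, pose_map_lat], dim=-1)  # (B, C, H, 3W)

def mixed_mask(fine_mask, p_bbox=0.5):
    """Mixed-mask training (the 50/50 split is an assumption).

    With probability p_bbox, the fine-grained garment mask (1, H, W) is
    replaced by its bounding box, so the model learns both tight inpainting
    and freer product placement across varied poses and conditions.
    """
    if torch.rand(()).item() >= p_bbox:
        return fine_mask
    ys, xs = torch.nonzero(fine_mask[0], as_tuple=True)
    bbox = torch.zeros_like(fine_mask)
    if ys.numel() > 0:
        bbox[:, ys.min():ys.max() + 1, xs.min():xs.max() + 1] = 1.0
    return bbox
```

Note that in this reading, pose conditioning adds no parameters at all: the only change to the baseline is a wider input canvas.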
Related papers
- SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation [50.792027578906804]
We introduce SteadyDancer, an Image-to-Video (I2V) paradigm-based framework that achieves harmonized and coherent animation. Experiments demonstrate that SteadyDancer achieves state-of-the-art performance in both appearance fidelity and motion control.
arXiv Detail & Related papers (2025-11-24T17:15:55Z)
- AvatarVTON: 4D Virtual Try-On for Animatable Avatars [67.13031660684457]
AvatarVTON generates realistic try-on results from a single in-shop garment image. It supports dynamic garment interactions under single-view supervision. It is well-suited for AR/VR, gaming, and digital-human applications.
arXiv Detail & Related papers (2025-10-06T14:06:34Z)
- PoseDiff: A Unified Diffusion Model Bridging Robot Pose Estimation and Video-to-Action Control [67.17998939712326]
We present PoseDiff, a conditional diffusion model that unifies robot state estimation and control within a single framework. At its core, PoseDiff maps raw visual observations into structured robot states, such as 3D keypoints or joint angles, from a single RGB image. Building upon this foundation, PoseDiff extends naturally to video-to-action inverse dynamics.
arXiv Detail & Related papers (2025-09-29T10:55:48Z)
- OmniVTON: Training-Free Universal Virtual Try-On [53.31945401098557]
Image-based Virtual Try-On (VTON) techniques rely on either supervised in-shop approaches or unsupervised in-the-wild methods, which improve adaptability but remain constrained by data biases and limited universality. We propose OmniVTON, the first training-free universal VTON framework that decouples garment and pose conditioning to achieve both texture fidelity and pose consistency across diverse settings.
arXiv Detail & Related papers (2025-07-20T16:37:53Z)
- PosePilot: Steering Camera Pose for Generative World Models with Self-supervised Depth [9.737257599532956]
We introduce PosePilot, a lightweight yet powerful framework that significantly enhances camera pose controllability in generative world models. Specifically, we incorporate self-supervised depth and pose readouts, allowing the model to infer depth and relative camera motion directly from video sequences. Experiments on autonomous driving and general-domain video datasets demonstrate that PosePilot significantly enhances structural understanding and motion reasoning.
arXiv Detail & Related papers (2025-05-03T07:51:46Z)
- DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation [63.781450025764904]
We propose DynamiCtrl, a novel framework for human animation in the video DiT architecture. We use a shared VAE encoder for human images and driving poses, unifying them into a common latent space. We also introduce the "Joint-text" paradigm, which preserves the role of text embeddings to provide global semantic context. A minimal sketch of the shared-encoder idea follows this entry.
arXiv Detail & Related papers (2025-03-27T08:07:45Z)
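For the shared-VAE idea above, a minimal sketch following the Hugging Face diffusers AutoencoderKL convention might look as follows; the channel-wise fusion and helper names are assumptions, not DynamiCtrl's actual code.

```python
import torch

def encode_with_shared_vae(vae, human_img, pose_render):
    """Shared-VAE conditioning sketch (fusion scheme assumed).

    A single frozen VAE encodes both the human image and the rendered
    driving pose, so both conditions live in one latent space and no
    dedicated pose encoder is required. The call pattern
    `encode(...).latent_dist.sample()` follows the diffusers AutoencoderKL API.
    """
    with torch.no_grad():  # the VAE stays frozen
        human_lat = vae.encode(human_img).latent_dist.sample()
        pose_lat = vae.encode(pose_render).latent_dist.sample()
    return torch.cat([human_lat, pose_lat], dim=1)  # channel concat (assumed)
```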
- ITVTON: Virtual Try-On Diffusion Transformer Based on Integrated Image and Text [11.85544970521423]
We introduce ITVTON, which utilizes the Diffusion Transformer (DiT) as a generator to enhance image quality. ITVTON improves garment-person interaction by stitching garment and person images along the spatial channel. We constrain training to attention parameters within a single Diffusion Transformer (Single-DiT) block, as sketched after this entry.
arXiv Detail & Related papers (2025-01-28T07:24:15Z)
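ITVTON's restriction of training to attention parameters is easy to express; matching parameter names containing 'attn' follows common DiT naming conventions but is an assumption about the actual codebase.

```python
import torch.nn as nn

def freeze_all_but_attention(dit: nn.Module) -> None:
    """Leave only attention-layer parameters trainable (ITVTON-style sketch)."""
    for name, param in dit.named_parameters():
        # 'attn' matches the q/k/v/output projections in common DiT
        # implementations; ITVTON's real parameter names may differ.
        param.requires_grad_('attn' in name)
```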
- ODPG: Outfitting Diffusion with Pose Guided Condition [2.5602836891933074]
VTON technology allows users to visualize how clothes would look on them without physically trying them on. Traditional VTON methods, often using Generative Adversarial Networks (GANs) and diffusion models, face challenges in achieving high realism and handling dynamic poses. This paper introduces Outfitting Diffusion with Pose Guided Condition (ODPG), a novel approach that leverages a latent diffusion model with multiple conditioning inputs during the denoising process.
arXiv Detail & Related papers (2025-01-12T10:30:27Z)
- IMAGDressing-v1: Customizable Virtual Dressing [58.44155202253754]
IMAGDressing-v1 is a virtual dressing task that generates freely editable human images with fixed garments and optional conditions.
IMAGDressing-v1 incorporates a garment UNet that captures semantic features from CLIP and texture features from VAE.
We present a hybrid attention module, including a frozen self-attention and a trainable cross-attention, to integrate garment features from the garment UNet into a frozen denoising UNet. A minimal sketch follows this entry.
arXiv Detail & Related papers (2024-07-17T16:26:30Z)
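The hybrid attention module described above can be sketched as a frozen self-attention branch plus a trainable cross-attention branch over garment features; the additive fusion and all module names are assumptions, not IMAGDressing-v1's released code.

```python
import torch.nn as nn

class HybridAttention(nn.Module):
    """Sketch of a hybrid attention block (fusion scheme assumed)."""

    def __init__(self, frozen_self_attn: nn.MultiheadAttention, dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = frozen_self_attn
        for p in self.self_attn.parameters():
            p.requires_grad_(False)  # keep the base model's generation prior intact
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # trainable

    def forward(self, x, garment_feats):
        # x: (B, N, dim) denoising-UNet tokens; garment_feats: (B, M, dim)
        sa, _ = self.self_attn(x, x, x)                           # frozen branch
        ca, _ = self.cross_attn(x, garment_feats, garment_feats)  # trainable branch
        return sa + ca                                            # additive fusion (assumed)
```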
- Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation [32.190055780969466]
Stable-Pose is a novel adapter model that introduces a coarse-to-fine attention masking strategy into a vision Transformer.
We leverage the query-key self-attention mechanism of ViTs to explore the interconnections among different anatomical parts in human pose skeletons.
Stable-Pose achieved an AP score of 57.1 on the LAION-Human dataset, roughly a 13% improvement over the established ControlNet approach. A sketch of the masking idea follows this entry.
arXiv Detail & Related papers (2024-06-04T16:54:28Z)
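A coarse-to-fine pose attention mask could be built roughly as follows; the patch pooling and dilation mechanics illustrate the general idea under stated assumptions and are not Stable-Pose's exact procedure.

```python
import torch
import torch.nn.functional as F

def pose_attention_mask(pose_render, patch=16, dilate=2):
    """Coarse-to-fine pose attention mask sketch (mechanics assumed).

    `pose_render`: (H, W) binary skeleton image. Pixels are pooled down to
    ViT patch tokens; a larger `dilate` loosens the mask for coarse stages,
    while `dilate=0` tightens it for fine stages. Attention is permitted
    wherever the query or the key token overlaps the (dilated) pose region.
    """
    tok = F.max_pool2d(pose_render[None, None].float(), patch)  # (1, 1, H/p, W/p)
    if dilate > 0:
        tok = F.max_pool2d(tok, 2 * dilate + 1, stride=1, padding=dilate)
    tok = tok.flatten() > 0                 # (N,) pose-relevant patch tokens
    return tok[None, :] | tok[:, None]      # (N, N) boolean attention mask
```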
- AnyFit: Controllable Virtual Try-on for Any Combination of Attire Across Any Scenario [50.62711489896909]
AnyFit surpasses all baselines on high-resolution benchmarks and real-world data by a wide margin.
AnyFit's impressive performance on high-fidelity virtual try-ons in any scenario from any image paves a new path for future research within the fashion community.
arXiv Detail & Related papers (2024-05-28T13:33:08Z)
- Towards Robust and Expressive Whole-body Human Pose and Shape Estimation [51.457517178632756]
Whole-body pose and shape estimation aims to jointly predict different behaviors of the entire human body from a monocular image.
Existing methods often exhibit degraded performance under the complexity of in-the-wild scenarios.
We propose a novel framework to enhance the robustness of whole-body pose and shape estimation.
arXiv Detail & Related papers (2023-12-14T08:17:42Z)
- C-VTON: Context-Driven Image-Based Virtual Try-On Network [1.0832844764942349]
We propose a Context-Driven Virtual Try-On Network (C-VTON) that convincingly transfers selected clothing items to the target subjects.
At the core of the C-VTON pipeline are: (i) a geometric matching procedure that efficiently aligns the target clothing with the pose of the person in the input images, and (ii) a powerful image generator that utilizes various types of contextual information when synthesizing the final try-on result.
arXiv Detail & Related papers (2022-12-08T17:56:34Z)
- Drivable Volumetric Avatars using Texel-Aligned Features [52.89305658071045]
Photorealistic telepresence requires both high-fidelity body modeling and faithful driving to enable dynamically synthesized appearance.
We propose an end-to-end framework that addresses two core challenges in modeling and driving full-body avatars of real people.
arXiv Detail & Related papers (2022-07-20T09:28:16Z)
- PT-VTON: an Image-Based Virtual Try-On Network with Progressive Pose Attention Transfer [11.96427084717743]
PT-VTON is a pose-transfer-based framework for cloth transfer that enables virtual try-on with arbitrary poses.
PT-VTON can be applied in the fashion industry with minimal modification of existing systems.
arXiv Detail & Related papers (2021-11-23T21:51:08Z)