VTONQA: A Multi-Dimensional Quality Assessment Dataset for Virtual Try-on
- URL: http://arxiv.org/abs/2601.02945v1
- Date: Tue, 06 Jan 2026 11:42:26 GMT
- Title: VTONQA: A Multi-Dimensional Quality Assessment Dataset for Virtual Try-on
- Authors: Xinyi Wei, Sijing Wu, Zitong Xu, Yunhao Li, Huiyu Duan, Xiongkuo Min, Guangtao Zhai
- Abstract summary: VTONQA is the first multi-dimensional quality assessment dataset specifically designed for VTON.
It contains 8,132 images generated by 11 representative VTON models, along with 24,396 mean opinion scores (MOSs) across three evaluation dimensions.
We benchmark both VTON models and a diverse set of image quality assessment (IQA) metrics, revealing the limitations of existing methods.
- Score: 83.39966045949338
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid development of e-commerce and digital fashion, image-based virtual try-on (VTON) has attracted increasing attention. However, existing VTON models often suffer from artifacts such as garment distortion and body inconsistency, highlighting the need for reliable quality evaluation of VTON-generated images. To this end, we construct VTONQA, the first multi-dimensional quality assessment dataset specifically designed for VTON, which contains 8,132 images generated by 11 representative VTON models, along with 24,396 mean opinion scores (MOSs) across three evaluation dimensions (i.e., clothing fit, body compatibility, and overall quality). Based on VTONQA, we benchmark both VTON models and a diverse set of image quality assessment (IQA) metrics, revealing the limitations of existing methods and highlighting the value of the proposed dataset. We believe that the VTONQA dataset and corresponding benchmarks will provide a solid foundation for perceptually aligned evaluation, benefiting both the development of quality assessment methods and the advancement of VTON models.
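The IQA benchmarking described in the abstract typically reduces to rank and linear correlation between metric predictions and human ratings. Below is a minimal sketch, not the authors' released code: the random placeholder scores and the snake_case dimension names are illustrative, while the three dimensions and the 8,132-image count come from the abstract.

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr

rng = np.random.default_rng(0)
n_images = 8132  # number of images in VTONQA

# Hypothetical placeholders: one MOS per image for each of the three
# annotated dimensions (8,132 x 3 = 24,396 scores in total), and one
# predicted quality score per image from the IQA metric under test.
mos = {dim: rng.uniform(1.0, 5.0, n_images)
       for dim in ("clothing_fit", "body_compatibility", "overall_quality")}
predicted = rng.uniform(0.0, 1.0, n_images)

for dim, scores in mos.items():
    srcc, _ = spearmanr(predicted, scores)  # rank (monotonic) agreement
    plcc, _ = pearsonr(predicted, scores)   # linear agreement
    print(f"{dim}: SRCC={srcc:.3f}  PLCC={plcc:.3f}")
```

In standard IQA protocols, PLCC is usually computed after fitting a logistic mapping from predictions to MOSs; the raw Pearson correlation here keeps the sketch short.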
Related papers
- OpenVTON-Bench: A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On Evaluation [14.782532923428084]
We present OpenVTON-Bench, a large-scale benchmark comprising approximately 100K high-resolution image pairs.
The dataset is constructed using DINOv3-based hierarchical clustering for semantically balanced sampling and Gemini-powered dense captioning.
We propose a multi-modal protocol that measures VTON quality along five interpretable dimensions: background consistency, identity fidelity, texture fidelity, shape plausibility, and overall realism.
arXiv Detail & Related papers (2026-01-30T08:58:00Z)
- Evaluating and Preserving High-level Fidelity in Super-Resolution [50.65679806442527]
Super-Resolution (SR) models achieve impressive results in reconstructing details and delivering visually pleasant outputs.
However, this powerful generative ability can sometimes hallucinate and thereby change the image content.
This type of high-level change is easily identified by humans, yet it is not well captured by existing low-level image quality metrics.
arXiv Detail & Related papers (2025-12-07T22:53:34Z)
- Rethinking Garment Conditioning in Diffusion-based Virtual Try-On [7.386027762996787]
We develop Re-CatVTON, an efficient single-UNet model that achieves high performance.
The proposed Re-CatVTON significantly improves performance compared to its predecessor.
Our results demonstrate improved FID, KID, and LPIPS scores, with only a marginal decrease in SSIM (see the metric-computation sketch after this list).
arXiv Detail & Related papers (2025-11-24T05:19:44Z)
- AvatarVTON: 4D Virtual Try-On for Animatable Avatars [67.13031660684457]
AvatarVTON generates realistic try-on results from a single in-shop garment image.
It supports dynamic garment interactions under single-view supervision.
It is well-suited for AR/VR, gaming, and digital-human applications.
arXiv Detail & Related papers (2025-10-06T14:06:34Z)
- VTBench: Evaluating Visual Tokenizers for Autoregressive Image Generation [11.529598741483076]
A visual tokenizer (VT) maps continuous pixel inputs to discrete token sequences.
Current discrete VTs fall significantly behind continuous variational autoencoders (VAEs), leading to degraded image reconstructions and poor preservation of details and text.
Existing benchmarks focus on end-to-end generation quality, without isolating VT performance.
We introduce VTBench, a comprehensive benchmark that systematically evaluates VTs across three core tasks: Image Reconstruction, Detail Preservation, and Text Preservation.
arXiv Detail & Related papers (2025-05-19T17:59:01Z)
- VTON 360: High-Fidelity Virtual Try-On from Any Viewing Direction [103.0918705283309]
Virtual Try-On (VTON) is a transformative technology in e-commerce and fashion design, enabling realistic digital visualization of clothing on individuals.
We propose VTON 360, a novel 3D VTON method that addresses the open challenge of achieving high-fidelity VTON that supports any-view rendering.
arXiv Detail & Related papers (2025-03-15T15:08:48Z)
- TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models [8.158200403139196]
We introduce Virtual Try-Off (VTOFF), a novel task of generating standardized garment images from single photos of clothed individuals.
TryOffDiff adapts Stable Diffusion with SigLIP-based visual conditioning to deliver high-fidelity reconstructions.
Our findings highlight VTOFF's potential to improve e-commerce product imagery, advance generative model evaluation, and guide future research on high-fidelity reconstruction.
arXiv Detail & Related papers (2024-11-27T13:53:09Z)
- Opinion-Unaware Blind Image Quality Assessment using Multi-Scale Deep Feature Statistics [54.08757792080732]
We propose integrating deep features from pre-trained visual models with a statistical analysis model to achieve opinion-unaware BIQA (OU-BIQA).
Our proposed model exhibits superior consistency with human visual perception compared to state-of-the-art BIQA models.
arXiv Detail & Related papers (2024-05-29T06:09:34Z)
- CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms.
Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner.
Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
arXiv Detail & Related papers (2022-06-29T15:22:01Z)
- Visual Mechanisms Inspired Efficient Transformers for Image and Video Quality Assessment [5.584060970507507]
Perceptual mechanisms in the human visual system play a crucial role in the generation of quality perception.
This paper proposes a general framework for no-reference visual quality assessment using efficient windowed transformer architectures.
arXiv Detail & Related papers (2022-03-28T07:55:11Z)
- FUNQUE: Fusion of Unified Quality Evaluators [42.41484412777326]
Fusion-based quality assessment has emerged as a powerful method for developing high-performance quality models.
We propose FUNQUE, a quality model that fuses unified quality evaluators.
arXiv Detail & Related papers (2022-02-23T00:21:43Z)
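As referenced from the Re-CatVTON entry above, the metrics it reports (FID, KID, LPIPS, and SSIM) are standard image-similarity measures. Below is a minimal sketch of computing all four with torchmetrics (installable via `pip install torchmetrics[image]`); the random `real`/`fake` tensors are hypothetical stand-ins for reference photos and generated try-on images, not data from any of these papers.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity
from torchmetrics.image import StructuralSimilarityIndexMeasure

# Placeholder uint8 image batches (N, C, H, W) in [0, 255].
real = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)

# Distributional metrics: compare Inception feature statistics of the two sets.
# feature=64 keeps this toy example light; published numbers use feature=2048.
fid = FrechetInceptionDistance(feature=64)
fid.update(real, real=True)
fid.update(fake, real=False)
print("FID  :", fid.compute().item())

kid = KernelInceptionDistance(subset_size=8)  # subset_size must be <= sample count
kid.update(real, real=True)
kid.update(fake, real=False)
kid_mean, kid_std = kid.compute()
print("KID  :", kid_mean.item())

# Paired, full-reference metrics: expect float images scaled to [0, 1].
real_f, fake_f = real.float() / 255.0, fake.float() / 255.0
lpips = LearnedPerceptualImagePatchSimilarity(net_type="vgg", normalize=True)
print("LPIPS:", lpips(fake_f, real_f).item())
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
print("SSIM :", ssim(fake_f, real_f).item())
```

Note that FID and KID compare distributions of unpaired image sets, while LPIPS and SSIM score aligned image pairs, which is why the two groups are fed differently above.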