OpenVTON-Bench: A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On Evaluation
- URL: http://arxiv.org/abs/2601.22725v1
- Date: Fri, 30 Jan 2026 08:58:00 GMT
- Title: OpenVTON-Bench: A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On Evaluation
- Authors: Jin Li, Tao Chen, Shuai Jiang, Weijie Wang, Jingwen Luo, Chenhui Wu,
- Abstract summary: We present OpenVTON-Bench, a large-scale benchmark comprising approximately 100K high-resolution image pairs. The dataset is constructed using DINOv3-based hierarchical clustering for semantically balanced sampling and Gemini-powered dense captioning. We propose a multi-modal protocol that measures VTON quality along five interpretable dimensions: background consistency, identity fidelity, texture fidelity, shape plausibility, and overall realism.
- Score: 14.782532923428084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in diffusion models have significantly elevated the visual fidelity of Virtual Try-On (VTON) systems, yet reliable evaluation remains a persistent bottleneck. Traditional metrics struggle to quantify fine-grained texture details and semantic consistency, while existing datasets fail to meet commercial standards in scale and diversity. We present OpenVTON-Bench, a large-scale benchmark comprising approximately 100K high-resolution image pairs (up to $1536 \times 1536$). The dataset is constructed using DINOv3-based hierarchical clustering for semantically balanced sampling and Gemini-powered dense captioning, ensuring a uniform distribution across 20 fine-grained garment categories. To support reliable evaluation, we propose a multi-modal protocol that measures VTON quality along five interpretable dimensions: background consistency, identity fidelity, texture fidelity, shape plausibility, and overall realism. The protocol integrates VLM-based semantic reasoning with a novel Multi-Scale Representation Metric based on SAM3 segmentation and morphological erosion, enabling the separation of boundary alignment errors from internal texture artifacts. Experimental results show strong agreement with human judgments (Kendall's $\tau$ of 0.833 vs. 0.611 for SSIM), establishing a robust benchmark for VTON evaluation.
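The abstract's separation of boundary alignment errors from internal texture artifacts can be illustrated with a minimal sketch: erode the garment mask to define an interior region, score the eroded interior and the remaining boundary band separately, then check rank agreement with human scores via Kendall's $\tau$. The toy images, mask, per-region MAE scoring, and score lists below are illustrative assumptions, not the authors' SAM3-based pipeline.

```python
import numpy as np
from scipy.ndimage import binary_erosion
from scipy.stats import kendalltau

def region_errors(pred, ref, mask, erosion_px=3):
    """Split mean absolute error into interior vs. boundary-band terms."""
    interior = binary_erosion(mask, iterations=erosion_px)
    boundary = mask & ~interior  # thin band along the garment edge
    err = np.abs(pred - ref)
    return err[interior].mean(), err[boundary].mean()

# Toy data standing in for a try-on result and its reference.
rng = np.random.default_rng(0)
ref = rng.random((64, 64))
pred = ref + 0.05 * rng.standard_normal((64, 64))  # mostly-faithful output
mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 16:48] = True  # hypothetical garment segmentation mask
inner_err, edge_err = region_errors(pred, ref, mask)

# Agreement with human judgments, as in the paper's protocol
# (which reports tau = 0.833 vs. 0.611 for SSIM); scores here are made up.
metric_scores = [0.9, 0.7, 0.8, 0.4, 0.6]
human_mos = [0.95, 0.65, 0.85, 0.3, 0.55]
tau, _ = kendalltau(metric_scores, human_mos)
```

Scoring the boundary band separately means a mis-aligned garment edge inflates `edge_err` without polluting the interior texture score.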
Related papers
- Structure-Informed Estimation for Pilot-Limited MIMO Channels via Tensor Decomposition [51.56484100374058]
This paper formulates pilot-limited channel estimation as low-rank tensor completion from sparse observations. Experiments on synthetic channels demonstrate a 10-20 dB normalized mean-square error (NMSE) improvement over least-squares (LS), and evaluations on DeepMIMO ray-tracing channels show a 24-44% additional NMSE reduction over pure tensor-based methods.
arXiv Detail & Related papers (2026-02-03T23:38:05Z) - VTONQA: A Multi-Dimensional Quality Assessment Dataset for Virtual Try-on [83.39966045949338]
VTONQA is the first multi-dimensional quality assessment dataset specifically designed for VTON. It contains 8,132 images generated by 11 representative VTON models, along with 24,396 mean opinion scores (MOSs) across three evaluation dimensions. We benchmark both VTON models and a diverse set of image quality assessment (IQA) metrics, revealing the limitations of existing methods.
arXiv Detail & Related papers (2026-01-06T11:42:26Z) - Visual Autoregressive Modelling for Monocular Depth Estimation [69.01449528371916]
We propose a monocular depth estimation method based on visual autoregressive (VAR) priors. Our method adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism. We report state-of-the-art performance on indoor benchmarks under constrained training conditions, and strong performance when applied to outdoor datasets.
arXiv Detail & Related papers (2025-12-27T17:08:03Z) - DAMBench: A Multi-Modal Benchmark for Deep Learning-based Atmospheric Data Assimilation [14.776071715723262]
We introduce DAMBench, the first large-scale multi-modal benchmark to evaluate data-driven DA models under realistic atmospheric conditions. DAMBench integrates high-quality background states from state-of-the-art forecasting systems and real-world multi-modal observations. We provide unified evaluation protocols and benchmark representative data assimilation approaches.
arXiv Detail & Related papers (2025-11-03T11:26:26Z) - Knowledge-Informed Neural Network for Complex-Valued SAR Image Recognition [51.03674130115878]
We introduce the Knowledge-Informed Neural Network (KINN), a lightweight framework built upon a novel "compression-aggregation-compression" architecture. KINN establishes a new state-of-the-art in parameter-efficient recognition, offering exceptional generalization in data-scarce and out-of-distribution scenarios.
arXiv Detail & Related papers (2025-10-23T07:12:26Z) - Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input [25.671340854789236]
Res-Bench is a benchmark comprising 14,400 samples across 12 resolution levels and six core capability dimensions. This framework introduces multiple robustness metrics: Spearman's correlation for assessing resolution-performance trends, and Absolute/Relative Continuous Error (ACE/RCE) for measuring performance volatility. Our analysis encompasses: (1) model-centric and task-centric robustness examination, (2) investigation of preprocessing strategies including padding and super-resolution, and (3) exploration of fine-tuning for stability enhancement.
arXiv Detail & Related papers (2025-10-19T16:53:01Z) - Test-Time Consistency in Vision Language Models [26.475993408532304]
Vision-Language Models (VLMs) have achieved impressive performance across a wide range of multimodal tasks. Recent benchmarks, such as MM-R3, highlight that even state-of-the-art VLMs can produce divergent predictions across semantically equivalent inputs. We propose a simple and effective test-time consistency framework that enhances semantic consistency without supervised re-training.
arXiv Detail & Related papers (2025-06-27T17:09:44Z) - VTBench: Comprehensive Benchmark Suite Towards Real-World Virtual Try-on Models [3.7098434045639874]
We introduce VTBench, a hierarchical benchmark suite that decomposes virtual image try-on into hierarchical, disentangled dimensions. The benchmark encompasses five critical dimensions for virtual try-on generation. VTBench will be open-sourced, including all test sets, evaluation protocols, generated results, and human annotations.
arXiv Detail & Related papers (2025-05-26T06:37:11Z) - DEFOM-Stereo: Depth Foundation Model Based Stereo Matching [12.22373236061929]
DEFOM-Stereo is built to facilitate robust stereo matching with monocular depth cues. It is verified to have much stronger zero-shot generalization compared with SOTA methods. Our model simultaneously outperforms previous models on the individual benchmarks.
arXiv Detail & Related papers (2025-01-16T10:59:29Z) - Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection [70.84835546732738]
RGB-Thermal Salient Object Detection aims to pinpoint prominent objects within aligned pairs of visible and thermal infrared images. Traditional encoder-decoder architectures may not have adequately considered the robustness against noise originating from defective modalities. We propose the ConTriNet, a robust Confluent Triple-Flow Network employing a Divide-and-Conquer strategy.
arXiv Detail & Related papers (2024-12-02T14:44:39Z) - Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [49.15931834209624]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world. We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique. By further elaborating the robustness metric, a model is judged to be robust if its performance is consistently accurate on the overall cliques.
arXiv Detail & Related papers (2023-05-23T12:05:09Z) - Trusted Multi-View Classification [76.73585034192894]
We propose a novel multi-view classification method, termed trusted multi-view classification.
It provides a new paradigm for multi-view learning by dynamically integrating different views at an evidence level.
The proposed algorithm jointly utilizes multiple views to promote both classification reliability and robustness.
arXiv Detail & Related papers (2021-02-03T13:30:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.