OpenVTON-Bench: A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On Evaluation
- URL: http://arxiv.org/abs/2601.22725v1
- Date: Fri, 30 Jan 2026 08:58:00 GMT
- Title: OpenVTON-Bench: A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On Evaluation
- Authors: Jin Li, Tao Chen, Shuai Jiang, Weijie Wang, Jingwen Luo, Chenhui Wu,
- Abstract summary: We present OpenVTON-Bench, a large-scale benchmark comprising approximately 100K high-resolution image pairs. The dataset is constructed using DINOv3-based hierarchical clustering for semantically balanced sampling and Gemini-powered dense captioning. We propose a multi-modal protocol that measures VTON quality along five interpretable dimensions: background consistency, identity fidelity, texture fidelity, shape plausibility, and overall realism.
- Score: 14.782532923428084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in diffusion models have significantly elevated the visual fidelity of Virtual Try-On (VTON) systems, yet reliable evaluation remains a persistent bottleneck. Traditional metrics struggle to quantify fine-grained texture details and semantic consistency, while existing datasets fail to meet commercial standards in scale and diversity. We present OpenVTON-Bench, a large-scale benchmark comprising approximately 100K high-resolution image pairs (up to $1536 \times 1536$). The dataset is constructed using DINOv3-based hierarchical clustering for semantically balanced sampling and Gemini-powered dense captioning, ensuring a uniform distribution across 20 fine-grained garment categories. To support reliable evaluation, we propose a multi-modal protocol that measures VTON quality along five interpretable dimensions: background consistency, identity fidelity, texture fidelity, shape plausibility, and overall realism. The protocol integrates VLM-based semantic reasoning with a novel Multi-Scale Representation Metric based on SAM3 segmentation and morphological erosion, enabling the separation of boundary alignment errors from internal texture artifacts. Experimental results show strong agreement with human judgments (Kendall's $\tau$ of 0.833 vs. 0.611 for SSIM), establishing a robust benchmark for VTON evaluation.
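The abstract's separation of boundary alignment errors from internal texture artifacts can be illustrated with a minimal sketch: erode the garment mask to define an interior region, score the eroded interior and the remaining boundary band separately, then check rank agreement with human scores via Kendall's $\tau$. The toy images, mask, per-region MAE scoring, and score lists below are illustrative assumptions, not the authors' SAM3-based pipeline.

```python
import numpy as np
from scipy.ndimage import binary_erosion
from scipy.stats import kendalltau

def region_errors(pred, ref, mask, erosion_px=3):
    """Split mean absolute error into interior vs. boundary-band terms."""
    interior = binary_erosion(mask, iterations=erosion_px)
    boundary = mask & ~interior  # thin band along the garment edge
    err = np.abs(pred - ref)
    return err[interior].mean(), err[boundary].mean()

# Toy data standing in for a try-on result and its reference.
rng = np.random.default_rng(0)
ref = rng.random((64, 64))
pred = ref + 0.05 * rng.standard_normal((64, 64))  # mostly-faithful output
mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 16:48] = True  # hypothetical garment segmentation mask
inner_err, edge_err = region_errors(pred, ref, mask)

# Agreement with human judgments, as in the paper's protocol
# (which reports tau = 0.833 vs. 0.611 for SSIM); scores here are made up.
metric_scores = [0.9, 0.7, 0.8, 0.4, 0.6]
human_mos = [0.95, 0.65, 0.85, 0.3, 0.55]
tau, _ = kendalltau(metric_scores, human_mos)
```

Scoring the boundary band separately means a mis-aligned garment edge inflates `edge_err` without polluting the interior texture score.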
Related papers
- Structure-Informed Estimation for Pilot-Limited MIMO Channels via Tensor Decomposition [51.56484100374058]
This paper formulates pilot-limited channel estimation as low-rank tensor completion from sparse observations. Experiments on synthetic channels demonstrate a 10-20 dB normalized mean-square error (NMSE) improvement over least-squares (LS), and evaluations on DeepMIMO ray-tracing channels show a 24-44% additional NMSE reduction over pure tensor-based methods.
arXiv Detail & Related papers (2026-02-03T23:38:05Z) - VTONQA: A Multi-Dimensional Quality Assessment Dataset for Virtual Try-on [83.39966045949338]
VTONQA is the first multi-dimensional quality assessment dataset specifically designed for VTON. It contains 8,132 images generated by 11 representative VTON models, along with 24,396 mean opinion scores (MOSs) across three evaluation dimensions. We benchmark both VTON models and a diverse set of image quality assessment (IQA) metrics, revealing the limitations of existing methods.
arXiv Detail & Related papers (2026-01-06T11:42:26Z) - Visual Autoregressive Modelling for Monocular Depth Estimation [69.01449528371916]
We propose a monocular depth estimation method based on visual autoregressive (VAR) priors. Our method adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism. We report state-of-the-art performance on indoor benchmarks under constrained training conditions, and strong performance when applied to outdoor datasets.
arXiv Detail & Related papers (2025-12-27T17:08:03Z) - DAMBench: A Multi-Modal Benchmark for Deep Learning-based Atmospheric Data Assimilation [14.776071715723262]
We introduce DAMBench, the first large-scale multi-modal benchmark to evaluate data-driven DA models under realistic atmospheric conditions. DAMBench integrates high-quality background states from state-of-the-art forecasting systems and real-world multi-modal observations. We provide unified evaluation protocols and benchmark representative data assimilation approaches.
arXiv Detail & Related papers (2025-11-03T11:26:26Z) - Knowledge-Informed Neural Network for Complex-Valued SAR Image Recognition [51.03674130115878]
We introduce the Knowledge-Informed Neural Network (KINN), a lightweight framework built upon a novel "compression-aggregation-compression" architecture. KINN establishes a new state-of-the-art in parameter-efficient recognition, offering exceptional generalization in data-scarce and out-of-distribution scenarios.
arXiv Detail & Related papers (2025-10-23T07:12:26Z) - Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input [25.671340854789236]
Res-Bench is a benchmark comprising 14,400 samples across 12 resolution levels and six core capability dimensions. This framework introduces multiple robustness metrics: Spearman's correlation for assessing resolution-performance trends, and Absolute/Relative Continuous Error (ACE/RCE) for measuring performance volatility. Our analysis encompasses: (1) model-centric and task-centric robustness examination, (2) investigation of preprocessing strategies including padding and super-resolution, and (3) exploration of fine-tuning for stability enhancement.
arXiv Detail & Related papers (2025-10-19T16:53:01Z) - Test-Time Consistency in Vision Language Models [26.475993408532304]
Vision-Language Models (VLMs) have achieved impressive performance across a wide range of multimodal tasks. Recent benchmarks, such as MM-R3, highlight that even state-of-the-art VLMs can produce divergent predictions across semantically equivalent inputs. We propose a simple and effective test-time consistency framework that enhances semantic consistency without supervised re-training.
arXiv Detail & Related papers (2025-06-27T17:09:44Z) - VTBench: Comprehensive Benchmark Suite Towards Real-World Virtual Try-on Models [3.7098434045639874]
We introduce VTBench, a hierarchical benchmark suite that decomposes virtual image try-on into hierarchical, disentangled dimensions. The benchmark encompasses five critical dimensions for virtual try-on generation. VTBench will be open-sourced, including all test sets, evaluation protocols, generated results, and human annotations.
arXiv Detail & Related papers (2025-05-26T06:37:11Z) - DEFOM-Stereo: Depth Foundation Model Based Stereo Matching [12.22373236061929]
DEFOM-Stereo is built to facilitate robust stereo matching with monocular depth cues. It is verified to have much stronger zero-shot generalization compared with SOTA methods. Our model simultaneously outperforms previous models on the individual benchmarks.
arXiv Detail & Related papers (2025-01-16T10:59:29Z) - Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection [70.84835546732738]
RGB-Thermal Salient Object Detection aims to pinpoint prominent objects within aligned pairs of visible and thermal infrared images. Traditional encoder-decoder architectures may not have adequately considered the robustness against noise originating from defective modalities. We propose the ConTriNet, a robust Confluent Triple-Flow Network employing a Divide-and-Conquer strategy.
arXiv Detail & Related papers (2024-12-02T14:44:39Z) - Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [49.15931834209624]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world. We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique. By further elaborating the robustness metric, a model is judged to be robust if its performance is consistently accurate on the overall cliques.
arXiv Detail & Related papers (2023-05-23T12:05:09Z) - Trusted Multi-View Classification [76.73585034192894]
We propose a novel multi-view classification method, termed trusted multi-view classification.
It provides a new paradigm for multi-view learning by dynamically integrating different views at an evidence level.
The proposed algorithm jointly utilizes multiple views to promote both classification reliability and robustness.
arXiv Detail & Related papers (2021-02-03T13:30:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.