VTBench: Comprehensive Benchmark Suite Towards Real-World Virtual Try-on Models
- URL: http://arxiv.org/abs/2505.19571v1
- Date: Mon, 26 May 2025 06:37:11 GMT
- Title: VTBench: Comprehensive Benchmark Suite Towards Real-World Virtual Try-on Models
- Authors: Xiaobin Hu, Yujie Liang, Donghao Luo, Xu Peng, Jiangning Zhang, Junwei Zhu, Chengjie Wang, Yanwei Fu
- Abstract summary: We introduce VTBench, a benchmark suite that decomposes virtual image try-on into hierarchical, disentangled dimensions. The benchmark encompasses five critical dimensions for virtual try-on generation. VTBench will be open-sourced, including all test sets, evaluation protocols, generated results, and human annotations.
- Score: 3.7098434045639874
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While virtual try-on has achieved significant progress, evaluating these models in real-world scenarios remains a challenge. A comprehensive benchmark is essential for three key reasons: (1) current metrics inadequately reflect human perception, particularly in unpaired try-on settings; (2) most existing test sets are limited to indoor scenarios, lacking the complexity needed for real-world evaluation; and (3) an ideal system should guide future advancements in virtual try-on generation. To address these needs, we introduce VTBench, a benchmark suite that systematically decomposes virtual image try-on into hierarchical, disentangled dimensions, each equipped with tailored test sets and evaluation criteria. VTBench exhibits three key advantages: (1) Multi-Dimensional Evaluation Framework: the benchmark encompasses five critical dimensions for virtual try-on generation, namely overall image quality, texture preservation, complex background consistency, cross-category size adaptability, and hand-occlusion handling. Granular evaluation metrics on the corresponding test sets pinpoint model capabilities and limitations across diverse, challenging scenarios. (2) Human Alignment: human preference annotations are provided for each test set, ensuring the benchmark's alignment with perceptual quality across all evaluation dimensions. (3) Valuable Insights: beyond standard indoor settings, we analyze model performance variations across dimensions and investigate the disparity between indoor and real-world try-on scenarios. To push virtual try-on toward challenging real-world scenarios, VTBench will be open-sourced, including all test sets, evaluation protocols, generated results, and human annotations.
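Since the abstract does not specify the evaluation protocol, the following is a minimal sketch, under assumed interfaces, of how a benchmark like this could report per-dimension scores and check their alignment with human preference annotations; the dimension names follow the abstract, while the metric functions, data layout, and function names are hypothetical rather than the paper's released API.

```python
# Sketch of a VTBench-style per-dimension evaluation with a human-alignment
# check. All interfaces here are illustrative assumptions.
from scipy.stats import spearmanr

DIMENSIONS = [
    "overall_image_quality",
    "texture_preservation",
    "complex_background_consistency",
    "cross_category_size_adaptability",
    "hand_occlusion_handling",
]

def evaluate_model(metric_fns, samples):
    """Score every generated try-on result on each benchmark dimension.

    metric_fns: dict mapping a dimension name to a callable that takes one
                generated sample and returns a float score (assumed).
    samples:    list of generated results from one model's test-set run.
    """
    return {dim: [metric_fns[dim](s) for s in samples] for dim in DIMENSIONS}

def human_alignment(metric_scores, human_prefs):
    """Per-dimension Spearman rank correlation between automatic metric
    scores and human preference annotations; a higher value means the
    metric ranks outputs more like human judges do."""
    return {
        dim: spearmanr(metric_scores[dim], human_prefs[dim])[0]
        for dim in DIMENSIONS
    }
```

A rank correlation such as Spearman's is a common choice for this kind of alignment check because it compares orderings rather than raw score scales, which differ across metrics.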
Related papers
- OpenVTON-Bench: A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On Evaluation [14.782532923428084]
We present OpenVTON-Bench, a large-scale benchmark comprising approximately 100K high-resolution image pairs. The dataset is constructed using DINOv3-based hierarchical clustering for semantically balanced sampling and Gemini-powered dense captioning. We propose a multi-modal protocol that measures VTON quality along five interpretable dimensions: background consistency, identity fidelity, texture fidelity, shape plausibility, and overall realism.
arXiv Detail & Related papers (2026-01-30T08:58:00Z)
- UniREditBench: A Unified Reasoning-based Image Editing Benchmark [52.54256348710893]
This work proposes UniREditBench, a unified benchmark for reasoning-based image editing evaluation. It comprises 2,700 meticulously curated samples, covering both real- and game-world scenarios across 8 primary dimensions and 18 sub-dimensions. We fine-tune Bagel on this dataset and develop UniREdit-Bagel, demonstrating substantial improvements in both in-domain and out-of-distribution settings.
arXiv Detail & Related papers (2025-11-03T07:24:57Z)
- OutboundEval: A Dual-Dimensional Benchmark for Expert-Level Intelligent Outbound Evaluation of Xbench's Professional-Aligned Series [36.88936933010042]
OutboundEval is a comprehensive benchmark for evaluating large language models (LLMs) in intelligent outbound calling scenarios. We design a benchmark spanning six major business domains and 30 representative sub-scenarios, each with scenario-specific process decomposition, weighted scoring, and domain-adaptive metrics. We introduce a dynamic evaluation method that adapts to task variations, integrating automated and human-in-the-loop assessment to measure task execution accuracy, professional knowledge application, adaptability, and user experience quality.
arXiv Detail & Related papers (2025-10-24T08:27:58Z)
- Bridging the Gap Between Ideal and Real-world Evaluation: Benchmarking AI-Generated Image Detection in Challenging Scenarios [54.07895223545793]
This paper introduces the Real-World Robustness dataset (RRDataset) for comprehensive evaluation of detection models across three dimensions. RRDataset includes high-quality images from seven major scenarios. We benchmarked 17 detectors and 10 vision-language models (VLMs) on RRDataset and conducted a large-scale human study.
arXiv Detail & Related papers (2025-09-11T06:15:52Z)
- Hi3DEval: Advancing 3D Generation Evaluation with Hierarchical Validity [78.7107376451476]
Hi3DEval is a hierarchical evaluation framework tailored for 3D generative content. We extend texture evaluation beyond aesthetic appearance by explicitly assessing material realism. We propose a 3D-aware automated scoring system based on hybrid 3D representations.
arXiv Detail & Related papers (2025-08-07T17:50:13Z)
- VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning [10.497961559068493]
Visual transformation reasoning (VTR) is a vital cognitive capability that empowers intelligent agents to understand dynamic scenes. Existing benchmarks suffer from a sim-to-real gap, limited task complexity, and incomplete reasoning coverage. VisualTrans is the first comprehensive benchmark specifically designed for VTR in real-world human-object interaction scenarios.
arXiv Detail & Related papers (2025-08-06T03:07:05Z)
- Adapting Vision-Language Models for Evaluating World Models [24.813041196394582]
We present UNIVERSE, a method for adapting a Vision-language Evaluator for Rollouts in Simulated Environments under data and compute constraints. We conduct a large-scale study comparing full, partial, and parameter-efficient finetuning across task formats, context lengths, sampling strategies, and data compositions. The resulting unified evaluator matches the performance of task-specific baselines using a single checkpoint.
arXiv Detail & Related papers (2025-06-22T09:53:28Z)
- PoseBench3D: A Cross-Dataset Analysis Framework for 3D Human Pose Estimation [1.470703050699957]
We present a standardized testing environment in which each method is evaluated on a variety of datasets. We propose PoseBench3D, a unified framework designed to systematically re-evaluate prior and future models.
arXiv Detail & Related papers (2025-05-16T05:49:23Z)
- CrossVTON: Mimicking the Logic Reasoning on Cross-category Virtual Try-on guided by Tri-zone Priors [63.95051258676488]
CrossVTON is a framework for generating robust fitting images for cross-category virtual try-on. It disentangles the complex reasoning required for cross-category try-on into a structured framework. It achieves state-of-the-art performance, surpassing existing baselines in both qualitative and quantitative evaluations.
arXiv Detail & Related papers (2025-02-20T09:05:35Z)
- WorldSimBench: Towards Video Generation Models as World Simulators [79.69709361730865]
We classify the functionalities of predictive models into a hierarchy and take the first step in evaluating World Simulators by proposing a dual evaluation framework called WorldSimBench.
WorldSimBench includes Explicit Perceptual Evaluation and Implicit Manipulative Evaluation, encompassing human preference assessments from the visual perspective and action-level evaluations in embodied tasks.
Our comprehensive evaluation offers key insights that can drive further innovation in video generation models, positioning World Simulators as a pivotal advancement toward embodied artificial intelligence.
arXiv Detail & Related papers (2024-10-23T17:56:11Z)
- BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation [57.40024206484446]
We introduce the BEHAVIOR Vision Suite (BVS), a set of tools and assets to generate fully customized synthetic data for systematic evaluation of computer vision models.
BVS supports a large number of adjustable parameters at the scene level.
We showcase three example application scenarios.
arXiv Detail & Related papers (2024-05-15T17:57:56Z)
- VBench: Comprehensive Benchmark Suite for Video Generative Models [100.43756570261384]
VBench is a benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions.
We provide a dataset of human preference annotations to validate our benchmarks' alignment with human perception.
We will open-source VBench, including all prompts, evaluation methods, generated videos, and human preference annotations.
arXiv Detail & Related papers (2023-11-29T18:39:01Z)
- UMSE: Unified Multi-scenario Summarization Evaluation [52.60867881867428]
Summarization quality evaluation is a non-trivial task in text summarization.
We propose the Unified Multi-scenario Summarization Evaluation Model (UMSE).
UMSE is the first unified summarization evaluation framework that can be applied across three evaluation scenarios.
arXiv Detail & Related papers (2023-05-26T12:54:44Z)
- Uncertainty-Aware Adaptation for Self-Supervised 3D Human Pose Estimation [70.32536356351706]
We introduce MRP-Net, a common deep network backbone with two output heads corresponding to two diverse configurations.
We derive suitable measures to quantify prediction uncertainty at both pose and joint level.
We present a comprehensive evaluation of the proposed approach and demonstrate state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2022-03-29T07:14:58Z)
- Sim2Real Object-Centric Keypoint Detection and Description [40.58367357980036]
Keypoint detection and description play a central role in computer vision.
We propose the object-centric formulation, which requires further identifying which object each interest point belongs to.
We develop a sim2real contrastive learning mechanism that can generalize the model trained in simulation to real-world applications.
arXiv Detail & Related papers (2022-02-01T15:00:20Z)
- SVIRO: Synthetic Vehicle Interior Rear Seat Occupancy Dataset and Benchmark [11.101588888002045]
We release SVIRO, a synthetic dataset of scenes in the passenger compartment of ten different vehicles.
We analyze machine learning-based approaches for their generalization capacities and reliability when trained on a limited number of variations.
arXiv Detail & Related papers (2020-01-10T14:44:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.