LookBench: A Live and Holistic Open Benchmark for Fashion Image Retrieval
- URL: http://arxiv.org/abs/2601.14706v1
- Date: Wed, 21 Jan 2026 06:50:23 GMT
- Title: LookBench: A Live and Holistic Open Benchmark for Fashion Image Retrieval
- Authors: Chao Gao, Siqiao Xue, Yimin Peng, Jiwen Fu, Tingyi Gu, Shanshan Li, Fan Zhou
- Abstract summary: We present LookBench, a live, holistic and challenging benchmark for fashion image retrieval in real e-commerce settings. LookBench includes both recent product images sourced from live websites and AI-generated fashion images. Our experiments reveal that LookBench poses a significant challenge to strong baselines.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present LookBench (we use the term "look" to reflect retrieval that mirrors how people shop: finding the exact item, a close substitute, or a visually consistent alternative), a live, holistic, and challenging benchmark for fashion image retrieval in real e-commerce settings. LookBench includes both recent product images sourced from live websites and AI-generated fashion images, reflecting contemporary trends and use cases. Each test sample is time-stamped, and we intend to update the benchmark periodically, enabling contamination-aware evaluation aligned with declared training cutoffs. Grounded in our fine-grained attribute taxonomy, LookBench covers single-item and outfit-level retrieval. Our experiments reveal that LookBench poses a significant challenge to strong baselines, with many models achieving below $60\%$ Recall@1. Our proprietary model achieves the best performance on LookBench, and we release an open-source counterpart that ranks second, with both models attaining state-of-the-art results on legacy Fashion200K evaluations. LookBench is designed to be updated semi-annually with new test samples and progressively harder task variants, providing a durable measure of progress. We publicly release our leaderboard, dataset, evaluation code, and trained models.
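The two evaluation mechanisms described in the abstract, Recall@1 scoring and contamination-aware filtering against a declared training cutoff, can be sketched as follows. This is a minimal illustration, not the benchmark's actual evaluation code; the function and field names (`recall_at_1`, `filter_by_cutoff`, `timestamp`) are assumptions made for the example.

```python
from datetime import datetime

def recall_at_1(top1_preds, gold_ids):
    """Fraction of queries whose top-1 retrieved item matches the gold item."""
    hits = sum(1 for pred, gold in zip(top1_preds, gold_ids) if pred == gold)
    return hits / len(gold_ids)

def filter_by_cutoff(samples, training_cutoff):
    """Contamination-aware evaluation: keep only test samples time-stamped
    after the model's declared training cutoff."""
    return [s for s in samples if s["timestamp"] > training_cutoff]

# Hypothetical example: three queries with top-1 predictions vs. gold items.
preds = ["dress_01", "coat_77", "bag_12"]
gold = ["dress_01", "coat_90", "bag_12"]
print(recall_at_1(preds, gold))  # 2 of 3 correct -> 0.666...

# Hypothetical time-stamped test samples and a declared training cutoff.
samples = [
    {"id": "q1", "timestamp": datetime(2025, 11, 3)},
    {"id": "q2", "timestamp": datetime(2026, 1, 10)},
]
cutoff = datetime(2025, 12, 1)
print([s["id"] for s in filter_by_cutoff(samples, cutoff)])  # ['q2']
```

Because each sample carries its own timestamp, the same test set can be re-filtered against any model's cutoff, which is what makes the periodic-update design contamination-aware.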
Related papers
- Constantly Improving Image Models Need Constantly Improving Benchmarks [109.39018167487103]
We present ECHO, a framework for constructing benchmarks directly from real-world evidence of model use. Applying this framework to GPT-4o Image Gen, we construct a dataset of over 31,000 prompts curated from social media posts.
arXiv Detail & Related papers (2025-10-16T17:59:30Z) - MR$^2$-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval [86.35779264575154]
Multimodal retrieval is becoming a crucial component of modern AI applications, yet its evaluation lags behind the demands of more realistic and challenging scenarios. We introduce MR$^2$-Bench, a reasoning-intensive benchmark for multimodal retrieval.
arXiv Detail & Related papers (2025-09-30T15:09:14Z) - ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation [51.297873393639456]
ArtifactsBench is a framework for automated visual code generation evaluation. Our framework renders each generated artifact and captures its dynamic behavior through temporal screenshots. We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading Large Language Models.
arXiv Detail & Related papers (2025-07-07T12:53:00Z) - RewardBench 2: Advancing Reward Model Evaluation [71.65938693914153]
Reward models are used throughout the post-training of language models to capture nuanced signals from preference data. The community has begun establishing best practices for evaluating reward models. This paper introduces RewardBench 2, a new multi-skill reward modeling benchmark.
arXiv Detail & Related papers (2025-06-02T17:54:04Z) - LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable, evolving live benchmark based on scientific ArXiv papers. LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs. We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z) - LiveBench: A Challenging, Contamination-Limited LLM Benchmark [93.57775429120488]
We release LiveBench, the first benchmark that contains frequently-updated questions from recent information sources. We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 405B in size. Questions are added and updated on a monthly basis, and we release new tasks and harder versions of tasks over time.
arXiv Detail & Related papers (2024-06-27T16:47:42Z) - LRVS-Fashion: Extending Visual Search with Referring Instructions [13.590668564555195]
We present Referred Visual Search (RVS), a task allowing users to define more precisely the desired similarity.
We release a new large public dataset, LRVS-Fashion, consisting of 272k fashion products with 842k images extracted from fashion catalogs.
arXiv Detail & Related papers (2023-06-05T14:45:38Z) - HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models [39.38477117444303]
HRS-Bench is an evaluation benchmark for Text-to-Image (T2I) models.
It measures 13 skills that can be categorized into five major categories: accuracy, robustness, generalization, fairness, and bias.
It covers 50 scenarios, including fashion, animals, transportation, food, and clothes.
arXiv Detail & Related papers (2023-04-11T17:59:13Z) - GEFF: Improving Any Clothes-Changing Person ReID Model using Gallery Enrichment with Face Features [11.189236254478057]
In the Clothes-Changing Re-Identification (CC-ReID) problem, given a query sample of a person, the goal is to determine the correct identity based on a labeled gallery in which the person appears in different clothes.
Several models tackle this challenge by extracting clothes-independent features.
As clothing-related features are often the dominant features in the data, we propose a new process we call Gallery Enrichment.
arXiv Detail & Related papers (2022-11-24T21:41:52Z) - A Strong Baseline for Fashion Retrieval with Person Re-Identification Models [0.0]
Fashion retrieval is the challenging task of finding an exact match for fashion items contained within an image.
We introduce a simple baseline model for fashion retrieval, significantly outperforming previous state-of-the-art results.
We conduct in-depth experiments on Street2Shop and DeepFashion datasets and validate our results.
arXiv Detail & Related papers (2020-03-09T12:50:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.