LookBench: A Live and Holistic Open Benchmark for Fashion Image Retrieval
- URL: http://arxiv.org/abs/2601.14706v1
- Date: Wed, 21 Jan 2026 06:50:23 GMT
- Title: LookBench: A Live and Holistic Open Benchmark for Fashion Image Retrieval
- Authors: Chao Gao, Siqiao Xue, Yimin Peng, Jiwen Fu, Tingyi Gu, Shanshan Li, Fan Zhou
- Abstract summary: We present LookBench, a live, holistic and challenging benchmark for fashion image retrieval in real e-commerce settings. LookBench includes both recent product images sourced from live websites and AI-generated fashion images. Our experiments reveal that LookBench poses a significant challenge to strong baselines.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present LookBench (we use the term "look" to reflect retrieval that mirrors how people shop: finding the exact item, a close substitute, or a visually consistent alternative), a live, holistic, and challenging benchmark for fashion image retrieval in real e-commerce settings. LookBench includes both recent product images sourced from live websites and AI-generated fashion images, reflecting contemporary trends and use cases. Each test sample is time-stamped, and we intend to update the benchmark periodically, enabling contamination-aware evaluation aligned with declared training cutoffs. Grounded in our fine-grained attribute taxonomy, LookBench covers single-item and outfit-level retrieval. Our experiments reveal that LookBench poses a significant challenge to strong baselines, with many models achieving below $60\%$ Recall@1. Our proprietary model achieves the best performance on LookBench, and we release an open-source counterpart that ranks second, with both models attaining state-of-the-art results on legacy Fashion200K evaluations. LookBench is designed to be updated semi-annually with new test samples and progressively harder task variants, providing a durable measure of progress. We publicly release our leaderboard, dataset, evaluation code, and trained models.
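The two evaluation mechanisms described in the abstract, Recall@1 scoring and contamination-aware filtering against a declared training cutoff, can be sketched as follows. This is a minimal illustration, not the benchmark's actual evaluation code; the function and field names (`recall_at_1`, `filter_by_cutoff`, `timestamp`) are assumptions made for the example.

```python
from datetime import datetime

def recall_at_1(top1_preds, gold_ids):
    """Fraction of queries whose top-1 retrieved item matches the gold item."""
    hits = sum(1 for pred, gold in zip(top1_preds, gold_ids) if pred == gold)
    return hits / len(gold_ids)

def filter_by_cutoff(samples, training_cutoff):
    """Contamination-aware evaluation: keep only test samples time-stamped
    after the model's declared training cutoff."""
    return [s for s in samples if s["timestamp"] > training_cutoff]

# Hypothetical example: three queries with top-1 predictions vs. gold items.
preds = ["dress_01", "coat_77", "bag_12"]
gold = ["dress_01", "coat_90", "bag_12"]
print(recall_at_1(preds, gold))  # 2 of 3 correct -> 0.666...

# Hypothetical time-stamped test samples and a declared training cutoff.
samples = [
    {"id": "q1", "timestamp": datetime(2025, 11, 3)},
    {"id": "q2", "timestamp": datetime(2026, 1, 10)},
]
cutoff = datetime(2025, 12, 1)
print([s["id"] for s in filter_by_cutoff(samples, cutoff)])  # ['q2']
```

Because each sample carries its own timestamp, the same test set can be re-filtered against any model's cutoff, which is what makes the periodic-update design contamination-aware.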
Related papers
- Constantly Improving Image Models Need Constantly Improving Benchmarks [109.39018167487103]
We present ECHO, a framework for constructing benchmarks directly from real-world evidence of model use. Applying this framework to GPT-4o Image Gen, we construct a dataset of over 31,000 prompts curated from social media posts.
arXiv Detail & Related papers (2025-10-16T17:59:30Z) - MR$^2$-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval [86.35779264575154]
Multimodal retrieval is becoming a crucial component of modern AI applications, yet its evaluation lags behind the demands of more realistic and challenging scenarios. We introduce MR$^2$-Bench, a reasoning-intensive benchmark for multimodal retrieval.
arXiv Detail & Related papers (2025-09-30T15:09:14Z) - ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation [51.297873393639456]
ArtifactsBench is a framework for automated visual code generation evaluation. Our framework renders each generated artifact and captures its dynamic behavior through temporal screenshots. We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading Large Language Models.
arXiv Detail & Related papers (2025-07-07T12:53:00Z) - RewardBench 2: Advancing Reward Model Evaluation [71.65938693914153]
Reward models are used throughout the post-training of language models to capture nuanced signals from preference data. The community has begun establishing best practices for evaluating reward models. This paper introduces RewardBench 2, a new multi-skill reward modeling benchmark.
arXiv Detail & Related papers (2025-06-02T17:54:04Z) - LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable, evolving live benchmark based on scientific ArXiv papers. LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs. We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z) - LiveBench: A Challenging, Contamination-Limited LLM Benchmark [93.57775429120488]
We release LiveBench, the first benchmark that contains frequently-updated questions from recent information sources. We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 405B in size. Questions are added and updated on a monthly basis, and we release new tasks and harder versions of tasks over time.
arXiv Detail & Related papers (2024-06-27T16:47:42Z) - LRVS-Fashion: Extending Visual Search with Referring Instructions [13.590668564555195]
We present Referred Visual Search (RVS), a task allowing users to define more precisely the desired similarity.
We release a new large public dataset, LRVS-Fashion, consisting of 272k fashion products with 842k images extracted from fashion catalogs.
arXiv Detail & Related papers (2023-06-05T14:45:38Z) - HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models [39.38477117444303]
HRS-Bench is an evaluation benchmark for Text-to-Image (T2I) models.
It measures 13 skills that can be categorized into five major categories: accuracy, robustness, generalization, fairness, and bias.
It covers 50 scenarios, including fashion, animals, transportation, food, and clothes.
arXiv Detail & Related papers (2023-04-11T17:59:13Z) - GEFF: Improving Any Clothes-Changing Person ReID Model using Gallery Enrichment with Face Features [11.189236254478057]
In the Clothes-Changing Re-Identification (CC-ReID) problem, given a query sample of a person, the goal is to determine the correct identity based on a labeled gallery in which the person appears in different clothes.
Several models tackle this challenge by extracting clothes-independent features.
As clothing-related features are often the dominant features in the data, we propose a new process we call Gallery Enrichment.
arXiv Detail & Related papers (2022-11-24T21:41:52Z) - A Strong Baseline for Fashion Retrieval with Person Re-Identification Models [0.0]
Fashion retrieval is the challenging task of finding an exact match for fashion items contained within an image.
We introduce a simple baseline model for fashion retrieval, significantly outperforming previous state-of-the-art results.
We conduct in-depth experiments on Street2Shop and DeepFashion datasets and validate our results.
arXiv Detail & Related papers (2020-03-09T12:50:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.