TATTOO: Training-free AesTheTic-aware Outfit recOmmendation
- URL: http://arxiv.org/abs/2509.23242v1
- Date: Sat, 27 Sep 2025 10:46:55 GMT
- Title: TATTOO: Training-free AesTheTic-aware Outfit recOmmendation
- Authors: Yuntian Wu, Xiaonan Hu, Ziqi Zhou, Hao Lu
- Abstract summary: TATTOO is a Training-free AesTheTic-aware Outfit recommendation approach. It first generates a target-item description using MLLMs, then applies an aesthetic chain-of-thought to distill the images into a structured aesthetic profile. Experiments on a real-world evaluation set, Aesthetic-100, show that TATTOO achieves state-of-the-art performance compared with existing training-based methods.
- Score: 9.087314807392415
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The global fashion e-commerce market relies significantly on intelligent, aesthetic-aware outfit-completion tools to promote sales. While previous studies have approached the problems of fashion outfit completion and compatible-item retrieval, most of them require expensive, task-specific training on large-scale labeled data, and no effort has been made to guide outfit recommendation with explicit human aesthetics. In the era of Multimodal Large Language Models (MLLMs), we show that the conventional training-based pipeline can be streamlined into a training-free paradigm, with better recommendation scores and enhanced aesthetic awareness. We achieve this with TATTOO, a Training-free AesTheTic-aware Outfit recommendation approach. It first generates a target-item description using MLLMs, then applies an aesthetic chain-of-thought to distill the images into a structured aesthetic profile covering color, style, occasion, season, material, and balance. By fusing the visual summary of the outfit with the textual description and aesthetic vectors through a dynamic entropy-gated mechanism, candidate items can be represented in a shared embedding space and ranked accordingly. Experiments on a real-world evaluation set, Aesthetic-100, show that TATTOO achieves state-of-the-art performance compared with existing training-based methods. The standard Polyvore dataset is also used to measure the strong zero-shot retrieval capability of our training-free method.
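The entropy-gated fusion is the most mechanism-heavy step in the abstract, so a small sketch may help. Below is a minimal illustration of how a visual outfit embedding, a textual description embedding, and an aesthetic-profile embedding could be fused with entropy-based gates and used to rank candidates in a shared embedding space. The gating rule (a modality whose similarity profile over the candidates has low entropy, i.e. is more decisive, gets a larger weight) and all names below are assumptions for illustration, not the authors' released implementation.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a probability vector."""
    p = np.clip(p, eps, 1.0)
    return float(-(p * np.log(p)).sum())

def entropy_gated_fusion(visual, text, aesthetic, candidates):
    """Fuse three modality embeddings with entropy-based gates, then
    rank candidate items by cosine similarity in the shared space.

    visual, text, aesthetic: 1-D unit-norm embeddings of equal dim.
    candidates: (N, dim) array of unit-norm candidate-item embeddings.
    """
    gates = []
    for emb in (visual, text, aesthetic):
        # Each modality's similarity profile over the candidates is
        # turned into a distribution; a peaked (low-entropy) profile
        # means the modality is decisive and receives a larger gate.
        sims = candidates @ emb
        probs = np.exp(sims) / np.exp(sims).sum()  # softmax
        gates.append(1.0 / (1.0 + entropy(probs)))
    gates = np.asarray(gates)
    gates /= gates.sum()  # normalize gate weights

    query = gates[0] * visual + gates[1] * text + gates[2] * aesthetic
    query /= np.linalg.norm(query)

    scores = candidates @ query          # cosine similarities
    return np.argsort(-scores), scores   # indices ranked best-first

# Toy usage with random unit vectors.
rng = np.random.default_rng(0)
dim, n = 64, 10
unit = lambda v: v / np.linalg.norm(v)
visual, text, aesthetic = (unit(rng.normal(size=dim)) for _ in range(3))
cands = rng.normal(size=(n, dim))
cands /= np.linalg.norm(cands, axis=1, keepdims=True)
ranking, _ = entropy_gated_fusion(visual, text, aesthetic, cands)
print(ranking[:3])  # top-3 candidate indices
```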
Related papers
- AesRec: A Dataset for Aesthetics-Aligned Clothing Outfit Recommendation [17.478482513222826]
We present the AesRec benchmark dataset featuring systematic quantitative aesthetic annotations. At the item level, six dimensions are independently assessed: silhouette, chromaticity, materiality, craftsmanship, wearability, and item-level impression (a sketch of such a record follows this entry). We conduct rigorous human-machine consistency validation on a fashion dataset, confirming the reliability of the generated ratings.
arXiv Detail & Related papers (2026-02-03T11:44:00Z)
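To make AesRec's item-level annotation scheme concrete, here is a minimal sketch of what one annotation record could look like. The field names mirror the six dimensions listed above, but the integer 1-5 scale and the helper method are assumptions for illustration, not the dataset's published schema.

```python
from dataclasses import dataclass

@dataclass
class ItemAesthetics:
    """Hypothetical item-level record with AesRec's six dimensions.

    The 1-5 integer scale is an assumption; the paper only states
    that the six dimensions are assessed independently.
    """
    silhouette: int
    chromaticity: int
    materiality: int
    craftsmanship: int
    wearability: int
    impression: int  # item-level impression

    def mean_score(self) -> float:
        """Average rating across the six dimensions."""
        vals = (self.silhouette, self.chromaticity, self.materiality,
                self.craftsmanship, self.wearability, self.impression)
        return sum(vals) / len(vals)

# Example usage with made-up ratings:
blazer = ItemAesthetics(4, 3, 5, 4, 4, 4)
print(blazer.mean_score())  # 4.0
```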
- Learning to Synthesize Compatible Fashion Items Using Semantic Alignment and Collocation Classification: An Outfit Generation Framework [59.09707044733695]
We propose a novel outfit generation framework, OutfitGAN, which aims to synthesize an entire outfit. OutfitGAN includes a semantic alignment module responsible for characterizing the mapping correspondence between existing fashion items and the synthesized ones. To evaluate the performance of our proposed models, we built a large-scale dataset consisting of 20,000 fashion outfits.
arXiv Detail & Related papers (2025-02-05T12:13:53Z)
- Decoding Style: Efficient Fine-Tuning of LLMs for Image-Guided Outfit Recommendation with Preference [4.667044856219814]
This paper presents a novel framework that harnesses the expressive power of large language models (LLMs) for personalized outfit recommendations.
We bridge the visual-textual gap in item descriptions by employing image captioning with a Multimodal Large Language Model (MLLM).
The framework is evaluated on the Polyvore dataset, demonstrating its effectiveness in two key tasks: fill-in-the-blank and complementary-item retrieval.
arXiv Detail & Related papers (2024-09-18T17:15:06Z)
- Lost Your Style? Navigating with Semantic-Level Approach for Text-to-Outfit Retrieval [2.07180164747172]
We introduce a groundbreaking approach to fashion recommendation: the text-to-outfit retrieval task, which generates a complete outfit set based solely on textual descriptions.
Our model is devised at three semantic levels (item, style, and outfit), where each level progressively aggregates data to form a coherent outfit recommendation.
Using the Maryland Polyvore and Polyvore Outfit datasets, our approach significantly outperformed state-of-the-art models in text-video retrieval tasks.
arXiv Detail & Related papers (2023-11-03T07:23:21Z)
- Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models [64.24227572048075]
We propose a Knowledge-Aware Prompt Tuning (KAPT) framework for vision-language models.
Our approach takes inspiration from human intelligence, in which external knowledge is usually incorporated when recognizing novel categories of objects.
arXiv Detail & Related papers (2023-08-22T04:24:45Z)
- VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining [53.470662123170555]
We propose learning image aesthetics from user comments and explore vision-language pretraining methods to learn multimodal aesthetic representations.
Specifically, we pretrain an image-text encoder-decoder model on image-comment pairs, using contrastive and generative objectives to learn rich and generic aesthetic semantics without human labels (a rough sketch of this joint objective follows this entry).
Our results show that our pretrained aesthetic vision-language model outperforms prior works on image aesthetic captioning over the AVA-Captions dataset.
arXiv Detail & Related papers (2023-03-24T23:57:28Z)
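As a rough illustration of VILA's "contrastive plus generative" pretraining recipe, the sketch below combines an InfoNCE-style image-comment contrastive loss with a next-token captioning loss. The loss weighting, temperature, and all names are placeholder assumptions; the encoders and decoder producing these tensors are omitted.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """InfoNCE over a batch of matched image-comment pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature         # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: diagonal entries are the true matches.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def generative_loss(decoder_logits, comment_ids, pad_id=0):
    """Next-token prediction loss for generating the user comment."""
    return F.cross_entropy(decoder_logits.flatten(0, 1),  # (B*T, vocab)
                           comment_ids.flatten(),         # (B*T,)
                           ignore_index=pad_id)

def pretraining_loss(img_emb, txt_emb, decoder_logits, comment_ids,
                     gen_weight=1.0):
    """Joint objective: contrastive alignment + comment generation."""
    return (contrastive_loss(img_emb, txt_emb) +
            gen_weight * generative_loss(decoder_logits, comment_ids))
```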
- FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning [66.38951790650887]
Multimodal tasks in the fashion domain have significant potential for e-commerce.
We propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs (a generic triplet-loss sketch follows this entry).
We show that the triplet-based tasks are an effective addition to standard multimodal pre-training tasks.
arXiv Detail & Related papers (2022-10-26T21:01:19Z)
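The "weakly-supervised triplets" above suggest a standard triplet objective over fashion image-text embeddings. The sketch below is a generic cosine-distance triplet loss; the margin and the way triplets are mined are assumptions rather than FaD-VLP's actual recipe.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss over batched (anchor, positive, negative)
    embeddings: pull anchors toward positives, push away negatives."""
    d_pos = 1 - F.cosine_similarity(anchor, positive)  # distance to match
    d_neg = 1 - F.cosine_similarity(anchor, negative)  # distance to non-match
    return F.relu(d_pos - d_neg + margin).mean()
```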
- Semi-Supervised Visual Representation Learning for Fashion Compatibility [17.893627646979038]
We propose a semi-supervised learning approach that creates pseudo-positive and pseudo-negative outfits on the fly during training.
For each labeled outfit in a training batch, we obtain a pseudo-outfit by matching each item in the labeled outfit with unlabeled items (a minimal sketch of this matching step follows this entry).
We conduct extensive experiments on Polyvore, Polyvore-D, and our newly created large-scale Fashion Outfits datasets.
arXiv Detail & Related papers (2021-09-16T15:35:38Z)
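A minimal sketch of the pseudo-outfit construction described above: each item of a labeled outfit is swapped for its closest unlabeled item in embedding space. The nearest-neighbor matching rule and all names here are assumptions; the paper's actual matching criterion may differ.

```python
import numpy as np

def make_pseudo_outfit(outfit_embs, unlabeled_embs):
    """Replace each labeled item with its closest unlabeled item.

    outfit_embs:    (K, d) embeddings of one labeled outfit's items.
    unlabeled_embs: (M, d) embeddings of the unlabeled item pool.
    Returns indices into the unlabeled pool forming a pseudo-outfit.
    """
    # Cosine similarity between every outfit item and every unlabeled item.
    a = outfit_embs / np.linalg.norm(outfit_embs, axis=1, keepdims=True)
    b = unlabeled_embs / np.linalg.norm(unlabeled_embs, axis=1, keepdims=True)
    sims = a @ b.T                 # (K, M) similarity matrix
    return sims.argmax(axis=1)     # best unlabeled match per item

# The resulting pseudo-outfit can serve as a pseudo-positive, while
# randomly recombined items can serve as pseudo-negatives in training.
```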
- Self-supervised Visual Attribute Learning for Fashion Compatibility [71.73414832639698]
We present an SSL framework that learns color- and texture-aware features without requiring any labels during training.
Our approach consists of three self-supervised tasks designed to capture different concepts that are neglected in prior work.
We show that our approach can be used for transfer learning, demonstrating that we can train on one dataset while achieving high performance on a different dataset.
arXiv Detail & Related papers (2020-08-01T21:53:22Z)
- Learning Diverse Fashion Collocation by Neural Graph Filtering [78.9188246136867]
We propose a novel fashion collocation framework, Neural Graph Filtering, that models a flexible set of fashion items via a graph neural network.
By applying symmetric operations on the edge vectors, this framework allows varying numbers of inputs/outputs and is invariant to their ordering (a toy illustration follows this entry).
We evaluate the proposed approach on three popular benchmarks, the Polyvore dataset, the Polyvore-D dataset, and our reorganized Amazon Fashion dataset.
arXiv Detail & Related papers (2020-03-11T16:17:08Z)
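The order-invariance claim in the Neural Graph Filtering entry above follows from using symmetric (permutation-invariant) operations over the edge vectors. A toy illustration with made-up edge features, not the paper's actual network:

```python
import numpy as np

def symmetric_edge_pool(edge_vectors):
    """Aggregate a variable-size set of edge vectors with symmetric ops.

    Because sum, mean, and max are invariant to input order, the result
    is identical for any permutation of the edges, and any number of
    edges is accepted.
    """
    e = np.asarray(edge_vectors)                       # (num_edges, d)
    return np.concatenate([e.sum(0), e.mean(0), e.max(0)])

edges = np.random.default_rng(1).normal(size=(4, 8))
out1 = symmetric_edge_pool(edges)
out2 = symmetric_edge_pool(edges[::-1])                # reversed order
assert np.allclose(out1, out2)                         # order-invariant
```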
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.