FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval
- URL: http://arxiv.org/abs/2005.09801v2
- Date: Fri, 29 May 2020 05:56:10 GMT
- Title: FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval
- Authors: Dehong Gao, Linbo Jin, Ben Chen, Minghui Qiu, Peng Li, Yi Wei, Yi Hu and Hao Wang
- Abstract summary: FashionBERT learns high-level representations of texts and images.
FashionBERT achieves significant performance improvements over baseline and state-of-the-art approaches.
- Score: 31.822218310945036
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we address text and image matching in cross-modal retrieval for the fashion industry. Unlike matching in the general domain, fashion matching must pay much more attention to the fine-grained information in fashion images and texts. Pioneering approaches detect regions of interest (RoIs) in images and use the RoI embeddings as image representations. In general, RoIs tend to capture "object-level" information in fashion images, while fashion texts tend to describe more detailed information, e.g., styles and attributes. RoIs are thus not fine-grained enough for fashion text and image matching. To this end, we propose FashionBERT, which leverages image patches as image features. With a pre-trained BERT model as the backbone network, FashionBERT learns high-level representations of texts and images. We also propose an adaptive loss to trade off the multiple tasks in FashionBERT training. Two tasks (text and image matching, and cross-modal retrieval) are used to evaluate FashionBERT. Experiments on a public dataset demonstrate that FashionBERT achieves significant performance improvements over baseline and state-of-the-art approaches. In practice, FashionBERT is applied in a concrete cross-modal retrieval application, for which we provide a detailed analysis of matching performance and inference efficiency.
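To make the patch-based image representation concrete, the following is a minimal sketch of how image patches could be projected into the same embedding space as BERT word-piece tokens and concatenated into one multimodal sequence. The class name, patch size, and hidden size are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch of patch-based image features for a BERT-style
# text-image matching model (sizes and names are assumptions, not the
# paper's exact setup).
import torch
import torch.nn as nn

class PatchEmbedder(nn.Module):
    """Splits an image into a grid of non-overlapping patches and projects
    each patch to the transformer hidden size, so patch embeddings can be
    fed to BERT alongside word-piece embeddings."""
    def __init__(self, image_size=224, patch_size=32, hidden_size=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch
        # and applying a shared linear projection.
        self.proj = nn.Conv2d(3, hidden_size,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images):               # (B, 3, H, W)
        x = self.proj(images)                # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, D)

# Usage: concatenate word-piece embeddings and patch embeddings into a
# single sequence before the transformer encoder.
text_emb = torch.randn(2, 40, 768)                           # text tokens
patch_emb = PatchEmbedder()(torch.randn(2, 3, 224, 224))     # 49 patches
multimodal_input = torch.cat([text_emb, patch_emb], dim=1)   # (2, 89, 768)
```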
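The adaptive loss is described only at a high level in the abstract. One common way to realize an adaptive trade-off between pre-training tasks is uncertainty-style weighting with learnable per-task weights; the sketch below illustrates that general idea and is not necessarily FashionBERT's exact formulation. The three task losses named in the usage comment are typical multimodal pre-training objectives, assumed here for illustration.

```python
# Illustrative adaptive multi-task loss with learnable per-task weights
# (uncertainty-style weighting). Treat this as a hedged sketch of the
# general technique, not the paper's exact adaptive loss.
import torch
import torch.nn as nn

class AdaptiveMultiTaskLoss(nn.Module):
    def __init__(self, num_tasks=3):
        super().__init__()
        # One learnable log-variance per task; initialized to 0 so all
        # tasks start with unit weight.
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        # task_losses: sequence of scalar losses, e.g.
        # [masked_lm_loss, masked_patch_loss, matching_loss]
        total = 0.0
        for i, loss in enumerate(task_losses):
            precision = torch.exp(-self.log_vars[i])
            # Down-weight harder/noisier tasks; the additive log-variance
            # term keeps the weights from collapsing to zero.
            total = total + precision * loss + self.log_vars[i]
        return total

# Usage: the weights are optimized jointly with the model parameters.
criterion = AdaptiveMultiTaskLoss(num_tasks=3)
dummy_losses = [torch.tensor(1.2), torch.tensor(0.7), torch.tensor(0.4)]
print(criterion(dummy_losses))
```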
Related papers
- Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers [120.49126407479717]
This paper explores text-to-image diffusion models for Zero-Shot Sketch-based Image Retrieval (ZS-SBIR).
We highlight a pivotal discovery: the capacity of text-to-image diffusion models to seamlessly bridge the gap between sketches and photos.
arXiv Detail & Related papers (2024-03-12T00:02:03Z)
- Social Media Fashion Knowledge Extraction as Captioning [61.41631195195498]
We study the task of social media fashion knowledge extraction.
We transform the fashion knowledge into a natural language caption with a sentence transformation method.
Our framework then aims to generate the sentence-based fashion knowledge directly from the social media post.
arXiv Detail & Related papers (2023-09-28T09:07:48Z)
- FashionTex: Controllable Virtual Try-on with Text and Texture [29.7855591607239]
We propose a multi-modal interactive setting by combining the advantages of both text and texture for multi-level fashion manipulation.
The FashionTex framework can semantically control clothing types and local texture patterns without annotated pairwise training data.
arXiv Detail & Related papers (2023-05-08T04:10:36Z)
- FashionSAP: Symbols and Attributes Prompt for Fine-grained Fashion Vision-Language Pre-training [12.652002299515864]
We propose a method for fine-grained fashion vision-language pre-training based on fashion Symbols and Attributes Prompt (FashionSAP).
Firstly, we propose the fashion symbols, a novel abstract fashion concept layer, to represent different fashion items.
Secondly, the attributes prompt method is proposed to make the model learn specific attributes of fashion items explicitly.
arXiv Detail & Related papers (2023-04-11T08:20:17Z)
- FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning [66.38951790650887]
Multimodal tasks in the fashion domain have significant potential for e-commerce.
We propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs.
We show the triplet-based tasks are an effective addition to standard multimodal pre-training tasks.
arXiv Detail & Related papers (2022-10-26T21:01:19Z)
- ARMANI: Part-level Garment-Text Alignment for Unified Cross-Modal Fashion Design [66.68194916359309]
Cross-modal fashion image synthesis has emerged as one of the most promising directions in the generation domain.
MaskCLIP decomposes the garments into semantic parts, ensuring fine-grained and semantically accurate alignment between the visual and text information.
ARMANI discretizes an image into uniform tokens based on a learned cross-modal codebook in its first stage and uses a Transformer to model the distribution of image tokens for a real image.
arXiv Detail & Related papers (2022-08-11T03:44:02Z)
- FashionViL: Fashion-Focused Vision-and-Language Representation Learning [129.49630356651454]
We propose a novel fashion-focused Vision-and-Language (V+L) representation learning framework, dubbed FashionViL.
It contains two novel fashion-specific pre-training tasks designed specifically to exploit two intrinsic attributes of fashion V+L data.
Extensive experiments show that our FashionViL achieves a new state of the art across five downstream tasks.
arXiv Detail & Related papers (2022-07-17T12:06:27Z)
- Progressive Learning for Image Retrieval with Hybrid-Modality Queries [48.79599320198615]
Image retrieval with hybrid-modality queries is also known as composing text and image for image retrieval (CTI-IR).
We decompose the CTI-IR task into a three-stage learning problem to progressively learn the complex knowledge for image retrieval with hybrid-modality queries.
Our proposed model significantly outperforms state-of-the-art methods in mean Recall@K, by 24.9% and 9.5% on the Fashion-IQ and Shoes benchmark datasets, respectively.
arXiv Detail & Related papers (2022-04-24T08:10:06Z)
- Conversational Fashion Image Retrieval via Multiturn Natural Language Feedback [36.623221002330226]
We study the task of conversational fashion image retrieval via multiturn natural language feedback.
We propose a novel framework that can effectively handle conversational fashion image retrieval with multiturn natural language feedback texts.
arXiv Detail & Related papers (2021-06-08T06:34:25Z)
- A Strong Baseline for Fashion Retrieval with Person Re-Identification Models [0.0]
Fashion retrieval is the challenging task of finding an exact match for fashion items contained within an image.
We introduce a simple baseline model for fashion retrieval, significantly outperforming previous state-of-the-art results.
We conduct in-depth experiments on Street2Shop and DeepFashion datasets and validate our results.
arXiv Detail & Related papers (2020-03-09T12:50:15Z)