FashionLOGO: Prompting Multimodal Large Language Models for Fashion Logo
Embeddings
- URL: http://arxiv.org/abs/2308.09012v1
- Date: Thu, 17 Aug 2023 14:30:26 GMT
- Title: FashionLOGO: Prompting Multimodal Large Language Models for Fashion Logo
Embeddings
- Authors: Yulin Su, Min Yang, Minghui Qiu, Jing Wang, Tao Wang
- Abstract summary: We propose a novel approach that leverages textual knowledge as an auxiliary to improve the robustness of logo embedding.
We adopt a cross-attention transformer to enable image embedding queries to learn supplementary knowledge from textual embeddings automatically.
Our experiments on three real-world datasets demonstrate that FashionLOGO learns generalized and robust logo embeddings.
- Score: 27.2486625516539
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Logo embedding plays a crucial role in various e-commerce applications by
facilitating image retrieval or recognition, such as intellectual property
protection and product search. However, current methods treat logo embedding as
a purely visual problem, which may limit their performance in real-world
scenarios. A notable issue is that the textual knowledge embedded in logo
images has not been adequately explored. Therefore, we propose a novel approach
that leverages textual knowledge as an auxiliary to improve the robustness of
logo embedding. The emerging Multimodal Large Language Models (MLLMs) have
demonstrated remarkable capabilities in both visual and textual understanding
and could become valuable visual assistants in understanding logo images.
Inspired by this observation, our proposed method, FashionLOGO, aims to utilize
MLLMs to enhance fashion logo embedding. We explore how MLLMs can improve logo
embedding by prompting them to generate explicit textual knowledge through
three types of prompts: image OCR, brief caption, and detailed description
prompts, all in a zero-shot setting. We adopt a cross-attention
transformer to enable image embedding queries to learn supplementary knowledge
from textual embeddings automatically. To reduce computational costs, we only
use the image embedding model in the inference stage, similar to traditional
inference pipelines. Our extensive experiments on three real-world datasets
demonstrate that FashionLOGO learns generalized and robust logo embeddings,
achieving state-of-the-art performance in all benchmark datasets. Furthermore,
we conduct comprehensive ablation studies to demonstrate the performance
improvements resulting from the introduction of MLLMs.
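The fusion step described in the abstract, in which image-embedding queries attend over the textual embeddings of MLLM-generated knowledge, can be sketched as a single cross-attention layer. This is a minimal NumPy illustration, not the paper's actual implementation: the prompt wordings, random projection weights, and dimensions are hypothetical stand-ins for learned components.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical zero-shot prompts of the three types named in the abstract
# (image OCR, brief caption, detailed description); exact wording is assumed.
PROMPTS = {
    "ocr": "What text appears in this logo image?",
    "brief_caption": "Briefly caption this logo image.",
    "detailed_description": "Describe this logo image in detail.",
}

def cross_attention_fuse(image_emb, text_embs, d_k=64, rng=None):
    """Fuse a visual logo embedding with MLLM text embeddings.

    image_emb: (d,) image embedding acting as the attention query.
    text_embs: (n, d) embeddings of the n MLLM-generated texts
               (keys and values), one per prompt type.
    The random projections stand in for learned weight matrices.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    d = image_emb.shape[-1]
    Wq = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)

    q = image_emb @ Wq                     # (d_k,) query from the image
    K = text_embs @ Wk                     # (n, d_k) keys from the texts
    V = text_embs @ Wv                     # (n, d) values from the texts
    attn = softmax(K @ q / np.sqrt(d_k))   # (n,) weights over the texts
    fused = image_emb + attn @ V           # residual fusion into the query
    return fused, attn
```

Note that, per the abstract, this fusion only supplements training: at inference time only the image embedding model runs, so the pipeline matches a traditional visual retrieval setup.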
Related papers
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
- OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding [112.87441334765693]
OMG-LLaVA is a new framework combining powerful pixel-level vision understanding with reasoning abilities.
It can accept various visual and text prompts for flexible user interaction.
OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model.
arXiv Detail & Related papers (2024-06-27T17:59:01Z)
- StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception, Comprehension, and Beyond [68.0107158115377]
We have crafted an efficient vision-language model, StrucTexTv3, tailored to tackle various intelligent tasks for text-rich images.
We enhance the perception and comprehension abilities of StrucTexTv3 through instruction learning.
Our method achieved SOTA results in text-rich image perception tasks, and significantly improved performance in comprehension tasks.
arXiv Detail & Related papers (2024-05-31T16:55:04Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Beyond Text: Frozen Large Language Models in Visual Signal Comprehension [34.398976855955404]
Vision-to-Language Tokenizer, abbreviated as V2T Tokenizer, transforms an image into a "foreign language" with the combined aid of an encoder-decoder, the LLM vocabulary, and a CLIP model.
We undertake rigorous experiments to validate our method, encompassing understanding tasks like image recognition, image captioning, and visual question answering, as well as image denoising tasks like inpainting, outpainting, deblurring, and shift restoration.
arXiv Detail & Related papers (2024-03-12T17:59:51Z)
- Generative Cross-Modal Retrieval: Memorizing Images in Multimodal Language Models for Retrieval and Beyond [99.73306923465424]
We introduce a generative cross-modal retrieval framework, which assigns unique identifier strings to represent images.
By memorizing images in MLLMs, we introduce a new paradigm to cross-modal retrieval, distinct from previous discriminative approaches.
arXiv Detail & Related papers (2024-02-16T16:31:46Z)
- InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition [111.65584066987036]
InternLM-XComposer is a vision-language large model that enables advanced image-text comprehension and composition.
It can effortlessly generate coherent and contextual articles that seamlessly integrate images.
It can intelligently identify the areas in the text where images would enhance the content and automatically insert the most appropriate visual candidates.
arXiv Detail & Related papers (2023-09-26T17:58:20Z)
- Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models [24.456117679941816]
Contrastive Reading Model (Cream) is a novel neural architecture designed to enhance the language-image understanding capability of Large Language Models (LLMs).
Our approach bridges the gap between vision and language understanding, paving the way for the development of more sophisticated Document Intelligence Assistants.
arXiv Detail & Related papers (2023-05-24T11:59:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.