FashionLOGO: Prompting Multimodal Large Language Models for Fashion Logo
Embeddings
- URL: http://arxiv.org/abs/2308.09012v1
- Date: Thu, 17 Aug 2023 14:30:26 GMT
- Title: FashionLOGO: Prompting Multimodal Large Language Models for Fashion Logo
Embeddings
- Authors: Yulin Su, Min Yang, Minghui Qiu, Jing Wang, Tao Wang
- Abstract summary: We propose a novel approach that leverages textual knowledge as an auxiliary to improve the robustness of logo embedding.
We adopt a cross-attention transformer to enable image embedding queries to learn supplementary knowledge from textual embeddings automatically.
Our experiments on three real-world datasets demonstrate that FashionLOGO learns generalized and robust logo embeddings.
- Score: 27.2486625516539
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Logo embedding plays a crucial role in various e-commerce applications by
facilitating image retrieval or recognition, such as intellectual property
protection and product search. However, current methods treat logo embedding as
a purely visual problem, which may limit their performance in real-world
scenarios. A notable issue is that the textual knowledge embedded in logo
images has not been adequately explored. Therefore, we propose a novel approach
that leverages textual knowledge as an auxiliary to improve the robustness of
logo embedding. The emerging Multimodal Large Language Models (MLLMs) have
demonstrated remarkable capabilities in both visual and textual understanding
and could become valuable visual assistants in understanding logo images.
Inspired by this observation, our proposed method, FashionLOGO, aims to utilize
MLLMs to enhance fashion logo embedding. We explore how MLLMs can improve logo
embedding by prompting them to generate explicit textual knowledge through
three types of prompts: image OCR, brief caption, and detailed description
prompts, all in a zero-shot setting. We adopt a cross-attention
transformer to enable image embedding queries to learn supplementary knowledge
from textual embeddings automatically. To reduce computational costs, we only
use the image embedding model in the inference stage, similar to traditional
inference pipelines. Our extensive experiments on three real-world datasets
demonstrate that FashionLOGO learns generalized and robust logo embeddings,
achieving state-of-the-art performance in all benchmark datasets. Furthermore,
we conduct comprehensive ablation studies to demonstrate the performance
improvements resulting from the introduction of MLLMs.
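The fusion step described in the abstract, in which image-embedding queries attend over the textual embeddings of MLLM-generated knowledge, can be sketched as a single cross-attention layer. This is a minimal NumPy illustration, not the paper's actual implementation: the prompt wordings, random projection weights, and dimensions are hypothetical stand-ins for learned components.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical zero-shot prompts of the three types named in the abstract
# (image OCR, brief caption, detailed description); exact wording is assumed.
PROMPTS = {
    "ocr": "What text appears in this logo image?",
    "brief_caption": "Briefly caption this logo image.",
    "detailed_description": "Describe this logo image in detail.",
}

def cross_attention_fuse(image_emb, text_embs, d_k=64, rng=None):
    """Fuse a visual logo embedding with MLLM text embeddings.

    image_emb: (d,) image embedding acting as the attention query.
    text_embs: (n, d) embeddings of the n MLLM-generated texts
               (keys and values), one per prompt type.
    The random projections stand in for learned weight matrices.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    d = image_emb.shape[-1]
    Wq = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)

    q = image_emb @ Wq                     # (d_k,) query from the image
    K = text_embs @ Wk                     # (n, d_k) keys from the texts
    V = text_embs @ Wv                     # (n, d) values from the texts
    attn = softmax(K @ q / np.sqrt(d_k))   # (n,) weights over the texts
    fused = image_emb + attn @ V           # residual fusion into the query
    return fused, attn
```

Note that, per the abstract, this fusion only supplements training: at inference time only the image embedding model runs, so the pipeline matches a traditional visual retrieval setup.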
Related papers
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
- OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding [112.87441334765693]
OMG-LLaVA is a new framework combining powerful pixel-level vision understanding with reasoning abilities.
It can accept various visual and text prompts for flexible user interaction.
OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model.
arXiv Detail & Related papers (2024-06-27T17:59:01Z)
- StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception, Comprehension, and Beyond [68.0107158115377]
We have crafted an efficient vision-language model, StrucTexTv3, tailored to tackle various intelligent tasks for text-rich images.
We enhance the perception and comprehension abilities of StrucTexTv3 through instruction learning.
Our method achieved SOTA results in text-rich image perception tasks, and significantly improved performance in comprehension tasks.
arXiv Detail & Related papers (2024-05-31T16:55:04Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Beyond Text: Frozen Large Language Models in Visual Signal Comprehension [34.398976855955404]
Vision-to-Language Tokenizer, abbreviated as V2T Tokenizer, transforms an image into a "foreign language" with the combined aid of an encoder-decoder, the LLM vocabulary, and a CLIP model.
We undertake rigorous experiments to validate our method, encompassing understanding tasks like image recognition, image captioning, and visual question answering, as well as image denoising tasks like inpainting, outpainting, deblurring, and shift restoration.
arXiv Detail & Related papers (2024-03-12T17:59:51Z)
- Generative Cross-Modal Retrieval: Memorizing Images in Multimodal Language Models for Retrieval and Beyond [99.73306923465424]
We introduce a generative cross-modal retrieval framework, which assigns unique identifier strings to represent images.
By memorizing images in MLLMs, we introduce a new paradigm to cross-modal retrieval, distinct from previous discriminative approaches.
arXiv Detail & Related papers (2024-02-16T16:31:46Z)
- InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition [111.65584066987036]
InternLM-XComposer is a vision-language large model that enables advanced image-text comprehension and composition.
It can effortlessly generate coherent and contextual articles that seamlessly integrate images.
It can intelligently identify the areas in the text where images would enhance the content and automatically insert the most appropriate visual candidates.
arXiv Detail & Related papers (2023-09-26T17:58:20Z)
- Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models [24.456117679941816]
Contrastive Reading Model (Cream) is a novel neural architecture designed to enhance the language-image understanding capability of Large Language Models (LLMs).
Our approach bridges the gap between vision and language understanding, paving the way for the development of more sophisticated Document Intelligence Assistants.
arXiv Detail & Related papers (2023-05-24T11:59:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.