Smart Multi-Modal Search: Contextual Sparse and Dense Embedding Integration in Adobe Express
- URL: http://arxiv.org/abs/2408.14698v2
- Date: Thu, 29 Aug 2024 15:14:48 GMT
- Title: Smart Multi-Modal Search: Contextual Sparse and Dense Embedding Integration in Adobe Express
- Authors: Cherag Aroraa, Tracy Holloway King, Jayant Kumar, Yi Lu, Sanat Sharma, Arvind Srikantan, David Uvalle, Josep Valls-Vargas, Harsha Vardhan
- Abstract summary: Building a scalable multi-modal search system requires fine-tuning several components.
We address considerations such as embedding model selection, the roles of embeddings in matching and ranking, and the balance between dense and sparse embeddings.
- Score: 3.8973445113342433
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As user content and queries become increasingly multi-modal, the need for effective multi-modal search systems has grown. Traditional search systems often rely on textual and metadata annotations for indexed images, while multi-modal embeddings like CLIP enable direct search using text and image embeddings. However, embedding-based approaches face challenges in integrating contextual features such as user locale and recency. Building a scalable multi-modal search system requires fine-tuning several components. This paper presents a multi-modal search architecture and a series of AB tests that optimize embeddings and multi-modal technologies in Adobe Express template search. We address considerations such as embedding model selection, the roles of embeddings in matching and ranking, and the balance between dense and sparse embeddings. Our iterative approach demonstrates how utilizing sparse, dense, and contextual features enhances short and long query search, significantly reduces null rates (over 70%), and increases click-through rates (CTR). Our findings provide insights into developing robust multi-modal search systems, thereby enhancing relevance for complex queries.
Related papers
- MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs [78.5013630951288]
This paper introduces techniques for advancing information retrieval with multimodal large language models (MLLMs).
We first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks.
We propose modality-aware hard negative mining to mitigate the modality bias exhibited by MLLM retrievers.
arXiv Detail & Related papers (2024-11-04T20:06:34Z)
- A Survey of Multimodal Composite Editing and Retrieval [7.966265020507201]
This survey is the first comprehensive review of the literature on multimodal composite retrieval.
It covers image-text composite editing, image-text composite retrieval, and other multimodal composite retrieval.
We systematically organize the application scenarios, methods, benchmarks, experiments, and future directions.
arXiv Detail & Related papers (2024-09-09T08:06:50Z)
- MRSE: An Efficient Multi-modality Retrieval System for Large Scale E-commerce [42.3177388371158]
Current Embedding-based Retrieval Systems embed queries and items into a shared low-dimensional space.
We propose MRSE, a Multi-modality Retrieval System that integrates text, item images, and user preferences.
MRSE achieves an 18.9% improvement in offline relevance and a 3.7% gain in online core metrics compared to Shopee's state-of-the-art uni-modality system.
arXiv Detail & Related papers (2024-08-27T11:21:19Z)
- Leveraging Large Language Models for Multimodal Search [0.6249768559720121]
This paper introduces a novel multimodal search model that achieves a new performance milestone on the Fashion200K dataset.
We also propose a novel search interface integrating Large Language Models (LLMs) to facilitate natural language interaction.
arXiv Detail & Related papers (2024-04-24T10:30:42Z)
- DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever [83.33209603041013]
We propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval.
Our approach introduces a multi-modal context generator to learn context features which are distilled into prompts within the pre-trained vision-language model CLIP.
To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to multi-modal representation space.
arXiv Detail & Related papers (2024-01-02T07:40:12Z)
- Searching, fast and slow, through product catalogs [5.077235981745305]
We present a unified architecture for SKU search that provides both a real-time suggestion system and a lower latency search system.
We show how our system vastly outperforms, in all aspects, the results provided by the default search engine.
arXiv Detail & Related papers (2024-01-01T12:30:46Z)
- Large Search Model: Redefining Search Stack in the Era of LLMs [63.503320030117145]
We introduce a novel conceptual framework called large search model, which redefines the conventional search stack by unifying search tasks with one large language model (LLM).
All tasks are formulated as autoregressive text generation problems, allowing for the customization of tasks through the use of natural language prompts.
This proposed framework capitalizes on the strong language understanding and reasoning capabilities of LLMs, offering the potential to enhance search result quality while simultaneously simplifying the existing cumbersome search stack.
arXiv Detail & Related papers (2023-10-23T05:52:09Z)
- End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries.
We introduce a retriever model "ReViz" that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
- Probabilistic Compositional Embeddings for Multimodal Image Retrieval [48.450232527041436]
We investigate a more challenging scenario for composing multiple multimodal queries in image retrieval.
Given an arbitrary number of query images and (or) texts, our goal is to retrieve target images containing the semantic concepts specified in multiple multimodal queries.
We propose a novel multimodal probabilistic composer (MPC) to learn an informative embedding that can flexibly encode the semantics of various queries.
arXiv Detail & Related papers (2022-04-12T14:45:37Z)
- Exposing Query Identification for Search Transparency [69.06545074617685]
We explore the feasibility of approximate exposing query identification (EQI) as a retrieval task by reversing the role of queries and documents in two classes of search systems.
We derive an evaluation metric to measure the quality of a ranking of exposing queries, and conduct an empirical analysis focusing on various practical aspects of approximate EQI.
arXiv Detail & Related papers (2021-10-14T20:19:27Z)