Beyond Text: Aligning Vision and Language for Multimodal E-Commerce Retrieval
- URL: http://arxiv.org/abs/2603.04836v1
- Date: Thu, 05 Mar 2026 05:43:45 GMT
- Title: Beyond Text: Aligning Vision and Language for Multimodal E-Commerce Retrieval
- Authors: Qujiaheng Zhang, Guangyue Xu, Fengjie Li
- Abstract summary: We study unified text-image fusion for two-tower retrieval models in the e-commerce domain. We demonstrate that domain-specific fine-tuning and two-stage alignment between the query and the product text and image modalities are both crucial for effective multimodal retrieval. We propose a novel modality fusion network to fuse image and text information and capture cross-modal complementary information.
- Score: 0.669087470775851
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern e-commerce search is inherently multimodal: customers make purchase decisions by jointly considering product text and visual information. However, most industrial retrieval and ranking systems primarily rely on textual information, underutilizing the rich visual signals available in product images. In this work, we study unified text-image fusion for two-tower retrieval models in the e-commerce domain. We demonstrate that domain-specific fine-tuning and two-stage alignment between the query and the product text and image modalities are both crucial for effective multimodal retrieval. Building on these insights, we propose a novel modality fusion network to fuse image and text information and capture cross-modal complementary information. Experiments on large-scale e-commerce datasets validate the effectiveness of the proposed approach.
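The abstract describes the architecture only at a high level. As a concrete reference point, here is a minimal PyTorch sketch of a two-tower retriever with a concatenation-plus-MLP fusion network on the product side; all module names, dimensions, the fusion design, and the in-batch contrastive loss are our assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a two-tower retrieval model with a text-image fusion
# product tower, loosely following the abstract's description. The fusion
# design (concatenate, then MLP) and all dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionProductTower(nn.Module):
    def __init__(self, text_dim=768, image_dim=512, embed_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.image_proj = nn.Linear(image_dim, embed_dim)
        # Fusion network: concatenate the projected modalities, then mix
        # them with an MLP so cross-modal complementary signals interact.
        self.fusion = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, text_feat, image_feat):
        t = self.text_proj(text_feat)
        v = self.image_proj(image_feat)
        fused = self.fusion(torch.cat([t, v], dim=-1))
        return F.normalize(fused, dim=-1)

class QueryTower(nn.Module):
    def __init__(self, query_dim=768, embed_dim=256):
        super().__init__()
        self.proj = nn.Linear(query_dim, embed_dim)

    def forward(self, query_feat):
        return F.normalize(self.proj(query_feat), dim=-1)

# In-batch contrastive training step: each query matches its own product.
query_tower, product_tower = QueryTower(), FusionProductTower()
q = query_tower(torch.randn(32, 768))                         # query embeddings
p = product_tower(torch.randn(32, 768), torch.randn(32, 512)) # fused product embeddings
logits = q @ p.T / 0.07                                       # temperature-scaled scores
loss = F.cross_entropy(logits, torch.arange(32))
```

The two towers share no parameters, so product embeddings can be precomputed and indexed offline, which is the usual reason industrial retrieval systems adopt this architecture.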
Related papers
- Turning Adversaries into Allies: Reversing Typographic Attacks for Multimodal E-Commerce Product Retrieval [2.0134842677651084]
Multimodal product retrieval systems in e-commerce platforms rely on effectively combining visual and textual signals to improve search relevance and user experience. We propose a novel method that reverses the logic of typographic attacks by rendering relevant textual content directly onto product images. We evaluate our method on three vertical-specific e-commerce datasets (sneakers, handbags, and trading cards) using six state-of-the-art vision foundation models.
arXiv Detail & Related papers (2025-11-07T15:24:18Z)
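The summary above suggests a simple preprocessing step: render the relevant text onto the product image before it reaches the vision encoder. A minimal Pillow sketch of that step follows; the banner placement, font, and the `render_text_onto_image` helper are illustrative assumptions, not the paper's published pipeline.

```python
# Sketch of the "reversed typographic attack" idea: render relevant product
# text directly onto the image before encoding it with a vision model.
# Banner layout and placement are illustrative assumptions.
from PIL import Image, ImageDraw

def render_text_onto_image(image: Image.Image, text: str) -> Image.Image:
    img = image.convert("RGB").copy()
    draw = ImageDraw.Draw(img)
    # Draw a white banner at the top, then the text on it, so the
    # rendered caption stays legible regardless of image content.
    banner_height = max(20, img.height // 10)
    draw.rectangle([0, 0, img.width, banner_height], fill="white")
    draw.text((5, 5), text, fill="black")
    return img

augmented = render_text_onto_image(Image.new("RGB", (224, 224)), "Air Max 90, size 10")
```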
- Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents [99.62178668680578]
We propose Vision-Centric Contrastive Learning (VC2L), a unified framework that models text, images, and their combinations using a single vision transformer. VC2L operates entirely in pixel space by rendering all inputs, whether textual, visual, or combined, as images. To capture complex cross-modal relationships in web documents, VC2L employs a snippet-level contrastive learning objective that aligns consecutive multimodal segments.
arXiv Detail & Related papers (2025-10-21T14:59:29Z)
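VC2L's snippet-level contrastive objective, as summarized above, aligns consecutive multimodal segments. One plausible form for such an objective is a symmetric InfoNCE loss over consecutive-snippet pairs; the sketch below assumes that reading, stubs out the shared vision encoder, and treats the pairing scheme and temperature as assumptions.

```python
# Sketch of a snippet-level contrastive objective in the spirit of VC2L:
# consecutive segments of a document (both rendered as images) are treated
# as positive pairs; all other in-batch segments serve as negatives.
import torch
import torch.nn.functional as F

def snippet_contrastive_loss(curr_emb, next_emb, temperature=0.07):
    """curr_emb[i] and next_emb[i] come from consecutive snippets of the
    same document; off-diagonal pairs act as in-batch negatives."""
    curr = F.normalize(curr_emb, dim=-1)
    nxt = F.normalize(next_emb, dim=-1)
    logits = curr @ nxt.T / temperature
    targets = torch.arange(len(curr))
    # Symmetric InfoNCE, averaging both matching directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

loss = snippet_contrastive_loss(torch.randn(16, 256), torch.randn(16, 256))
```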
- Multimodal semantic retrieval for product search [6.185573921868495]
We build a multimodal representation for product items in e-commerce search, in contrast to a pure-text representation of products. We demonstrate that a multimodal representation scheme for a product can improve purchase recall or relevance accuracy in semantic retrieval.
arXiv Detail & Related papers (2025-01-13T14:34:26Z)
- Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
We propose Leopard, an MLLM tailored for handling vision-language tasks involving multiple text-rich images. First, we curated about one million high-quality multimodal instruction-tuning data, tailored to text-rich, multi-image scenarios. Second, we proposed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
arXiv Detail & Related papers (2024-10-02T16:55:01Z)
- Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization [49.08348604716746]
Multimodal Summarization with Multimodal Output (MSMO) aims to produce a multimodal summary that integrates both text and relevant images.
In this paper, we propose an Entity-Guided Multimodal Summarization model (EGMS).
Our model, building on BART, utilizes dual multimodal encoders with shared weights to process text-image and entity-image information concurrently.
arXiv Detail & Related papers (2024-08-06T12:45:56Z)
- TIGeR: Unifying Text-to-Image Generation and Retrieval with Large Multimodal Models [96.72318842152148]
We propose a unified framework for text-to-image generation and retrieval with a single Large Multimodal Model (LMM). Specifically, we first explore the intrinsic discriminative abilities of LMMs and introduce an efficient generative retrieval method for text-to-image retrieval in a training-free manner. We then propose an autonomous decision mechanism to choose the best-matched one between generated and retrieved images as the response to the text prompt.
arXiv Detail & Related papers (2024-06-09T15:00:28Z)
- MVAM: Multi-View Attention Method for Fine-grained Image-Text Matching [65.87255122130188]
We propose a Multi-view Attention Method (MVAM) for image-text matching. We also incorporate an objective to explicitly encourage attention heads to focus on distinct aspects of the input data. Our method allows models to encode images and text from different perspectives and focus on more critical details, leading to better matching performance.
arXiv Detail & Related papers (2024-02-27T06:11:54Z)
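The MVAM summary points to multiple attention "views" plus a diversity objective. One plausible reading is a set of learned view queries attending over token or patch features, with a penalty on overlap between views; the sketch below assumes that reading, and the head count, dimensions, and penalty form are all our assumptions.

```python
# Sketch of a multi-view encoder in the spirit of MVAM: several learned
# "view" queries attend over one input's token/patch features, giving one
# embedding per view, plus a diversity penalty that pushes views apart.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiViewAttention(nn.Module):
    def __init__(self, dim=256, num_views=4):
        super().__init__()
        self.view_queries = nn.Parameter(torch.randn(num_views, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, tokens):  # tokens: (batch, seq_len, dim)
        q = self.view_queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        views, _ = self.attn(q, tokens, tokens)        # (batch, num_views, dim)
        v = F.normalize(views, dim=-1)
        # Diversity objective: penalize overlap between different views so
        # each head attends to a distinct aspect of the input.
        overlap = v @ v.transpose(1, 2)                # pairwise cosine sims
        off_diag = overlap - torch.diag_embed(torch.diagonal(overlap, dim1=1, dim2=2))
        return views, off_diag.abs().mean()

views, diversity_loss = MultiViewAttention()(torch.randn(8, 50, 256))
```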
- EDIS: Entity-Driven Image Search over Multimodal Web Content [95.40238328527931]
We introduce Entity-Driven Image Search (EDIS), a dataset for cross-modal image search in the news domain.
EDIS consists of 1 million web images from actual search engine results and curated datasets, with each image paired with a textual description.
arXiv Detail & Related papers (2023-05-23T02:59:19Z)
- Unified Vision-Language Representation Modeling for E-Commerce Same-Style Products Retrieval [12.588713044749177]
Same-style products retrieval plays an important role in e-commerce platforms.
We propose a unified vision-language modeling method for e-commerce same-style products retrieval.
It is capable of cross-modal product-to-product retrieval, as well as style transfer and user-interactive search.
arXiv Detail & Related papers (2023-02-10T07:24:23Z)
- Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval based framework (MoRe).
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve related knowledge of the input text and image in the knowledge corpus respectively.
arXiv Detail & Related papers (2022-12-03T13:11:32Z)
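MoRe's two retrieval modules, as described above, fetch related knowledge for the input text and image from a knowledge corpus. A minimal dense-retrieval sketch under that reading follows; the encoders are stand-ins, and the `dense_retrieve` helper is hypothetical rather than MoRe's actual retriever.

```python
# Sketch of MoRe-style dual retrieval: one module retrieves corpus entries
# related to the input text, another retrieves entries related to the input
# image; both use dense nearest-neighbor search here.
import torch
import torch.nn.functional as F

def dense_retrieve(query_emb, corpus_embs, k=3):
    # Cosine similarity between the query and every corpus entry,
    # then the indices of the k best matches.
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(corpus_embs, dim=-1).T
    return sims.topk(k, dim=-1).indices

corpus = torch.randn(1000, 256)                            # pre-encoded knowledge corpus
text_hits = dense_retrieve(torch.randn(1, 256), corpus)    # text retrieval module
image_hits = dense_retrieve(torch.randn(1, 256), corpus)   # image-based retrieval module
# The retrieved knowledge is then combined with the input for NER/RE.
```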
- ACE-BERT: Adversarial Cross-modal Enhanced BERT for E-commerce Retrieval [6.274310862007448]
We propose a novel Adversarial Cross-modal Enhanced BERT (ACE-BERT) for efficient E-commerce retrieval.
With the pre-trained enhanced BERT as the backbone network, ACE-BERT adopts adversarial learning to ensure the distribution consistency of different modality representations.
Experimental results demonstrate that ACE-BERT outperforms the state-of-the-art approaches on the retrieval task.
arXiv Detail & Related papers (2021-12-14T07:36:20Z)
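ACE-BERT's adversarial learning step, as summarized above, pushes the text and image representation distributions together. A common way to implement that kind of objective is a modality discriminator behind a gradient reversal layer; the sketch below takes that approach, and the GRL, discriminator shape, and loss are our assumptions rather than ACE-BERT's published design.

```python
# Sketch of adversarial modality alignment in the spirit of ACE-BERT: a
# discriminator tries to tell image embeddings from text embeddings, while
# a gradient reversal layer trains the encoders to fool it, pushing the
# two modality distributions toward consistency.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -grad  # reverse gradients flowing back into the encoders

discriminator = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1))

def adversarial_alignment_loss(text_emb, image_emb):
    feats = GradReverse.apply(torch.cat([text_emb, image_emb], dim=0))
    logits = discriminator(feats).squeeze(-1)
    labels = torch.cat([torch.zeros(len(text_emb)), torch.ones(len(image_emb))])
    return nn.functional.binary_cross_entropy_with_logits(logits, labels)

loss = adversarial_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
```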