Related papers: Bridging Modality Gaps in e-Commerce Products via Vision-Language Alignment

URL: http://arxiv.org/abs/2508.10116v1
Date: Wed, 13 Aug 2025 18:22:53 GMT
Title: Bridging Modality Gaps in e-Commerce Products via Vision-Language Alignment
Authors: Yipeng Zhang, Hongju Yu, Aritra Mandal, Canran Xu, Qunzhi Zhou, Zhe Wu,
Abstract summary: We propose optimized preference-based AI for listings (OPAL) to generate high-quality item descriptions from images. OPAL bridges the gap between visual and textual modalities, delivering richer, more accurate, and more consistent item descriptions. This work advances automated listing optimization and supports scalable, high-quality content generation in e-commerce platforms.
Score: 15.068156309599662
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Item information, such as titles and attributes, is essential for effective user engagement in e-commerce. However, manual or semi-manual entry of structured item specifics often produces inconsistent quality, errors, and slow turnaround, especially for Customer-to-Customer sellers. Generating accurate descriptions directly from item images offers a promising alternative. Existing retrieval-based solutions address some of these issues but often miss fine-grained visual details and struggle with niche or specialized categories. We propose Optimized Preference-Based AI for Listings (OPAL), a framework for generating schema-compliant, high-quality item descriptions from images using a fine-tuned multimodal large language model (MLLM). OPAL addresses key challenges in multimodal e-commerce applications, including bridging modality gaps and capturing detailed contextual information. It introduces two data refinement methods: MLLM-Assisted Conformity Enhancement, which ensures alignment with structured schema requirements, and LLM-Assisted Contextual Understanding, which improves the capture of nuanced and fine-grained information from visual inputs. OPAL uses visual instruction tuning combined with direct preference optimization to fine-tune the MLLM, reducing hallucinations and improving robustness across different backbone architectures. We evaluate OPAL on real-world e-commerce datasets, showing that it consistently outperforms baseline methods in both description quality and schema completion rates. These results demonstrate that OPAL effectively bridges the gap between visual and textual modalities, delivering richer, more accurate, and more consistent item descriptions. This work advances automated listing optimization and supports scalable, high-quality content generation in e-commerce platforms.

Related papers

Benchmarking Multimodal Large Language Models for Missing Modality Completion in Product Catalogues [19.732113077201326]
Missing-modality information on e-commerce platforms, such as absent product images or textual descriptions, often arises from annotation errors or incomplete metadata. This work investigates the question: can Multimodal Large Language Models generate missing modalities for products in e-commerce scenarios? We propose the Missing Modality Product Completion Benchmark (MMPCBench), which consists of two sub-benchmarks: a Content Quality Completion Benchmark and a Recommendation Benchmark. We evaluate six state-of-the-art MLLMs from the Qwen2.5-VL and Gemma-3 model families across nine real-world e-commerce categories, focusing
arXiv Detail & Related papers (2026-01-27T16:13:26Z)
PixRec: Leveraging Visual Context for Next-Item Prediction in Sequential Recommendation [3.437656066916039]
PixRec is a vision-language framework that incorporates both textual attributes and product images into the recommendation pipeline. Our work outlines future directions for scaling multi-modal recommenders training, enhancing visual-text feature fusion, and evaluating inference-time performance.
arXiv Detail & Related papers (2026-01-10T06:52:58Z)
Adapting Large VLMs with Iterative and Manual Instructions for Generative Low-light Enhancement [41.66776033752888]
Most low-light image enhancement methods rely on pre-trained model priors, low-light inputs, or both. We propose VLM-IMI, a novel framework that leverages large vision-language models with iterative and manual instructions. VLM-IMI incorporates textual descriptions of the desired normal-light content as enhancement cues, enabling semantically informed restoration.
arXiv Detail & Related papers (2025-07-24T03:35:20Z)
VL-CLIP: Enhancing Multimodal Recommendations via Visual Grounding and LLM-Augmented CLIP Embeddings [11.209519424876762]
Multimodal learning plays a critical role in e-commerce recommendation platforms today. Existing vision-language models, such as CLIP, face key challenges in e-commerce recommendation systems. We propose a framework, VL-CLIP, that enhances CLIP embeddings by integrating Visual Grounding for fine-grained visual understanding.
arXiv Detail & Related papers (2025-07-22T23:45:43Z)
Learning Item Representations Directly from Multimodal Features for Effective Recommendation [51.49251689107541]
Multimodal recommender systems predominantly leverage Bayesian Personalized Ranking (BPR) optimization to learn item representations. We propose a novel model (i.e., LIRDRec) that learns item representations directly from multimodal features to augment recommendation performance.
arXiv Detail & Related papers (2025-05-08T05:42:22Z)
Aligning Vision to Language: Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning [10.761218096540976]
Multimodal reasoning in Large Language Models (LLMs) struggles with incomplete knowledge and hallucination artifacts. We propose Vision-align-to-Language integrated Knowledge Graph (VaLiK), a novel approach for constructing Multimodal Knowledge Graphs.
arXiv Detail & Related papers (2025-03-17T09:31:14Z)
Training Large Recommendation Models via Graph-Language Token Alignment [53.3142545812349]
We propose a novel framework to train Large Recommendation models via Graph-Language Token Alignment. By aligning item and user nodes from the interaction graph with pretrained LLM tokens, GLTA effectively leverages the reasoning abilities of LLMs. Furthermore, we introduce Graph-Language Logits Matching (GLLM) to optimize token alignment for end-to-end item prediction.
arXiv Detail & Related papers (2025-02-26T02:19:10Z)
Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data. We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation. Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization [70.11167263638562]
Social relation reasoning aims to identify relation categories such as friends, spouses, and colleagues from images. We first present a simple yet well-crafted framework named SocialGPT, which combines the perception capability of Vision Foundation Models (VFMs) and the reasoning capability of Large Language Models (LLMs) within a modular framework.
arXiv Detail & Related papers (2024-10-28T18:10:26Z)
ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a novel multimodal data augmentation method via knowledge-guided manipulation of visual attributes. ARMADA is a novel multimodal data generation framework that extracts knowledge-grounded attributes from symbolic KBs for semantically consistent yet distinctive image-text pair generation. This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z)
Fine-tuning Multimodal Large Language Models for Product Bundling [53.01642741096356]
We introduce Bundle-MLLM, a novel framework that fine-tunes large language models (LLMs) through a hybrid item tokenization approach. Specifically, we integrate textual, media, and relational data into a unified tokenization, introducing a soft separation token to distinguish between textual and non-textual tokens. We propose a progressive optimization strategy that fine-tunes LLMs for disentangled objectives: 1) learning bundle patterns and 2) enhancing multimodal semantic understanding specific to product bundling.
arXiv Detail & Related papers (2024-07-16T13:30:14Z)
A Multimodal In-Context Tuning Approach for E-Commerce Product Description Generation [47.70824723223262]
We propose a new setting for generating product descriptions from images, augmented by marketing keywords. We present a simple and effective Multimodal In-Context Tuning approach, named ModICT, which introduces a similar product sample as the reference. Experiments demonstrate that ModICT significantly improves the accuracy (by up to 3.3% on Rouge-L) and diversity (by up to 9.4% on D-5) of generated results compared to conventional methods.
arXiv Detail & Related papers (2024-02-21T07:38:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.