Related papers: MIND: Multimodal Shopping Intention Distillation from Large Vision-language Models for E-commerce Purchase Understanding

URL: http://arxiv.org/abs/2406.10701v1
Date: Sat, 15 Jun 2024 17:56:09 GMT
Title: MIND: Multimodal Shopping Intention Distillation from Large Vision-language Models for E-commerce Purchase Understanding
Authors: Baixuan Xu, Weiqi Wang, Haochen Shi, Wenxuan Ding, Huihao Jing, Tianqing Fang, Jiaxin Bai, Long Chen, Yangqiu Song
Abstract summary: MIND is a framework that infers purchase intentions from multimodal product metadata and prioritizes human-centric ones. Using Amazon Review data, we create a multimodal intention knowledge base, which contains 1,264,441 intentions. Our obtained intentions significantly enhance large language models in two intention comprehension tasks.
Score: 45.47495643376656
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Improving user experience and providing personalized search results in E-commerce platforms heavily rely on understanding purchase intention. However, existing methods for acquiring large-scale intentions bank on distilling large language models with human annotation for verification. Such an approach tends to generate product-centric intentions, overlook valuable visual information from product images, and incur high costs for scalability. To address these issues, we introduce MIND, a multimodal framework that allows Large Vision-Language Models (LVLMs) to infer purchase intentions from multimodal product metadata and prioritize human-centric ones. Using Amazon Review data, we apply MIND and create a multimodal intention knowledge base, which contains 1,264,441 intentions derived from 126,142 co-buy shopping records across 107,215 products. Extensive human evaluations demonstrate the high plausibility and typicality of our obtained intentions and validate the effectiveness of our distillation framework and filtering mechanism. Additional experiments reveal that our obtained intentions significantly enhance large language models in two intention comprehension tasks.

Related papers

Research on E-Commerce Long-Tail Product Recommendation Mechanism Based on Large-Scale Language Models [7.792622257477251]
We propose a novel long-tail product recommendation mechanism that integrates product text descriptions and user behavior sequences using a large-scale language model (LLM). Our work highlights the potential of LLMs in interpreting product content and user intent, offering a promising direction for future e-commerce recommendation systems.
arXiv Detail & Related papers (2025-05-31T19:17:48Z)
OCC-MLLM-CoT-Alpha: Towards Multi-stage Occlusion Recognition Based on Large Language Models via 3D-Aware Supervision and Chain-of-Thoughts Guidance [3.832135091367811]
OCC-MLLM-CoT-Alpha is a multi-modal large vision language framework that integrates 3D-aware supervision and Chain-of-Thoughts guidance. In the evaluation, the proposed methods demonstrate decision score improvement of 15.75%, 15.30%, 16.98%, 14.62%, and 4.42%, 3.63%, 6.94%, 10.70% for two settings of a variety of state-of-the-art models.
arXiv Detail & Related papers (2025-04-07T07:15:26Z)
Personalized Multimodal Large Language Models: A Survey [127.9521218125761]
Multimodal Large Language Models (MLLMs) have become increasingly important due to their state-of-the-art performance and ability to integrate multiple data modalities. This paper presents a comprehensive survey on personalized multimodal large language models, focusing on their architecture, training methods, and applications.
arXiv Detail & Related papers (2024-12-03T03:59:03Z)
Retrieve, Annotate, Evaluate, Repeat: Leveraging Multimodal LLMs for Large-Scale Product Retrieval Evaluation [3.670782697615276]
Large Language Models (LLMs) have the potential to address this scaling issue. We propose a framework for assessing the product search engines in a large-scale e-commerce setting. Our method, validated through deployment on a large e-commerce platform, demonstrates comparable quality to human annotations.
arXiv Detail & Related papers (2024-09-18T10:30:50Z)
Image Score: Learning and Evaluating Human Preferences for Mercari Search [2.1555050262085027]
Large Language Models (LLMs) are being actively studied and used for data labelling tasks. We propose a cost-efficient LLM-driven approach for assessing and predicting image quality in e-commerce settings. We show that our LLM-produced labels correlate with user behavior on Mercari.
arXiv Detail & Related papers (2024-08-21T05:30:06Z)
LLAVADI: What Matters For Multimodal Large Language Models Distillation [77.73964744238519]
In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch. Our studies involve training strategies, model choices, and distillation algorithms in the knowledge distillation process. By evaluating different benchmarks and proper strategy, even a 2.7B small-scale model can perform on par with larger models with 7B or 13B parameters.
arXiv Detail & Related papers (2024-07-28T06:10:47Z)
IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Language Models in E-commerce [71.37481473399559]
In this paper, we present IntentionQA, a benchmark to evaluate LMs' comprehension of purchase intentions in E-commerce. IntentionQA consists of 4,360 carefully curated problems across three difficulty levels, constructed using an automated pipeline. Human evaluations demonstrate the high quality and low false-negative rate of our benchmark.
arXiv Detail & Related papers (2024-06-14T16:51:21Z)
Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model [82.93634081255942]
We propose a vision-language connector that enables MLLMs to achieve high accuracy while maintaining low cost. We first reveal the existence of the visual anchors in Vision Transformer and propose a cost-effective search algorithm to extract them. We introduce the Anchor Former (AcFormer), a novel vision-language connector designed to leverage the rich prior knowledge obtained from these visual anchors during pretraining.
arXiv Detail & Related papers (2024-05-28T04:23:00Z)
ItemSage: Learning Product Embeddings for Shopping Recommendations at Pinterest [60.841761065439414]
At Pinterest, we build a single set of product embeddings called ItemSage to provide relevant recommendations in all shopping use cases. This approach has led to significant improvements in engagement and conversion metrics, while reducing both infrastructure and maintenance cost.
arXiv Detail & Related papers (2022-05-24T02:28:58Z)
Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-modal Pretraining [108.86502855439774]
We investigate a more realistic setting that aims to perform weakly-supervised multi-modal instance-level product retrieval. We contribute Product1M, one of the largest multi-modal cosmetic datasets for real-world instance-level retrieval. We propose a novel model named Cross-modal contrAstive Product Transformer for instance-level prodUct REtrieval (CAPTURE).
arXiv Detail & Related papers (2021-07-30T12:11:24Z)
A Multimodal Late Fusion Model for E-Commerce Product Classification [7.463657960984954]
In this study, we investigated a multimodal late fusion approach based on text and image modalities to categorize e-commerce products on Rakuten. Specifically, we developed modality-specific state-of-the-art deep neural networks for each input modality, and then fused them at the decision level. Our team, named pa_curis, won 1st place with a macro-F1 of 0.9144 on the final leaderboard.
arXiv Detail & Related papers (2020-08-14T03:46:24Z)

This list is automatically generated from the titles and abstracts of the papers on this site.