MMAPS: End-to-End Multi-Grained Multi-Modal Attribute-Aware Product
Summarization
- URL: http://arxiv.org/abs/2308.11351v2
- Date: Fri, 8 Mar 2024 03:07:18 GMT
- Title: MMAPS: End-to-End Multi-Grained Multi-Modal Attribute-Aware Product
Summarization
- Authors: Tao Chen, Ze Lin, Hui Li, Jiayi Ji, Yiyi Zhou, Guanbin Li and Rongrong
Ji
- Abstract summary: Multi-modal Product Summarization (MPS) aims to increase customers' desire to purchase by highlighting product characteristics.
Existing MPS methods can produce promising results, but they still lack end-to-end product summarization.
We propose an end-to-end multi-modal attribute-aware product summarization method (MMAPS) for generating high-quality product summaries in e-commerce.
- Score: 93.5217515566437
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given the long textual product information and the product image, Multi-modal
Product Summarization (MPS) aims to increase customers' desire to purchase by
highlighting product characteristics with a short textual summary. Existing MPS
methods produce promising results but still lack 1) end-to-end product
summarization, 2) multi-grained multi-modal modeling, and 3) multi-modal
attribute modeling. To improve MPS, we propose an
end-to-end multi-grained multi-modal attribute-aware product summarization
method (MMAPS) for generating high-quality product summaries in e-commerce.
MMAPS jointly models product attributes and generates product summaries. We
design several multi-grained multi-modal tasks to better guide the multi-modal
learning of MMAPS. Furthermore, we model product attributes based on both text
and image modalities so that multi-modal product characteristics can be
manifested in the generated summaries. Extensive experiments on a real
large-scale Chinese e-commerce dataset demonstrate that our model outperforms
state-of-the-art product summarization methods w.r.t. several summarization
metrics. Our code is publicly available at: https://github.com/KDEGroup/MMAPS.
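The abstract outlines the design at a high level: a shared multi-modal encoder feeds both an attribute-prediction head and a summary decoder, and the two objectives are trained jointly end to end. Below is a minimal PyTorch sketch of such a joint objective; all module names, dimensions, and the toy data are illustrative assumptions, not the authors' implementation (their code is at the GitHub link above).

```python
# Hedged sketch: one multi-modal encoder shared by an attribute head and a
# summary decoder, trained jointly. Names/dims are assumptions, not MMAPS code.
import torch
import torch.nn as nn

class JointAttrSummarizer(nn.Module):
    def __init__(self, vocab=30000, d=512, n_attrs=40, img_dim=2048):
        super().__init__()
        self.tok = nn.Embedding(vocab, d)
        self.img_proj = nn.Linear(img_dim, d)  # map image region features into text space
        self.enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=2)
        self.dec = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d, nhead=8, batch_first=True), num_layers=2)
        self.attr_head = nn.Linear(d, n_attrs)    # multi-label attribute prediction
        self.lm_head = nn.Linear(d, vocab)        # summary token prediction

    def forward(self, text_ids, img_feats, summary_ids):
        src = torch.cat([self.tok(text_ids), self.img_proj(img_feats)], dim=1)
        mem = self.enc(src)                        # fused multi-modal memory
        attr_logits = self.attr_head(mem.mean(1))  # pooled memory -> attributes
        out = self.dec(self.tok(summary_ids), mem)
        return attr_logits, self.lm_head(out)

model = JointAttrSummarizer()
text = torch.randint(0, 30000, (2, 64))   # toy product description tokens
img = torch.randn(2, 36, 2048)            # toy image region features
summ = torch.randint(0, 30000, (2, 20))   # toy summary tokens
attrs, tokens = model(text, img, summ)
loss = (nn.BCEWithLogitsLoss()(attrs, torch.rand(2, 40).round())
        + nn.CrossEntropyLoss()(tokens.transpose(1, 2), summ))
loss.backward()  # both losses update the shared encoder end to end
```

The point is the shared encoder: gradients from both the attribute loss and the summary loss shape the same multi-modal representation, which is what "jointly models product attributes and generates product summaries" suggests.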
Related papers
- Fine-tuning Multimodal Large Language Models for Product Bundling [53.01642741096356]
We introduce Bundle-MLLM, a novel framework that fine-tunes large language models (LLMs) through a hybrid item tokenization approach.
Specifically, we integrate textual, media, and relational data into a unified tokenization, introducing a soft separation token to distinguish between textual and non-textual tokens.
We propose a progressive optimization strategy that fine-tunes LLMs for disentangled objectives: 1) learning bundle patterns and 2) enhancing multimodal semantic understanding specific to product bundling.
arXiv Detail & Related papers (2024-07-16T13:30:14Z)
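The "soft separation token" in Bundle-MLLM's hybrid tokenization can be pictured as a learnable embedding spliced between text tokens and projected non-textual features before the sequence enters the LLM. A toy sketch under that assumption; all names and dimensions are hypothetical:

```python
# Hypothetical hybrid token sequence with a learnable "soft separation token"
# between textual and non-textual embeddings. Not Bundle-MLLM's actual code.
import torch
import torch.nn as nn

d_model = 768
text_emb = nn.Embedding(32000, d_model)               # stand-in for the LLM's embeddings
sep_token = nn.Parameter(torch.randn(1, 1, d_model))  # learnable soft separator
media_proj = nn.Linear(512, d_model)                  # project media features to token space

text_ids = torch.randint(0, 32000, (1, 16))
media_feats = torch.randn(1, 4, 512)                  # e.g., image/relational item features

# Unified sequence: [text tokens] [SEP] [projected media tokens]
seq = torch.cat([text_emb(text_ids), sep_token, media_proj(media_feats)], dim=1)
print(seq.shape)  # (1, 16 + 1 + 4, 768) -- would be fed to the LLM as input embeddings
```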
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
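One way to read "feature fusion at multiple scales" is: pool both modality feature maps to several resolutions, fuse at each scale, and merge the results back at full resolution. A rough sketch of that idea, not U3M's actual architecture:

```python
# Hypothetical multiscale fusion of two modality feature maps so that both
# global and local structure contribute. An assumed simplification of U3M.
import torch
import torch.nn.functional as F

def multiscale_fuse(feat_a, feat_b, scales=(1, 2, 4)):
    """feat_a, feat_b: (B, C, H, W) feature maps from two modalities."""
    h, w = feat_a.shape[2], feat_a.shape[3]
    fused = []
    for s in scales:
        a = F.adaptive_avg_pool2d(feat_a, (h // s, w // s))
        b = F.adaptive_avg_pool2d(feat_b, (h // s, w // s))
        f = a + b                                 # symmetric: no modality is privileged
        fused.append(F.interpolate(f, size=(h, w), mode="bilinear",
                                   align_corners=False))
    return torch.stack(fused).mean(0)             # merge scales at full resolution

rgb = torch.randn(2, 64, 32, 32)
depth = torch.randn(2, 64, 32, 32)
print(multiscale_fuse(rgb, depth).shape)          # (2, 64, 32, 32)
```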
- MM-GEF: Multi-modal representation meet collaborative filtering [43.88159639990081]
We propose MM-GEF (Multi-Modal recommendation with Graph Early-Fusion), a graph-based item structure enhancement method.
MM-GEF learns refined item representations by injecting structural information obtained from both multi-modal and collaborative signals.
arXiv Detail & Related papers (2023-08-14T15:47:36Z)
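Graph early-fusion, as summarized above, can be approximated as: fuse modality features first, build a kNN item-item graph from the fused features, then propagate over it to inject structural information. A toy sketch with assumed details, not MM-GEF's code:

```python
# Hypothetical graph early-fusion: fuse modalities, then build and propagate
# over an item-item similarity graph. Sizes and k are illustrative.
import torch
import torch.nn.functional as F

n_items, d = 100, 64
txt = F.normalize(torch.randn(n_items, d), dim=1)   # text item features
img = F.normalize(torch.randn(n_items, d), dim=1)   # image item features

fused = F.normalize(txt + img, dim=1)               # early fusion, before the graph
sim = fused @ fused.t()                             # cosine similarity
idx = sim.topk(k=10, dim=1).indices
adj = torch.zeros_like(sim).scatter_(1, idx, 1.0)   # kNN item graph
adj = adj / adj.sum(dim=1, keepdim=True)            # row-normalize

item_repr = adj @ fused                             # inject structural information
```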
- Align and Attend: Multimodal Summarization with Dual Contrastive Losses [57.83012574678091]
The goal of multimodal summarization is to extract the most important information from different modalities to form output summaries.
Existing methods fail to leverage the temporal correspondence between different modalities and ignore the intrinsic correlation between different samples.
We introduce Align and Attend Multimodal Summarization (A2Summ), a unified multimodal transformer-based model which can effectively align and attend the multimodal input.
arXiv Detail & Related papers (2023-03-13T17:01:42Z)
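"Dual contrastive losses" plausibly take an InfoNCE form at two granularities: one loss contrasts paired samples across the batch (the correlation between samples), another contrasts time-aligned segments within a sample (the temporal correspondence between modalities). A sketch under that reading; A2Summ's exact formulation may differ:

```python
# Hypothetical dual contrastive objective in InfoNCE form: inter-sample pairs
# across the batch plus intra-sample, time-aligned segment pairs.
import torch
import torch.nn.functional as F

def info_nce(q, k, temperature=0.07):
    """q, k: (N, d) paired embeddings; positives sit on the diagonal."""
    logits = F.normalize(q, dim=1) @ F.normalize(k, dim=1).t() / temperature
    return F.cross_entropy(logits, torch.arange(q.size(0)))

video = torch.randn(8, 256)       # one embedding per sample
text = torch.randn(8, 256)
inter_loss = info_nce(video, text)   # inter-sample: match pairs in the batch

v_segs = torch.randn(20, 256)     # time-aligned segments within one sample
t_segs = torch.randn(20, 256)
intra_loss = info_nce(v_segs, t_segs)  # intra-sample: align segments in time

loss = inter_loss + intra_loss    # the two losses are optimized jointly
```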
- Boosting Multi-Modal E-commerce Attribute Value Extraction via Unified Learning Scheme and Dynamic Range Minimization [14.223683006262151]
We propose a novel approach to boost multi-modal e-commerce attribute value extraction via unified learning scheme and dynamic range minimization.
Experiments on the popular multi-modal e-commerce benchmarks show that our approach achieves superior performance over the other state-of-the-art techniques.
arXiv Detail & Related papers (2022-07-15T03:58:04Z)
- Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-modal Pretraining [108.86502855439774]
We investigate a more realistic setting that aims to perform weakly-supervised multi-modal instance-level product retrieval.
We contribute Product1M, one of the largest multi-modal cosmetic datasets for real-world instance-level retrieval.
We propose a novel model named Cross-modal contrAstive Product Transformer for instance-level prodUct REtrieval (CAPTURE).
arXiv Detail & Related papers (2021-07-30T12:11:24Z)
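Once contrastive pretraining places query images and catalog items in a shared embedding space, instance-level retrieval reduces to nearest-neighbor ranking. A toy illustration, with random embeddings standing in for trained ones:

```python
# Hypothetical instance-level retrieval with cross-modal embeddings: embed a
# query image, rank catalog items by cosine similarity. Sizes are illustrative.
import torch
import torch.nn.functional as F

gallery = F.normalize(torch.randn(1000, 256), dim=1)  # catalog item embeddings
query = F.normalize(torch.randn(1, 256), dim=1)       # e.g., a multi-product image

scores = (query @ gallery.t()).squeeze(0)             # cosine similarity to all items
top5 = scores.topk(5).indices                         # retrieved instance ids
print(top5.tolist())
```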
- Mining Latent Structures for Multimedia Recommendation [46.70109406399858]
We propose a LATent sTructure mining method for multImodal reCommEndation, which we term LATTICE for brevity.
We learn item-item structures for each modality and aggregate multiple modalities to obtain latent item graphs.
Based on the learned latent graphs, we perform graph convolutions to explicitly inject high-order item affinities into item representations.
arXiv Detail & Related papers (2021-04-19T03:50:24Z)
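The recipe above is concrete enough to sketch: a kNN item graph per modality, an averaged latent graph, then plain graph convolutions over collaborative item embeddings. The hyperparameters and aggregation rule here are assumptions, not LATTICE's exact design:

```python
# Hypothetical latent structure mining: per-modality kNN graphs, averaged into
# one latent item graph, then simple graph convolutions over item embeddings.
import torch
import torch.nn.functional as F

def knn_graph(feats, k=10):
    sim = F.normalize(feats, dim=1) @ F.normalize(feats, dim=1).t()
    idx = sim.topk(k, dim=1).indices
    adj = torch.zeros_like(sim).scatter_(1, idx, 1.0)
    return adj / adj.sum(dim=1, keepdim=True)         # row-normalized adjacency

n_items = 200
modal_feats = {"text": torch.randn(n_items, 64), "image": torch.randn(n_items, 64)}
latent_adj = torch.stack([knn_graph(f) for f in modal_feats.values()]).mean(0)

item_emb = torch.randn(n_items, 32)                   # collaborative item embeddings
for _ in range(2):                                    # 2-layer graph convolution
    item_emb = latent_adj @ item_emb                  # inject high-order affinities
```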
- Multimodal Joint Attribute Prediction and Value Extraction for E-commerce Product [40.46223408546036]
Product attribute values are essential in many e-commerce scenarios, such as customer service robots, product recommendations, and product retrieval.
In the real world, however, the attribute values of a product are usually incomplete and vary over time, which greatly hinders practical applications.
We propose a multimodal method to jointly predict product attributes and extract values from textual product descriptions with the help of the product images.
arXiv Detail & Related papers (2020-09-15T15:10:51Z)
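A minimal sketch of the joint setup described above: a shared text encoder feeds a sentence-level attribute classifier (conditioned on a global image feature) and a token-level tagger that extracts value spans. The tag-set size, fusion scheme, and all names are illustrative assumptions:

```python
# Hypothetical joint attribute prediction (sentence level) and value
# extraction (token-level BIO tagging), conditioned on image features.
import torch
import torch.nn as nn

class JointAttrValueModel(nn.Module):
    def __init__(self, vocab=20000, d=256, n_attrs=30, n_tags=61, img_dim=2048):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.img_proj = nn.Linear(img_dim, d)
        self.encoder = nn.GRU(d, d, batch_first=True, bidirectional=True)
        self.attr_head = nn.Linear(2 * d + d, n_attrs)  # text + image -> attributes
        self.tag_head = nn.Linear(2 * d, n_tags)        # BIO tags over tokens

    def forward(self, text_ids, img_feat):
        h, _ = self.encoder(self.emb(text_ids))         # (B, T, 2d) token states
        img = self.img_proj(img_feat)                   # (B, d) global image feature
        attr_logits = self.attr_head(torch.cat([h.mean(1), img], dim=1))
        tag_logits = self.tag_head(h)                   # value spans per token
        return attr_logits, tag_logits

m = JointAttrValueModel()
a, t = m(torch.randint(0, 20000, (2, 40)), torch.randn(2, 2048))
print(a.shape, t.shape)  # (2, 30) and (2, 40, 61)
```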
This list is automatically generated from the titles and abstracts of the papers on this site.