A Multimodal In-Context Tuning Approach for E-Commerce Product
Description Generation
- URL: http://arxiv.org/abs/2402.13587v2
- Date: Thu, 7 Mar 2024 11:29:50 GMT
- Title: A Multimodal In-Context Tuning Approach for E-Commerce Product
Description Generation
- Authors: Yunxin Li, Baotian Hu, Wenhan Luo, Lin Ma, Yuxin Ding, Min Zhang
- Abstract summary: We propose a new setting for generating product descriptions from images, augmented by marketing keywords.
We present a simple and effective Multimodal In-Context Tuning approach, named ModICT, which introduces a similar product sample as the reference.
Experiments demonstrate that ModICT significantly improves the accuracy (by up to 3.3% on Rouge-L) and diversity (by up to 9.4% on D-5) of generated results compared to conventional methods.
- Score: 47.70824723223262
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this paper, we propose a new setting for generating product descriptions
from images, augmented by marketing keywords. It leverages the combined power
of visual and textual information to create descriptions that are more tailored
to the unique features of products. For this setting, previous methods utilize
visual and textual encoders to encode the image and keywords and employ a
language model-based decoder to generate the product description. However, the
generated descriptions are often inaccurate and generic because same-category
products share similar copywriting, and optimizing the overall framework on
large-scale samples leads models to concentrate on common words while ignoring
distinctive product features. To alleviate this issue, we present a simple and effective
Multimodal In-Context Tuning approach, named ModICT, which introduces a similar
product sample as the reference and utilizes the in-context learning capability
of language models to produce the description. During training, we keep the
visual encoder and language model frozen, focusing on optimizing the modules
responsible for creating multimodal in-context references and dynamic prompts.
This approach preserves the language generation prowess of large language
models (LLMs), facilitating a substantial increase in description diversity. To
assess the effectiveness of ModICT across various language model scales and
types, we collect data from three distinct product categories within the
E-commerce domain. Extensive experiments demonstrate that ModICT significantly
improves the accuracy (by up to 3.3% on Rouge-L) and diversity (by up to 9.4%
on D-5) of generated results compared to conventional methods. Our findings
underscore the potential of ModICT as a valuable tool for enhancing automatic
generation of product descriptions in a wide range of applications. Code is at:
https://github.com/HITsz-TMG/Multimodal-In-Context-Tuning
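The abstract sketches the training recipe at a high level: the visual encoder and the language model stay frozen, a similar product (image plus description) serves as the multimodal in-context reference, and only small connector and dynamic-prompt modules are optimized. The PyTorch sketch below illustrates that setup under stated assumptions; the module names, dimensions, and input layout are illustrative guesses rather than the authors' implementation (the official code is in the linked repository).
```python
# Minimal sketch of a ModICT-style setup, not the authors' implementation:
# a frozen visual encoder and a frozen language model are bridged by small
# trainable modules that turn a similar product (image + description) into an
# in-context reference and a dynamic prompt. Names and dimensions are assumed.
import torch
import torch.nn as nn


class ModICTSketch(nn.Module):
    def __init__(self, visual_encoder: nn.Module, language_model: nn.Module,
                 vis_dim: int = 768, lm_dim: int = 1024, n_prompt: int = 8):
        super().__init__()
        # Frozen backbones: only the connector and prompt modules are trained.
        self.visual_encoder = visual_encoder.eval()
        self.language_model = language_model.eval()
        for p in self.visual_encoder.parameters():
            p.requires_grad = False
        for p in self.language_model.parameters():
            p.requires_grad = False

        # Trainable pieces: project image features into the LM embedding space
        # and generate a small set of prompt vectors conditioned on them.
        self.vis_to_lm = nn.Linear(vis_dim, lm_dim)
        self.prompt_generator = nn.Sequential(
            nn.Linear(lm_dim, lm_dim), nn.GELU(),
            nn.Linear(lm_dim, n_prompt * lm_dim),
        )
        self.n_prompt, self.lm_dim = n_prompt, lm_dim

    def build_inputs(self, target_image, ref_image, ref_text_emb, keyword_emb):
        """Assemble the multimodal in-context sequence fed to the frozen LM.

        ref_text_emb / keyword_emb: already-embedded reference description and
        marketing keywords, each of shape (batch, seq_len, lm_dim).
        """
        tgt_vis = self.vis_to_lm(self.visual_encoder(target_image))  # (B, T, D)
        ref_vis = self.vis_to_lm(self.visual_encoder(ref_image))     # (B, T, D)

        # Dynamic prompt conditioned on the target image's pooled features.
        pooled = tgt_vis.mean(dim=1)
        prompts = self.prompt_generator(pooled).view(-1, self.n_prompt, self.lm_dim)

        # In-context layout: [reference image, reference description] acts as a
        # demonstration, followed by [prompt, target image, keywords].
        return torch.cat([ref_vis, ref_text_emb, prompts, tgt_vis, keyword_emb], dim=1)


if __name__ == "__main__":
    # Toy stand-ins for the frozen backbones, purely for shape checking.
    class ToyVisualEncoder(nn.Module):
        def forward(self, images):          # (B, 3, H, W) -> (B, 16, 768)
            return torch.randn(images.size(0), 16, 768)

    class ToyLM(nn.Module):
        def forward(self, inputs_embeds):   # stands in for LM hidden states
            return inputs_embeds

    model = ModICTSketch(ToyVisualEncoder(), ToyLM())
    seq = model.build_inputs(
        target_image=torch.randn(2, 3, 224, 224),
        ref_image=torch.randn(2, 3, 224, 224),
        ref_text_emb=torch.randn(2, 32, 1024),
        keyword_emb=torch.randn(2, 12, 1024),
    )
    hidden = model.language_model(seq)      # the frozen LM would decode from here
    print(seq.shape, hidden.shape)          # torch.Size([2, 84, 1024]) twice
```
In this sketch only vis_to_lm and prompt_generator receive gradients, so the frozen LLM keeps its general language-generation ability, which is the property the abstract credits for the gain in description diversity.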
Related papers
- Triple Modality Fusion: Aligning Visual, Textual, and Graph Data with Large Language Models for Multi-Behavior Recommendations [12.154043062308201] (arXiv, 2024-10-16)
This paper introduces a novel framework for multi-behavior recommendations, leveraging the fusion of three modalities: visual, textual, and graph data.
Our proposed model, called Triple Modality Fusion (TMF), utilizes the power of large language models (LLMs) to align and integrate these three modalities.
Extensive experiments demonstrate the effectiveness of our approach in improving recommendation accuracy.
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744] (arXiv, 2024-06-10)
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) that is good at understanding textual content within images.
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792] (arXiv, 2024-03-05)
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse-modality referring segmentation and region-level referring expression generation.
- Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models [50.07056960586183] (arXiv, 2023-08-25)
We propose Position-enhanced Visual Instruction Tuning (PVIT) to extend the functionality of Multimodal Large Language Models (MLLMs).
This integration promotes a more detailed comprehension of images for the MLLM.
We present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model.
- MMAPS: End-to-End Multi-Grained Multi-Modal Attribute-Aware Product Summarization [93.5217515566437] (arXiv, 2023-08-22)
Multi-modal Product Summarization (MPS) aims to increase customers' desire to purchase by highlighting product characteristics.
Existing MPS methods can produce promising results, but they still lack end-to-end product summarization.
We propose an end-to-end multi-modal attribute-aware product summarization method (MMAPS) for generating high-quality product summaries in e-commerce.
- UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628] (arXiv, 2023-06-01)
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
- Boosting Multi-Modal E-commerce Attribute Value Extraction via Unified Learning Scheme and Dynamic Range Minimization [14.223683006262151] (arXiv, 2022-07-15)
We propose a novel approach to boost multi-modal e-commerce attribute value extraction via a unified learning scheme and dynamic range minimization.
Experiments on popular multi-modal e-commerce benchmarks show that our approach achieves superior performance over other state-of-the-art techniques.
- Fusion Models for Improved Visual Captioning [18.016295296424413] (arXiv, 2020-10-28)
This paper proposes a generic multimodal model fusion framework for caption generation and emendation.
We employ the same fusion strategies to integrate a pretrained Masked Language Model (MLM) with a visual captioning model, viz. Show, Attend, and Tell.
Our caption emendation experiments on three benchmark image captioning datasets, viz. Flickr8k, Flickr30k, and MSCOCO, show improvements over the baseline.