Related papers: Boosting Multi-Modal E-commerce Attribute Value Extraction via Unified Learning Scheme and Dynamic Range Minimization

Boosting Multi-Modal E-commerce Attribute Value Extraction via Unified Learning Scheme and Dynamic Range Minimization

URL: http://arxiv.org/abs/2207.07278v2
Date: Thu, 6 Apr 2023 15:16:59 GMT
Title: Boosting Multi-Modal E-commerce Attribute Value Extraction via Unified Learning Scheme and Dynamic Range Minimization
Authors: Mengyin Liu, Chao Zhu, Hongyu Gao, Weibo Gu, Hongfa Wang, Wei Liu, Xu-cheng Yin
Abstract summary: We propose a novel approach to boost multi-modal e-commerce attribute value extraction via unified learning scheme and dynamic range minimization. Experiments on the popular multi-modal e-commerce benchmarks show that our approach achieves superior performance over the other state-of-the-art techniques.
Score: 14.223683006262151
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the prosperity of e-commerce industry, various modalities, e.g., vision and language, are utilized to describe product items. It is an enormous challenge to understand such diversified data, especially via extracting the attribute-value pairs in text sequences with the aid of helpful image regions. Although a series of previous works have been dedicated to this task, there remain seldomly investigated obstacles that hinder further improvements: 1) Parameters from up-stream single-modal pretraining are inadequately applied, without proper jointly fine-tuning in a down-stream multi-modal task. 2) To select descriptive parts of images, a simple late fusion is widely applied, regardless of priori knowledge that language-related information should be encoded into a common linguistic embedding space by stronger encoders. 3) Due to diversity across products, their attribute sets tend to vary greatly, but current approaches predict with an unnecessary maximal range and lead to more potential false positives. To address these issues, we propose in this paper a novel approach to boost multi-modal e-commerce attribute value extraction via unified learning scheme and dynamic range minimization: 1) Firstly, a unified scheme is designed to jointly train a multi-modal task with pretrained single-modal parameters. 2) Secondly, a text-guided information range minimization method is proposed to adaptively encode descriptive parts of each modality into an identical space with a powerful pretrained linguistic model. 3) Moreover, a prototype-guided attribute range minimization method is proposed to first determine the proper attribute set of the current product, and then select prototypes to guide the prediction of the chosen attributes. Experiments on the popular multi-modal e-commerce benchmarks show that our approach achieves superior performance over the other state-of-the-art techniques.

Related papers

IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification [60.38841251693781]
We propose a novel framework to generate robust multi-modal object ReIDs. Our framework uses Modal Prefixes and InverseNet to integrate multi-modal information with semantic guidance from inverted text. Experiments on three multi-modal object ReID benchmarks demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2025-03-13T13:00:31Z)
M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning [90.75075886543404]
Multimodal Large Language Models (MLLMs) demonstrate remarkable performance across a wide range of domains. In this work, we introduce a novel Multimodal Prompt Tuning (M$2$PT) approach for efficient instruction tuning of MLLMs.
arXiv Detail & Related papers (2024-09-24T01:40:24Z)
Fine-tuning Multimodal Large Language Models for Product Bundling [53.01642741096356]
We introduce Bundle-MLLM, a novel framework that fine-tunes large language models (LLMs) through a hybrid item tokenization approach. Specifically, we integrate textual, media, and relational data into a unified tokenization, introducing a soft separation token to distinguish between textual and non-textual tokens. We propose a progressive optimization strategy that fine-tunes LLMs for disentangled objectives: 1) learning bundle patterns and 2) enhancing multimodal semantic understanding specific to product bundling.
arXiv Detail & Related papers (2024-07-16T13:30:14Z)
Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding. We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL. UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks. Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment. We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
MMAPS: End-to-End Multi-Grained Multi-Modal Attribute-Aware Product Summarization [93.5217515566437]
Multi-modal Product Summarization (MPS) aims to increase customers' desire to purchase by highlighting product characteristics. Existing MPS methods can produce promising results, but they still lack end-to-end product summarization. We propose an end-to-end multi-modal attribute-aware product summarization method (MMAPS) for generating high-quality product summaries in e-commerce.
arXiv Detail & Related papers (2023-08-22T11:00:09Z)
Knowledge Perceived Multi-modal Pretraining in E-commerce [12.012793707741562]
Current multi-modal pretraining methods for image and text modalities lack robustness in the face of modality-missing and modality-noise. We propose K3M, which introduces knowledge modality in multi-modal pretraining to correct the noise and supplement the missing of image and text modalities.
arXiv Detail & Related papers (2021-08-20T08:01:28Z)
Automatic Validation of Textual Attribute Values in E-commerce Catalog by Learning with Limited Labeled Data [61.789797281676606]
We propose a novel meta-learning latent variable approach, called MetaBridge. It can learn transferable knowledge from a subset of categories with limited labeled data. It can capture the uncertainty of never-seen categories with unlabeled data.
arXiv Detail & Related papers (2020-06-15T21:31:05Z)
Adversarial Multimodal Representation Learning for Click-Through Rate Prediction [16.10640369157054]
We propose a novel Multimodal Adversarial Representation Network (MARN) for the Click-Through Rate (CTR) prediction task. A multimodal attention network first calculates the weights of multiple modalities for each item according to its modality-specific features. A multimodal adversarial network learns modality-in representations where a double-discriminators strategy is introduced.
arXiv Detail & Related papers (2020-03-07T15:50:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.