A Multimodal Late Fusion Model for E-Commerce Product Classification
- URL: http://arxiv.org/abs/2008.06179v1
- Date: Fri, 14 Aug 2020 03:46:24 GMT
- Title: A Multimodal Late Fusion Model for E-Commerce Product Classification
- Authors: Ye Bi, Shuo Wang, Zhongrui Fan
- Abstract summary: In this study, we investigated a multimodal late fusion approach based on text and image modalities to categorize e-commerce products on Rakuten.
Specifically, we developed modality-specific state-of-the-art deep neural networks for each input modality and then fused them at the decision level.
Our team, pa_curis, won first place with a macro-F1 of 0.9144 on the final leaderboard.
- Score: 7.463657960984954
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The cataloging of product listings is a fundamental problem for most
e-commerce platforms. Despite the promising results obtained by unimodal
methods, their performance can be expected to improve further when multimodal
product information is taken into account. In this study, we investigated a
multimodal late fusion approach based on text and image modalities to
categorize e-commerce products on Rakuten. Specifically, we developed
modality-specific state-of-the-art deep neural networks for each input
modality and then fused them at the decision level. Experimental results on
the Multimodal Product Classification Task of the SIGIR 2020 E-Commerce
Workshop Data Challenge demonstrate the superiority and effectiveness of our
proposed method compared with unimodal and other multimodal methods. Our
team, pa_curis, won first place with a macro-F1 of 0.9144 on the final
leaderboard.
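The abstract describes decision-level (late) fusion: each modality gets its own classifier, and only their output predictions are combined. The sketch below illustrates that pattern in PyTorch with a weighted average of class probabilities; the encoder dimensions, class count, and fusion weights are illustrative assumptions, not the authors' reported configuration.

```python
# Minimal sketch of decision-level (late) fusion for product classification.
# The stand-in encoders, class count, and fusion weights are assumptions,
# not the paper's exact architecture.
import torch
import torch.nn as nn

NUM_CLASSES = 27  # hypothetical number of product categories

class TextClassifier(nn.Module):
    """Stand-in for a text model (e.g., a fine-tuned transformer head)."""
    def __init__(self, hidden=768):
        super().__init__()
        self.head = nn.Linear(hidden, NUM_CLASSES)

    def forward(self, text_features):     # (batch, hidden)
        return self.head(text_features)   # unnormalized logits

class ImageClassifier(nn.Module):
    """Stand-in for an image model (e.g., a CNN backbone with a linear head)."""
    def __init__(self, hidden=2048):
        super().__init__()
        self.head = nn.Linear(hidden, NUM_CLASSES)

    def forward(self, image_features):    # (batch, hidden)
        return self.head(image_features)

def late_fusion(text_logits, image_logits, w_text=0.6, w_image=0.4):
    """Fuse the two unimodal predictions at the decision level via a
    weighted average of class probabilities (weights are assumptions)."""
    p_text = torch.softmax(text_logits, dim=-1)
    p_image = torch.softmax(image_logits, dim=-1)
    return w_text * p_text + w_image * p_image

# Usage: each model is trained separately; only their outputs are combined.
text_model, image_model = TextClassifier(), ImageClassifier()
text_feats = torch.randn(4, 768)     # placeholder encoder outputs
image_feats = torch.randn(4, 2048)
probs = late_fusion(text_model(text_feats), image_model(image_feats))
pred = probs.argmax(dim=-1)          # fused category prediction per product
```

Because fusion happens only at the prediction stage, each unimodal model can be trained and tuned independently, which is the main practical appeal of late fusion over feature-level (early) fusion.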
Related papers
- Multi-modal Retrieval Augmented Multi-modal Generation: A Benchmark, Evaluate Metrics and Strong Baselines [63.427721165404634]
This paper investigates the intriguing task of Multi-modal Retrieval Augmented Multi-modal Generation (M$2$RAG).
This task requires foundation models to browse multi-modal web pages, with mixed text and images, and generate multi-modal responses for solving user queries.
We construct a benchmark for M$2$RAG task, equipped with a suite of text-modal metrics and multi-modal metrics to analyze the capabilities of existing foundation models.
arXiv Detail & Related papers (2024-11-25T13:20:19Z) - MIND: Multimodal Shopping Intention Distillation from Large Vision-language Models for E-commerce Purchase Understanding [67.26334044239161]
MIND is a framework that infers purchase intentions from multimodal product metadata and prioritizes human-centric ones.
Using Amazon Review data, we create a multimodal intention knowledge base, which contains 1,264,441 intentions.
Our obtained intentions significantly enhance large language models in two intention comprehension tasks.
arXiv Detail & Related papers (2024-06-15T17:56:09Z) - End-to-end multi-modal product matching in fashion e-commerce [0.6047429555885261]
We present a robust multi-modal product matching system in an industry setting.
We show how a human-in-the-loop process can be combined with model-based predictions to achieve near-perfect precision.
arXiv Detail & Related papers (2024-03-18T09:12:16Z) - MMAPS: End-to-End Multi-Grained Multi-Modal Attribute-Aware Product Summarization [93.5217515566437]
Multi-modal Product Summarization (MPS) aims to increase customers' desire to purchase by highlighting product characteristics.
Existing MPS methods can produce promising results, but they still lack end-to-end product summarization.
We propose an end-to-end multi-modal attribute-aware product summarization method (MMAPS) for generating high-quality product summaries in e-commerce.
arXiv Detail & Related papers (2023-08-22T11:00:09Z) - Multimodal E-Commerce Product Classification Using Hierarchical Fusion [0.0]
The proposed method significantly outperformed the unimodal models' performance and the reported performance of similar models on our specific task.
We experimented with multiple fusion techniques and found that the best-performing way to combine the individual embeddings of the unimodal networks is a combination of concatenating and averaging the feature vectors.
arXiv Detail & Related papers (2022-07-07T14:04:42Z) - Multi-Modal Attribute Extraction for E-Commerce [4.626261940793027]
We develop a novel approach to seamlessly combine modalities, which is inspired by our single-modality investigations.
Experiments on Rakuten-Ichiba data provide empirical evidence for the benefits of our approach.
arXiv Detail & Related papers (2022-03-07T14:48:44Z) - M5Product: A Multi-modal Pretraining Benchmark for E-commercial Product Downstream Tasks [94.80043324367858]
We contribute a large-scale dataset, named M5Product, which consists of over 6 million multimodal pairs.
M5Product contains rich information of multiple modalities including image, text, table, video and audio.
arXiv Detail & Related papers (2021-09-09T13:50:22Z) - Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-modal Pretraining [108.86502855439774]
We investigate a more realistic setting that aims to perform weakly-supervised multi-modal instance-level product retrieval.
We contribute Product1M, one of the largest multi-modal cosmetic datasets for real-world instance-level retrieval.
We propose a novel model named Cross-modal contrAstive Product Transformer for instance-level prodUct REtrieval (CAPTURE).
arXiv Detail & Related papers (2021-07-30T12:11:24Z) - Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)