Multi-Modal Attribute Extraction for E-Commerce
- URL: http://arxiv.org/abs/2203.03441v1
- Date: Mon, 7 Mar 2022 14:48:44 GMT
- Title: Multi-Modal Attribute Extraction for E-Commerce
- Authors: Aloïs De la Comble, Anuvabh Dutt, Pablo Montalvo, Aghiles Salah
- Abstract summary: We develop a novel approach to seamlessly combine modalities, which is inspired by our single-modality investigations.
Experiments on Rakuten-Ichiba data provide empirical evidence for the benefits of our approach.
- Score: 4.626261940793027
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To improve users' experience as they navigate the myriad of options offered
by online marketplaces, it is essential to have well-organized product
catalogs. One key ingredient to that is the availability of product attributes
such as color or material. However, on some marketplaces such as
Rakuten-Ichiba, which we focus on, attribute information is often incomplete or
even missing. One promising solution to this problem is to rely on deep models
pre-trained on large corpora to predict attributes from unstructured data, such
as product descriptive texts and images (referred to as modalities in this
paper). However, we find that achieving satisfactory performance with this
approach is not straightforward but rather the result of several refinements,
which we discuss in this paper. We provide a detailed description of our
approach to attribute extraction, from investigating strong single-modality
methods, to building a solid multimodal model combining textual and visual
information. One key component of our multimodal architecture is a novel
approach to seamlessly combine modalities, which is inspired by our
single-modality investigations. In practice, we notice that this new
modality-merging method may suffer from a modality collapse issue, i.e., it
neglects one modality. Hence, we further propose a mitigation to this problem
based on a principled regularization scheme. Experiments on Rakuten-Ichiba data
provide empirical evidence for the benefits of our approach, which has also
been successfully deployed to Rakuten-Ichiba. We also report results on
publicly available datasets showing that our model is competitive with several
recent multimodal and unimodal baselines.
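The abstract above describes a text-image merging component plus a regularization scheme against modality collapse, but not their exact form. The PyTorch-style sketch below is only an illustration of the general idea under assumed details: a learned per-example gate over projected text and image features, and an entropy-style penalty that discourages the gate from relying on a single modality. Module names, dimensions, and the penalty itself are assumptions, not the paper's actual method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedTextImageClassifier(nn.Module):
    """Illustrative text+image fusion with a learned per-example gate.
    NOT the paper's architecture: encoders, gating, and dimensions are
    generic placeholders."""

    def __init__(self, text_dim, image_dim, hidden_dim, num_attr_values):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        self.gate = nn.Linear(2 * hidden_dim, 2)        # one weight per modality
        self.classifier = nn.Linear(hidden_dim, num_attr_values)

    def forward(self, text_feat, image_feat):
        t = torch.tanh(self.text_proj(text_feat))
        v = torch.tanh(self.image_proj(image_feat))
        w = F.softmax(self.gate(torch.cat([t, v], dim=-1)), dim=-1)
        fused = w[:, 0:1] * t + w[:, 1:2] * v           # convex combination of modalities
        return self.classifier(fused), w

def balance_penalty(gate_weights, eps=1e-8):
    """Negative entropy of the average gate weights; minimizing it pushes the
    model to keep using both modalities (an assumed mitigation for modality
    collapse, not the regularization scheme from the paper)."""
    mean_w = gate_weights.mean(dim=0)
    return (mean_w * (mean_w + eps).log()).sum()

# Usage sketch: total loss = task loss + lambda * balance penalty.
# model = GatedTextImageClassifier(768, 2048, 512, num_attr_values=30)
# logits, w = model(text_feat, image_feat)
# loss = F.cross_entropy(logits, labels) + 0.1 * balance_penalty(w)
```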
Related papers
- Versatile Medical Image Segmentation Learned from Multi-Source Datasets via Model Self-Disambiguation [9.068045557591612]
We propose a cost-effective alternative that harnesses multi-source data with only partial or sparse segmentation labels for training.
We devise strategies for model self-disambiguation, prior knowledge incorporation, and imbalance mitigation to tackle challenges associated with inconsistently labeled multi-source data.
arXiv Detail & Related papers (2023-11-17T18:28:32Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
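For readers unfamiliar with the term, the "early-fusion, single-stream" encoding mentioned in the UmURL entry above can be pictured with the generic sketch below: each modality is projected into a shared token space, the token sequences are concatenated, and a single transformer encoder processes them jointly. This is an illustrative sketch under assumed dimensions and layer choices, not UmURL's actual implementation.

```python
import torch
import torch.nn as nn

class SingleStreamEarlyFusion(nn.Module):
    """Generic early-fusion encoder (illustrative only, not the UmURL model):
    project each modality into a shared space, concatenate along the
    sequence axis, and encode everything with one shared transformer."""

    def __init__(self, joint_dim=256, modality_dims=(64, 128, 32), num_layers=4):
        super().__init__()
        self.projections = nn.ModuleList(
            [nn.Linear(d, joint_dim) for d in modality_dims]
        )
        layer = nn.TransformerEncoderLayer(d_model=joint_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, modality_sequences):
        # modality_sequences: list of tensors, each of shape (batch, seq_len_i, dim_i)
        tokens = [proj(x) for proj, x in zip(self.projections, modality_sequences)]
        fused = torch.cat(tokens, dim=1)   # early fusion: one combined token sequence
        return self.encoder(fused)         # single-stream encoding
```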
- Read, Look or Listen? What's Needed for Solving a Multimodal Dataset [7.0430001782867]
We propose a two-step method to analyze multimodal datasets, which leverages a small seed of human annotation to map each multimodal instance to the modalities required to process it.
We apply our approach to TVQA, a video question-answering dataset, and discover that most questions can be answered using a single modality, without a substantial bias towards any specific modality.
We analyze MERLOT Reserve, finding that it struggles with image-based questions compared to text and audio, as well as with auditory speaker identification.
arXiv Detail & Related papers (2023-07-06T08:02:45Z)
- Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images.
We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities.
The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
- Multi-Modal Experience Inspired AI Creation [33.34566822058209]
We study how to generate texts based on sequential multi-modal information.
We first design a multi-channel sequence-to-sequence architecture equipped with a multi-modal attention network.
We then propose a curriculum negative sampling strategy tailored for the sequential inputs.
arXiv Detail & Related papers (2022-09-02T11:50:41Z)
- Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval [152.3504607706575]
This research aims to conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories.
We first contribute the Product1M dataset and define two practical, real-world instance-level retrieval tasks.
We train a more effective cross-modal model that is adaptively capable of incorporating key concept information from the multi-modal data.
arXiv Detail & Related papers (2022-06-17T15:40:45Z)
- On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification.
We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned.
Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z)
- Latent Structures Mining with Contrastive Modality Fusion for Multimedia Recommendation [22.701371886522494]
We argue that the latent semantic item-item structures underlying multimodal contents could be beneficial for learning better item representations.
We devise a novel modality-aware structure learning module, which learns item-item relationships for each modality.
arXiv Detail & Related papers (2021-11-01T03:37:02Z)
- Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-modal Pretraining [108.86502855439774]
We investigate a more realistic setting that aims to perform weakly-supervised multi-modal instance-level product retrieval.
We contribute Product1M, one of the largest multi-modal cosmetic datasets for real-world instance-level retrieval.
We propose a novel model named Cross-modal contrAstive Product Transformer for instance-level prodUct REtrieval (CAPTURE)
arXiv Detail & Related papers (2021-07-30T12:11:24Z)
- Mining Latent Structures for Multimedia Recommendation [46.70109406399858]
We propose a LATent sTructure mining method for multImodal reCommEndation, which we term LATTICE for brevity.
We learn item-item structures for each modality and aggregate multiple modalities to obtain latent item graphs.
Based on the learned latent graphs, we perform graph convolutions to explicitly inject high-order item affinities into item representations.
arXiv Detail & Related papers (2021-04-19T03:50:24Z)
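The LATTICE entry above outlines a common recipe: build an item-item graph per modality, aggregate the graphs, and propagate item representations over the result. The sketch below illustrates that general idea with kNN graphs from modality features and simple neighborhood propagation; it is a simplified, assumed illustration, not the LATTICE implementation (function names and shapes are made up).

```python
import torch
import torch.nn.functional as F

def knn_graph(features, k=10):
    """Row-normalized kNN adjacency built from one modality's item features."""
    normed = F.normalize(features, dim=-1)
    sim = normed @ normed.T                      # cosine similarity between items
    topk = sim.topk(k, dim=-1)
    adj = torch.zeros_like(sim).scatter_(-1, topk.indices, topk.values).clamp(min=0)
    return adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-8)

def latent_item_graph(modality_features, modality_logits):
    """Aggregate per-modality item-item graphs with softmax importance weights."""
    weights = torch.softmax(modality_logits, dim=0)
    return sum(w * knn_graph(f) for w, f in zip(weights, modality_features))

def propagate(adj, item_emb, num_layers=2):
    """Simple graph convolution: repeatedly average over graph neighbors to
    inject high-order item affinities into the item representations."""
    h = item_emb
    for _ in range(num_layers):
        h = adj @ h
    return h

# Usage sketch with made-up shapes:
# visual, textual = torch.randn(1000, 64), torch.randn(1000, 128)
# adj = latent_item_graph([visual, textual], torch.zeros(2))
# refined = propagate(adj, torch.randn(1000, 32))
```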
- Automatic Validation of Textual Attribute Values in E-commerce Catalog by Learning with Limited Labeled Data [61.789797281676606]
We propose a novel meta-learning latent variable approach, called MetaBridge.
It can learn transferable knowledge from a subset of categories with limited labeled data.
It can capture the uncertainty of never-seen categories with unlabeled data.
arXiv Detail & Related papers (2020-06-15T21:31:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.