Adversarial Multimodal Representation Learning for Click-Through Rate Prediction
- URL: http://arxiv.org/abs/2003.07162v1
- Date: Sat, 7 Mar 2020 15:50:23 GMT
- Title: Adversarial Multimodal Representation Learning for Click-Through Rate Prediction
- Authors: Xiang Li, Chao Wang, Jiwei Tan, Xiaoyi Zeng, Dan Ou, Bo Zheng
- Abstract summary: We propose a novel Multimodal Adversarial Representation Network (MARN) for the Click-Through Rate (CTR) prediction task.
A multimodal attention network first calculates the weights of multiple modalities for each item according to its modality-specific features.
A multimodal adversarial network learns modality-invariant representations where a double-discriminators strategy is introduced.
- Score: 16.10640369157054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For better user experience and business effectiveness, Click-Through Rate
(CTR) prediction has been one of the most important tasks in E-commerce.
Although extensive CTR prediction models have been proposed, learning good
representations of items from multimodal features is still less investigated,
considering that an item in E-commerce usually contains multiple heterogeneous
modalities. Previous works either concatenate the multiple modality features,
which is equivalent to giving a fixed importance weight to each modality, or
learn dynamic weights of different modalities for different items through
techniques such as the attention mechanism. However, common redundant
information usually exists across multiple modalities, and dynamic modality
weights computed from this redundant information may not correctly reflect the
different importance of each modality. To address this, we explore the
complementarity and redundancy of modalities by treating modality-specific and
modality-invariant features differently. We propose a novel Multimodal
Adversarial Representation Network (MARN) for the CTR prediction task. A
multimodal attention network first calculates the weights of multiple
modalities for each item according to its modality-specific features. Then a
multimodal adversarial network learns modality-invariant representations,
where a double-discriminators strategy is introduced. Finally, we obtain the
multimodal item representations by combining both modality-specific and
modality-invariant representations. We conduct extensive experiments on both
public and industrial datasets, and the proposed method consistently achieves
remarkable improvements over state-of-the-art methods. Moreover, the approach
has been deployed in an operational E-commerce system, and online A/B testing
further demonstrates its effectiveness.
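To make the pipeline described in the abstract concrete, below is a minimal sketch in PyTorch of how the three components could fit together: per-modality encoders producing modality-specific and modality-invariant features, an attention network that weights modalities using only the modality-specific features, and a modality discriminator trained adversarially so that the invariant features cannot be traced back to their source modality. All class names, layer sizes, and the way the two parts are combined are illustrative assumptions rather than the paper's actual implementation; the paper's second discriminator, the min-max training loop, and the user/behavior side of the CTR model are omitted.

```python
# Hedged sketch of a MARN-style item representation (assumed details, not the
# paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultimodalAttention(nn.Module):
    """Weights each modality using its modality-specific features only."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, specific_feats):                       # list of (batch, dim)
        scores = torch.cat([self.score(f) for f in specific_feats], dim=1)  # (batch, M)
        weights = F.softmax(scores, dim=1)
        stacked = torch.stack(specific_feats, dim=1)          # (batch, M, dim)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)   # (batch, dim)


class MARNSketch(nn.Module):
    def __init__(self, input_dims, dim: int = 64):
        super().__init__()
        num_modalities = len(input_dims)
        # Separate encoders for modality-specific and modality-invariant features.
        self.specific = nn.ModuleList([nn.Linear(d, dim) for d in input_dims])
        self.invariant = nn.ModuleList([nn.Linear(d, dim) for d in input_dims])
        self.attention = MultimodalAttention(dim)
        # Discriminator that tries to identify the source modality of an invariant
        # feature; the encoders are trained to fool it (the paper's second
        # discriminator is omitted in this sketch).
        self.modality_discriminator = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_modalities))
        # CTR head over the item representation only; the user/context features a
        # real CTR model would also consume are omitted here.
        self.ctr_head = nn.Linear(dim, 1)

    def forward(self, modality_inputs):                       # list of (batch, d_m)
        specific = [enc(x) for enc, x in zip(self.specific, modality_inputs)]
        invariant = [enc(x) for enc, x in zip(self.invariant, modality_inputs)]
        item_specific = self.attention(specific)              # attention-weighted sum
        item_invariant = torch.stack(invariant, dim=1).mean(dim=1)
        item_repr = item_specific + item_invariant            # combine both parts
        ctr_logit = self.ctr_head(item_repr).squeeze(-1)
        return ctr_logit, invariant

    def modality_classification_loss(self, invariant):
        """Cross-entropy of the modality discriminator; in adversarial training the
        invariant encoders would be updated to maximize this (e.g. via a gradient
        reversal layer, not shown)."""
        logits = torch.cat([self.modality_discriminator(f) for f in invariant], dim=0)
        labels = torch.cat([torch.full((f.size(0),), m, dtype=torch.long)
                            for m, f in enumerate(invariant)])
        return F.cross_entropy(logits, labels)
```

The design choice mirrored here is the one the abstract motivates: attention weights are computed from the modality-specific features alone, so redundant information shared across modalities does not distort the per-modality importance.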
Related papers
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
- Beyond Unimodal Learning: The Importance of Integrating Multiple Modalities for Lifelong Learning [23.035725779568587]
We study the role and interactions of multiple modalities in mitigating forgetting in deep neural networks (DNNs).
Our findings demonstrate that leveraging multiple views and complementary information from multiple modalities enables the model to learn more accurate and robust representations.
We propose a method for integrating and aligning the information from different modalities by utilizing the relational structural similarities between the data points in each modality.
arXiv Detail & Related papers (2024-05-04T22:02:58Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- MESED: A Multi-modal Entity Set Expansion Dataset with Fine-grained Semantic Classes and Hard Negative Entities [25.059177235004952]
We propose Multi-modal Entity Set Expansion (MESE), where models integrate information from multiple modalities to represent entities.
A powerful multi-modal model, MultiExpan, is proposed, which is pre-trained on four multimodal pre-training tasks.
The MESED dataset is the first multi-modal dataset for ESE with large-scale and elaborate manual calibration.
arXiv Detail & Related papers (2023-07-27T14:09:59Z)
- DM$^2$S$^2$: Deep Multi-Modal Sequence Sets with Hierarchical Modality Attention [8.382710169577447]
Methods for extracting important information from multimodal data rely on a mid-fusion architecture.
We propose a new concept that considers multimodal inputs as a set of sequences, namely, deep multimodal sequence sets.
Our concept exhibits performance that is comparable to or better than the previous set-level models.
arXiv Detail & Related papers (2022-09-07T13:25:09Z)
- Multi-modal Contrastive Representation Learning for Entity Alignment [57.92705405276161]
Multi-modal entity alignment aims to identify equivalent entities between two different multi-modal knowledge graphs.
We propose MCLEA, a Multi-modal Contrastive Learning based Entity Alignment model.
In particular, MCLEA firstly learns multiple individual representations from multiple modalities, and then performs contrastive learning to jointly model intra-modal and inter-modal interactions.
arXiv Detail & Related papers (2022-09-02T08:59:57Z)
- Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos [58.93586436289648]
We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis.
Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
arXiv Detail & Related papers (2022-06-16T07:47:57Z)
- Channel Exchanging Networks for Multimodal and Multitask Dense Image Prediction [125.18248926508045]
We propose Channel-Exchanging-Network (CEN) which is self-adaptive, parameter-free, and more importantly, applicable for both multimodal fusion and multitask learning.
CEN dynamically exchanges channels between subnetworks of different modalities.
For the application of dense image prediction, the validity of CEN is tested by four different scenarios.
arXiv Detail & Related papers (2021-12-04T05:47:54Z)
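As an aside on the channel-exchanging idea summarized in the last entry above, the following is a minimal sketch, assuming the commonly described criterion that channels whose BatchNorm scaling factors are near zero are deemed uninformative and replaced by the corresponding channels of the other modalities; the threshold value, the averaging over other modalities, and the surrounding subnetworks are illustrative assumptions, not CEN's exact procedure.

```python
# Hedged sketch of channel exchanging between modality subnetworks (assumed
# near-zero BatchNorm scaling-factor criterion).
import torch
import torch.nn as nn


def exchange_channels(feats, bns, threshold: float = 1e-2):
    """feats: per-modality feature maps of shape (batch, C, H, W);
    bns: the matching nn.BatchNorm2d layers. A channel whose BN scaling factor
    is close to zero is treated as uninformative and replaced by the mean of
    the corresponding channels from the other modalities."""
    out = []
    for m, (x, bn) in enumerate(zip(feats, bns)):
        mask = (bn.weight.abs() < threshold).view(1, -1, 1, 1)   # (1, C, 1, 1)
        others = torch.stack([f for k, f in enumerate(feats) if k != m]).mean(dim=0)
        out.append(torch.where(mask, others, x))
    return out


# Toy usage with two modalities (e.g. RGB and depth) and 16 channels.
bns = nn.ModuleList([nn.BatchNorm2d(16), nn.BatchNorm2d(16)])
rgb, depth = torch.randn(2, 16, 8, 8), torch.randn(2, 16, 8, 8)
rgb_ex, depth_ex = exchange_channels([bns[0](rgb), bns[1](depth)], bns)
```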
This list is automatically generated from the titles and abstracts of the papers on this site.