Quadratic Interest Network for Multimodal Click-Through Rate Prediction
- URL: http://arxiv.org/abs/2504.17699v2
- Date: Fri, 25 Apr 2025 05:02:28 GMT
- Title: Quadratic Interest Network for Multimodal Click-Through Rate Prediction
- Authors: Honghao Li, Hanwei Li, Jing Zhang, Yi Zhang, Ziniu Yu, Lei Sang, Yiwen Zhang
- Abstract summary: Multimodal click-through rate (CTR) prediction is a key technique in industrial recommender systems. We propose a novel model for Task 2, named Quadratic Interest Network (QIN) for Multimodal CTR Prediction.
- Score: 12.989347150912685
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal click-through rate (CTR) prediction is a key technique in industrial recommender systems. It leverages heterogeneous modalities such as text, images, and behavioral logs to capture high-order feature interactions between users and items, thereby enhancing the system's understanding of user interests and its ability to predict click behavior. The primary challenge in this field lies in effectively utilizing the rich semantic information from multiple modalities while satisfying the low-latency requirements of online inference in real-world applications. To foster progress in this area, the Multimodal CTR Prediction Challenge Track of the WWW 2025 EReL@MIR Workshop formulates the problem into two tasks: (1) Task 1 of Multimodal Item Embedding: this task aims to explore multimodal information extraction and item representation learning methods that enhance recommendation tasks; and (2) Task 2 of Multimodal CTR Prediction: this task aims to explore which multimodal recommendation models can effectively leverage multimodal embedding features and achieve better performance. In this paper, we propose a novel model for Task 2, named Quadratic Interest Network (QIN) for Multimodal CTR Prediction. Specifically, QIN employs adaptive sparse target attention to extract multimodal user behavior features, and leverages Quadratic Neural Networks to capture high-order feature interactions. As a result, QIN achieved an AUC of 0.9798 on the leaderboard and ranked second in the competition. The model code, training logs, hyperparameter configurations, and checkpoints are available at https://github.com/salmon1802/QIN.
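The abstract names two model components: adaptive sparse target attention over the user's multimodal behavior sequence, and a Quadratic Neural Network for high-order feature interactions. Below is a minimal, illustrative PyTorch sketch of how such components could be wired together; the class names, the top-k sparsification, and the low-rank quadratic term are assumptions made for illustration, not the authors' implementation (see the linked repository for the official code).

```python
# Illustrative sketch only; module names, keep_ratio, and rank are assumptions,
# not the official QIN implementation (https://github.com/salmon1802/QIN).
import torch
import torch.nn as nn


class SparseTargetAttention(nn.Module):
    """Target attention over a behavior sequence that keeps only the top-k most
    relevant behaviors (one plausible reading of 'adaptive sparse')."""

    def __init__(self, embed_dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.keep_ratio = keep_ratio

    def forward(self, target: torch.Tensor, behaviors: torch.Tensor) -> torch.Tensor:
        # target: (B, D) candidate-item embedding; behaviors: (B, L, D) behavior embeddings
        q = self.query(target).unsqueeze(1)               # (B, 1, D)
        k = self.key(behaviors)                           # (B, L, D)
        scores = (q * k).sum(-1) / k.size(-1) ** 0.5      # (B, L) relevance scores
        k_keep = max(1, int(self.keep_ratio * scores.size(1)))
        topk = scores.topk(k_keep, dim=1)
        mask = torch.full_like(scores, float("-inf"))
        mask.scatter_(1, topk.indices, topk.values)       # drop all but the top-k behaviors
        weights = torch.softmax(mask, dim=1).unsqueeze(-1)
        return (weights * behaviors).sum(1)               # (B, D) pooled interest vector


class QuadraticInteraction(nn.Module):
    """Quadratic neural layer: mixes linear terms with pairwise (second-order)
    products of the input features, factorized through a low rank to stay cheap."""

    def __init__(self, in_dim: int, out_dim: int, rank: int = 16):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.u = nn.Linear(in_dim, rank, bias=False)
        self.v = nn.Linear(in_dim, rank, bias=False)
        self.proj = nn.Linear(rank, out_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x) + self.proj(self.u(x) * self.v(x))


if __name__ == "__main__":
    B, L, D = 4, 20, 32
    attn = SparseTargetAttention(D)
    qnn = QuadraticInteraction(2 * D, 1)
    target, behaviors = torch.randn(B, D), torch.randn(B, L, D)
    interest = attn(target, behaviors)
    logit = qnn(torch.cat([interest, target], dim=-1))    # (B, 1) CTR logit
    print(torch.sigmoid(logit).shape)
```

Factorizing the quadratic term through a low rank is one way to keep pairwise feature products cheap enough for the low-latency online inference the abstract emphasizes.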
Related papers
- On the Practice of Deep Hierarchical Ensemble Network for Ad Conversion Rate Prediction [14.649184507551436]
We propose a multitask learning framework with DHEN as the single backbone model architecture to predict all CVR tasks. We build both on-site real-time user behavior sequences and off-site conversion event sequences for CVR prediction purposes. Our method achieves state-of-the-art performance compared to previous single feature crossing modules with pre-trained user personalization features.
arXiv Detail & Related papers (2025-04-10T23:41:34Z) - M3Net: Multimodal Multi-task Learning for 3D Detection, Segmentation, and Occupancy Prediction in Autonomous Driving [48.17490295484055]
M3Net is a novel network that simultaneously tackles detection, segmentation, and 3D occupancy prediction for autonomous driving. M3Net achieves state-of-the-art multi-task learning performance on the nuScenes benchmarks.
arXiv Detail & Related papers (2025-03-23T15:08:09Z) - One Framework to Rule Them All: Unifying Multimodal Tasks with LLM Neural-Tuning [16.96824902454355]
We propose a unified framework that concurrently handles multiple tasks and modalities. In this framework, all modalities and tasks are represented as unified tokens and trained using a single, consistent approach. We present a new benchmark, MMUD, which includes samples annotated with multiple task labels. We demonstrate the ability to handle multiple tasks simultaneously in a streamlined and efficient manner.
arXiv Detail & Related papers (2024-08-06T07:19:51Z) - SEMINAR: Search Enhanced Multi-modal Interest Network and Approximate Retrieval for Lifelong Sequential Recommendation [16.370075234443245]
We propose a unified lifelong multi-modal sequence model called SEMINAR (Search Enhanced Multi-Modal Interest Network and Approximate Retrieval).
Specifically, a network called Pretraining Search Unit learns the lifelong sequences of multi-modal query-item pairs in a pretraining-finetuning manner.
To accelerate the online retrieval speed of multi-modal embedding, we propose a multi-modal codebook-based product quantization strategy.
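For context on the codebook-based product quantization strategy mentioned above, here is a minimal sketch of standard product quantization: each embedding is split into sub-vectors, and each sub-vector is replaced by the index of its nearest codebook centroid. The sub-space count and codebook sizes are illustrative assumptions; SEMINAR's multi-modal codebook design is not reproduced here.

```python
# Generic product quantization sketch; not SEMINAR's implementation.
import torch


def pq_encode(x: torch.Tensor, codebooks: torch.Tensor) -> torch.Tensor:
    """x: (N, D) embeddings; codebooks: (M, K, D//M), one codebook per sub-space.
    Returns (N, M) integer codes (nearest centroid per sub-vector)."""
    n, _ = x.shape
    m, _, sub = codebooks.shape
    parts = x.view(n, m, sub)                               # split into M sub-vectors
    # squared distance from every sub-vector to every centroid: (N, M, K)
    dists = ((parts.unsqueeze(2) - codebooks.unsqueeze(0)) ** 2).sum(-1)
    return dists.argmin(-1)


def pq_decode(codes: torch.Tensor, codebooks: torch.Tensor) -> torch.Tensor:
    """Reconstruct approximate embeddings from the compact codes."""
    m = codes.shape[1]
    parts = [codebooks[i, codes[:, i]] for i in range(m)]   # (N, D//M) each
    return torch.cat(parts, dim=-1)


if __name__ == "__main__":
    emb = torch.randn(1000, 64)
    # In practice codebooks are learned (e.g., by k-means); random here for illustration.
    books = torch.randn(8, 256, 8)       # 8 sub-spaces, 256 centroids each (assumed sizes)
    codes = pq_encode(emb, books)        # (1000, 8) small integers instead of 64 floats
    approx = pq_decode(codes, books)
    print(codes.shape, approx.shape)
```

Storing a handful of centroid indices per item instead of full float vectors is what makes the online retrieval of multi-modal embeddings fast.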
arXiv Detail & Related papers (2024-07-15T13:33:30Z) - Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - Controllable Dynamic Multi-Task Architectures [92.74372912009127]
We propose a controllable multi-task network that dynamically adjusts its architecture and weights to match the desired task preference as well as the resource constraints.
We propose a disentangled training of two hypernetworks, by exploiting task affinity and a novel branching regularized loss, to take input preferences and accordingly predict tree-structured models with adapted weights.
arXiv Detail & Related papers (2022-03-28T17:56:40Z) - Routing with Self-Attention for Multimodal Capsule Networks [108.85007719132618]
We present a new multimodal capsule network that allows us to leverage the strength of capsules in the context of a multimodal learning framework.
To adapt the capsules to large-scale input data, we propose a novel routing by self-attention mechanism that selects relevant capsules.
This allows not only robust training with noisy video data, but also scaling up the size of the capsule network compared to traditional routing methods.
arXiv Detail & Related papers (2021-12-01T19:01:26Z) - An Analysis Of Entire Space Multi-Task Models For Post-Click Conversion Prediction [3.2979460528864926]
We consider approximating the probability of post-click conversion events (installs) for mobile app advertising on a large-scale advertising platform.
We show that several different approaches result in similar levels of positive transfer from the data-abundant CTR task to the CVR task.
Our findings add to the growing body of evidence suggesting that standard multi-task learning is a sensible approach to modelling related events in real-world large-scale applications.
arXiv Detail & Related papers (2021-08-18T13:39:50Z) - Joint predictions of multi-modal ride-hailing demands: a deep multi-task multigraph learning-based approach [64.18639899347822]
We propose a deep multi-task multi-graph learning approach, which combines multiple multi-graph convolutional (MGC) networks for predicting demands for different service modes.
We show that our proposed approach outperforms the benchmark algorithms in prediction accuracy for different ride-hailing modes.
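As a rough illustration of the multi-graph convolutional (MGC) idea, the sketch below propagates per-region features over several adjacency matrices and fuses the results; the per-graph linear weights and the sum fusion are assumptions for illustration, not the paper's exact architecture.

```python
# Generic multi-graph convolution sketch; not the paper's exact MGC network.
import torch
import torch.nn as nn


class MultiGraphConv(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, num_graphs: int):
        super().__init__()
        # One propagation weight per graph (e.g., neighborhood, similarity, connectivity).
        self.weights = nn.ModuleList(
            [nn.Linear(in_dim, out_dim, bias=False) for _ in range(num_graphs)]
        )

    def forward(self, x: torch.Tensor, adjs: list) -> torch.Tensor:
        # x: (N, F) region features; adjs: list of normalized (N, N) adjacency matrices
        out = 0
        for adj, lin in zip(adjs, self.weights):
            out = out + adj @ lin(x)   # propagate over each graph, then fuse by sum
        return torch.relu(out)


if __name__ == "__main__":
    n_regions, feats = 50, 16
    adjs = [torch.softmax(torch.randn(n_regions, n_regions), dim=-1) for _ in range(3)]
    layer = MultiGraphConv(feats, 32, num_graphs=3)
    demand_repr = layer(torch.randn(n_regions, feats), adjs)
    print(demand_repr.shape)   # (50, 32) per-region representation
```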
arXiv Detail & Related papers (2020-11-11T07:10:50Z)