1$^{st}$ Place Solution of WWW 2025 EReL@MIR Workshop Multimodal CTR Prediction Challenge
- URL: http://arxiv.org/abs/2505.03543v1
- Date: Tue, 06 May 2025 13:55:22 GMT
- Title: 1$^{st}$ Place Solution of WWW 2025 EReL@MIR Workshop Multimodal CTR Prediction Challenge
- Authors: Junwei Xu, Zehao Zhao, Xiaoyu Hu, Zhenjie Song,
- Abstract summary: This report presents our 1$^{st}$ place winning solution for Task 2 of the Multimodal CTR Prediction Challenge. For multimodal information integration, we simply append the frozen multimodal embeddings to each item embedding. Experiments on the challenge dataset demonstrate the effectiveness of our method, achieving superior performance with a 0.9839 AUC on the leaderboard.
- Score: 1.509961504986039
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The WWW 2025 EReL@MIR Workshop Multimodal CTR Prediction Challenge focuses on effectively applying multimodal embedding features to improve click-through rate (CTR) prediction in recommender systems. This technical report presents our 1$^{st}$ place winning solution for Task 2, combining sequential modeling and feature interaction learning to effectively capture user-item interactions. For multimodal information integration, we simply append the frozen multimodal embeddings to each item embedding. Experiments on the challenge dataset demonstrate the effectiveness of our method, achieving superior performance with a 0.9839 AUC on the leaderboard, much higher than the baseline model. Code and configuration are available in our GitHub repository and the checkpoint of our model can be found in HuggingFace.
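The integration step described in the abstract, appending the frozen multimodal embeddings to each item's learnable ID embedding before the sequential and feature-interaction layers, can be illustrated with a minimal PyTorch sketch. The module names, dimensions, and use of `nn.Embedding.from_pretrained` below are illustrative assumptions, not the authors' released code (which is in their GitHub repository).

```python
# Minimal sketch (assumptions, not the authors' implementation): frozen
# multimodal embeddings are concatenated onto each learnable item ID
# embedding before sequential / feature-interaction modeling.
import torch
import torch.nn as nn

class ItemEncoder(nn.Module):
    def __init__(self, num_items: int, id_dim: int, mm_emb: torch.Tensor):
        super().__init__()
        self.id_emb = nn.Embedding(num_items, id_dim)  # learnable ID embedding
        # Frozen multimodal embedding table (e.g., provided pretrained features).
        self.mm_emb = nn.Embedding.from_pretrained(mm_emb, freeze=True)

    def forward(self, item_ids: torch.Tensor) -> torch.Tensor:
        # Append (concatenate) the frozen multimodal features to the item embedding.
        return torch.cat([self.id_emb(item_ids), self.mm_emb(item_ids)], dim=-1)

# Usage: encoded user-history sequences would then feed a sequential model
# and a feature-interaction layer before the final CTR prediction head.
num_items, id_dim, mm_dim = 10_000, 64, 128
frozen_mm = torch.randn(num_items, mm_dim)       # placeholder multimodal features
encoder = ItemEncoder(num_items, id_dim, frozen_mm)
history = torch.randint(0, num_items, (32, 20))  # batch of 20-item histories
print(encoder(history).shape)                    # torch.Size([32, 20, 192])
```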
Related papers
- Quizzard@INOVA Challenge 2025 -- Track A: Plug-and-Play Technique in Interleaved Multi-Image Model [0.5465345065283892]
We demonstrate the impressive performance of LLaVA-NeXT-Interleave on 22 datasets across three different tasks. We add the Dense Channel Integration (DCI) connector to LLaVA-NeXT-Interleave and compare its performance against the standard model.
arXiv Detail & Related papers (2025-06-13T12:48:39Z) - MAGMaR Shared Task System Description: Video Retrieval with OmniEmbed [55.526939500742]
We use OmniEmbed, a powerful multimodal embedding model from the Tevatron 2.0 toolkit, to generate unified embeddings for text, images, audio, and video. Our submission achieved the highest score on the MAGMaR shared task leaderboard among public submissions as of May 20th, 2025.
arXiv Detail & Related papers (2025-06-11T05:40:26Z) - Feature Fusion Revisited: Multimodal CTR Prediction for MMCTR Challenge [4.3058911704400415]
The EReL@MIR workshop provided a valuable opportunity to experiment with various approaches aimed at improving the efficiency of multimodal representation learning. Our team was honored to receive the award for Task 2 - Winner (Multimodal CTR Prediction).
arXiv Detail & Related papers (2025-04-26T16:04:33Z) - Quadratic Interest Network for Multimodal Click-Through Rate Prediction [12.989347150912685]
Multimodal click-through rate (CTR) prediction is a key technique in industrial recommender systems. We propose a novel model for Task 2, named Quadratic Interest Network (QIN), for Multimodal CTR Prediction.
arXiv Detail & Related papers (2025-04-24T16:08:52Z) - Action Recognition Using Temporal Shift Module and Ensemble Learning [0.0]
The paper presents the first-rank solution for the Multi-Modal Action Recognition Challenge, part of the Multi-Modal Visual Pattern Recognition Workshop at ICPR 2024. The competition aimed to recognize human actions using a diverse dataset of 20 action classes, collected from multi-modal sources. Our solution achieved a perfect top-1 accuracy on the test set, demonstrating the effectiveness of the proposed approach in recognizing human actions across 20 classes.
arXiv Detail & Related papers (2025-01-29T10:36:55Z) - Modeling Multi-Task Model Merging as Adaptive Projective Gradient Descent [72.10987117380584]
Merging multiple expert models offers a promising approach for performing multi-task learning without accessing their original data. We find existing methods discard task-specific information that, while causing conflicts, is crucial for performance. Our approach consistently outperforms previous methods, achieving state-of-the-art results across diverse architectures and tasks in both vision and NLP domains.
arXiv Detail & Related papers (2025-01-02T12:45:21Z) - SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation [61.392147185793476]
We present a unified and versatile foundation model, namely, SEED-X. SEED-X is able to model multi-granularity visual semantics for comprehension and generation tasks. We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications.
arXiv Detail & Related papers (2024-04-22T17:56:09Z) - Large Language Models aren't all that you need [0.0]
This paper describes the architecture and systems built towards solving the SemEval 2023 Task 2: MultiCoNER II.
We evaluate two approaches: (a) a traditional Random Fields model and (b) a Large Language Model (LLM) fine-tuned with a customized head, and compare the two.
arXiv Detail & Related papers (2024-01-01T08:32:50Z) - When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist
Models [72.8156832931841]
Generalist models are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model.
We release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction.
arXiv Detail & Related papers (2022-12-08T17:07:09Z)