Feature Fusion Revisited: Multimodal CTR Prediction for MMCTR Challenge
- URL: http://arxiv.org/abs/2504.18961v1
- Date: Sat, 26 Apr 2025 16:04:33 GMT
- Title: Feature Fusion Revisited: Multimodal CTR Prediction for MMCTR Challenge
- Authors: Junjie Zhou,
- Abstract summary: The EReL@MIR workshop provided a valuable opportunity to experiment with various approaches aimed at improving the efficiency of multimodal representation learning. Our team was honored to receive the award for Task 2 - Winner (Multimodal CTR Prediction).
- Score: 4.3058911704400415
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid advancement of Multimodal Large Language Models (MLLMs), an increasing number of researchers are exploring their application in recommendation systems. However, the high latency associated with large models presents a significant challenge for such use cases. The EReL@MIR workshop provided a valuable opportunity to experiment with various approaches aimed at improving the efficiency of multimodal representation learning for information retrieval tasks. As part of the competition's requirements, participants were mandated to submit a technical report detailing their methodologies and findings. Our team was honored to receive the award for Task 2 - Winner (Multimodal CTR Prediction). In this technical report, we present our methods and key findings. Additionally, we propose several directions for future work, particularly focusing on how to effectively integrate recommendation signals into multimodal representations. The codebase for our implementation is publicly available at: https://github.com/Lattice-zjj/MMCTR_Code, and the trained model weights can be accessed at: https://huggingface.co/FireFlyCourageous/MMCTR_DIN_MicroLens_1M_x1.
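For readers who want a concrete picture of the feature-fusion setup the title refers to, the sketch below is a minimal, hypothetical example: a trainable item-ID embedding is concatenated with a frozen multimodal item embedding, and a DIN-style attention unit scores the candidate item against the user's behavior sequence. The DIN backbone and MicroLens-1M setting are inferred only from the released model name (MMCTR_DIN_MicroLens_1M_x1); all layer sizes, names, and design details are illustrative assumptions, not the authors' released implementation.
```python
# Illustrative sketch only: a DIN-style CTR model that fuses trainable ID
# embeddings with frozen multimodal item embeddings. Hyperparameters and
# names are assumptions, not the authors' configuration.
import torch
import torch.nn as nn


class FusionDINCTR(nn.Module):
    def __init__(self, num_items: int, id_dim: int = 64, mm_dim: int = 512, hidden: int = 128):
        super().__init__()
        self.id_emb = nn.Embedding(num_items, id_dim, padding_idx=0)
        self.mm_proj = nn.Linear(mm_dim, id_dim)      # project frozen multimodal features
        d = 2 * id_dim                                # fused item dim: [ID ; projected multimodal]
        self.att_mlp = nn.Sequential(                 # DIN-style local activation unit
            nn.Linear(4 * d, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.pred_mlp = nn.Sequential(
            nn.Linear(2 * d, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def fuse(self, item_ids, mm_feats):
        # Early fusion: concatenate the trainable ID embedding with the
        # projected (frozen) multimodal embedding.
        return torch.cat([self.id_emb(item_ids), self.mm_proj(mm_feats)], dim=-1)

    def forward(self, target_id, target_mm, hist_ids, hist_mm, hist_mask):
        target = self.fuse(target_id, target_mm)                    # (B, D)
        hist = self.fuse(hist_ids, hist_mm)                         # (B, T, D)
        t = target.unsqueeze(1).expand_as(hist)                     # (B, T, D)
        att = self.att_mlp(torch.cat([hist, t, hist - t, hist * t], dim=-1)).squeeze(-1)
        att = att.masked_fill(hist_mask == 0, float("-inf"))        # ignore padded behaviors
        w = torch.softmax(att, dim=-1).unsqueeze(-1)                # attention over history
        interest = (w * hist).sum(dim=1)                            # (B, D) user interest vector
        logit = self.pred_mlp(torch.cat([interest, target], dim=-1))
        return torch.sigmoid(logit).squeeze(-1)                     # predicted click probability
```
Training such a model would typically minimize binary cross-entropy against observed click labels, with the multimodal encoder kept frozen so that only the lightweight fusion and prediction layers are learned.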
Related papers
- Quadratic Interest Network for Multimodal Click-Through Rate Prediction [12.989347150912685]
Multimodal click-through rate (CTR) prediction is a key technique in industrial recommender systems. We propose a novel model for Task 2, named Quadratic Interest Network (QIN) for Multimodal CTR Prediction.
arXiv Detail & Related papers (2025-04-24T16:08:52Z)
- The 1st EReL@MIR Workshop on Efficient Representation Learning for Multimodal Information Retrieval [49.587042083937426]
We propose the first EReL@MIR workshop at the Web Conference 2025, inviting participants to explore novel solutions. This workshop aims to provide a platform for both academic and industry researchers to engage in discussions, share insights, and foster collaboration.
arXiv Detail & Related papers (2025-04-21T01:10:59Z)
- CROSSAN: Towards Efficient and Effective Adaptation of Multiple Multimodal Foundation Models for Sequential Recommendation [6.013740443562439]
Multimodal Foundation Models (MFMs) excel at representing diverse raw modalities, but their application in sequential recommendation remains largely unexplored, and it is unclear whether multiple (>2) MFMs can be adapted efficiently for this task. We propose a plug-and-play Cross-modal Side Adapter Network (CROSSAN).
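CROSSAN's actual architecture is not detailed in this summary; as a rough, hypothetical illustration of the side-adapter idea in general (frozen foundation models, small trainable modules on the side), consider the sketch below. The class and parameter names are invented for illustration and do not come from the CROSSAN paper or code.
```python
# Illustrative side-adapter sketch (not CROSSAN's design): frozen modality
# encoders, small trainable adapters, and a simple fused output head.
import torch
import torch.nn as nn


class SideAdapterFusion(nn.Module):
    def __init__(self, encoders: dict, adapter_dim: int = 64):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)
        for enc in self.encoders.values():          # keep the foundation models frozen
            for p in enc.parameters():
                p.requires_grad = False
        self.adapters = nn.ModuleDict({
            name: nn.Sequential(nn.LazyLinear(adapter_dim), nn.ReLU(),
                                nn.Linear(adapter_dim, adapter_dim))
            for name in encoders
        })
        self.fusion = nn.LazyLinear(adapter_dim)    # fuse concatenated adapter outputs

    def forward(self, inputs: dict) -> torch.Tensor:
        outs = []
        for name, x in inputs.items():
            with torch.no_grad():                   # frozen forward pass through each encoder
                feats = self.encoders[name](x)
            outs.append(self.adapters[name](feats))
        return self.fusion(torch.cat(outs, dim=-1))
```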
arXiv Detail & Related papers (2025-04-14T15:14:59Z)
- VisualPRM: An Effective Process Reward Model for Multimodal Reasoning [76.35753243272521]
We introduce VisualPRM, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs).
Our model achieves a 5.9-point improvement across seven multimodal reasoning benchmarks.
For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels.
arXiv Detail & Related papers (2025-03-13T12:03:37Z)
- MIND: Modality-Informed Knowledge Distillation Framework for Multimodal Clinical Prediction Tasks [50.98856172702256]
We propose the Modality-INformed knowledge Distillation (MIND) framework, a multimodal model compression approach. MIND transfers knowledge from ensembles of pre-trained deep neural networks of varying sizes into a smaller multimodal student. We evaluate MIND on binary and multilabel clinical prediction tasks using time series data and chest X-ray images.
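As a generic, hypothetical illustration of the distillation step described above (not MIND's actual objective), the snippet below blends a softened KL term against averaged teacher-ensemble logits with standard hard-label supervision; the temperature and weighting are assumptions.
```python
# Illustrative distillation loss (not MIND's exact formulation): the student
# matches softened, averaged teacher logits in addition to the ground truth.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits_list, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    # Average the predictions of the teacher ensemble.
    teacher_logits = torch.stack(teacher_logits_list, dim=0).mean(dim=0)
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)    # hard-label supervision
    return alpha * kd + (1.0 - alpha) * ce
```
For the binary and multilabel clinical tasks mentioned in the summary, the cross-entropy term would be swapped for a binary cross-entropy-with-logits loss.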
arXiv Detail & Related papers (2025-02-03T08:50:00Z)
- Ducho meets Elliot: Large-scale Benchmarks for Multimodal Recommendation [9.506245109666907]
Multi-faceted features characterizing products and services may influence each customer on online selling platforms differently.
The common multimodal recommendation pipeline involves (i) extracting multimodal features, (ii) refining their high-level representations to suit the recommendation task, and (iii) predicting the user-item score.
This paper represents the first attempt to offer a large-scale benchmark for multimodal recommender systems, with a specific focus on multimodal extractors.
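To make the three-stage pipeline above concrete, here is a minimal sketch; the extractor outputs, refinement layer, and dot-product scorer are illustrative placeholders and are not components of Ducho or Elliot.
```python
# Minimal sketch of the generic multimodal recommendation pipeline:
# (i) extract features, (ii) refine them, (iii) score user-item pairs.
# All components are illustrative placeholders.
import torch
import torch.nn as nn


class PipelineSketch(nn.Module):
    def __init__(self, image_dim=512, text_dim=384, num_users=1000, dim=64):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, dim)
        # (ii) refine pretrained multimodal features into task-sized item vectors.
        self.refine = nn.Sequential(nn.Linear(image_dim + text_dim, dim), nn.ReLU())

    def forward(self, user_ids, image_feats, text_feats):
        # (i) feature extraction happens upstream with pretrained encoders
        #     (e.g., a vision and a text backbone); here we receive their outputs.
        item = self.refine(torch.cat([image_feats, text_feats], dim=-1))
        user = self.user_emb(user_ids)
        # (iii) predict the user-item score, here with a simple dot product.
        return (user * item).sum(dim=-1)
```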
arXiv Detail & Related papers (2024-09-24T08:29:10Z)
- Alt-MoE: A Scalable Framework for Bidirectional Multimodal Alignment and Efficient Knowledge Integration [6.928469290518152]
Multimodal learning has advanced significantly by aligning different modalities within shared latent spaces. Direct alignment struggles to fully leverage rich intra-modal knowledge, often requiring extensive training data to achieve cross-modal representation. We introduce Alt-MoE, a scalable multimodal alignment framework that employs a mixture of experts (MoE) model as a multi-directional connector across modalities.
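The summary does not specify how the Alt-MoE connector is built; the sketch below shows one generic way a mixture-of-experts layer can act as a connector that maps one modality's embeddings toward a shared space, with a softmax router mixing a few expert MLPs. All names and sizes are illustrative assumptions.
```python
# Illustrative MoE connector (not Alt-MoE's actual architecture): a router
# mixes a small set of expert MLPs mapping one embedding space to another.
import torch
import torch.nn as nn


class MoEConnector(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, num_experts: int = 4, hidden: int = 256):
        super().__init__()
        self.router = nn.Linear(in_dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(in_dim, hidden), nn.GELU(), nn.Linear(hidden, out_dim))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = torch.softmax(self.router(x), dim=-1)                  # (B, E) routing weights
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, out_dim)
        return (gates.unsqueeze(-1) * expert_out).sum(dim=1)           # weighted mixture
```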
arXiv Detail & Related papers (2024-09-09T10:40:50Z)
- NoteLLM-2: Multimodal Large Representation Models for Recommendation [71.87790090964734]
Large Language Models (LLMs) have demonstrated exceptional proficiency in text understanding and embedding tasks. Their potential in multimodal representation, particularly for item-to-item (I2I) recommendations, remains underexplored. We propose an end-to-end fine-tuning method that customizes the integration of any existing LLMs and vision encoders for efficient multimodal representation.
arXiv Detail & Related papers (2024-05-27T03:24:01Z)
- Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts [54.529880848937104]
We develop a unified MLLM with the MoE architecture, named Uni-MoE, that can handle a wide array of modalities.
Specifically, it features modality-specific encoders with connectors for a unified multimodal representation.
We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets.
arXiv Detail & Related papers (2024-05-18T12:16:01Z)
- Multimodal Recommender Systems: A Survey [50.23505070348051]
Multimodal Recommender Systems (MRS) have attracted much attention from both academia and industry recently.
In this paper, we give a comprehensive survey of MRS models, mainly from a technical perspective.
To provide access to more details of the surveyed papers, such as implementation code, we open-source a repository.
arXiv Detail & Related papers (2023-02-08T05:12:54Z)
- Multi-Task Fusion via Reinforcement Learning for Long-Term User Satisfaction in Recommender Systems [3.4394890850129007]
We propose a Batch Reinforcement Learning based Multi-Task Fusion framework (BatchRL-MTF).
We learn an optimal recommendation policy offline from fixed batch data for long-term user satisfaction.
Through a comprehensive investigation of user behaviors, we model the user satisfaction reward with subtle factors from two aspects: user stickiness and user activeness.
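As a toy, hypothetical illustration of combining the two satisfaction aspects mentioned above into a single scalar reward for offline policy learning (not the paper's actual reward design), the function below mixes a stickiness proxy and an activeness proxy with tunable weights.
```python
# Toy reward-shaping sketch (not the paper's reward): combine a stickiness
# signal (e.g., a retention proxy) and an activeness signal (e.g., an
# interaction-count proxy) into one scalar reward.
def satisfaction_reward(stickiness: float, activeness: float,
                        w_stick: float = 0.6, w_active: float = 0.4) -> float:
    assert abs(w_stick + w_active - 1.0) < 1e-6, "weights should sum to 1"
    return w_stick * stickiness + w_active * activeness


# Example: a user with high retention but moderate activity.
r = satisfaction_reward(stickiness=0.9, activeness=0.5)  # -> 0.74
```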
arXiv Detail & Related papers (2022-08-09T06:35:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.