FuXi-$\alpha$: Scaling Recommendation Model with Feature Interaction Enhanced Transformer
- URL: http://arxiv.org/abs/2502.03036v1
- Date: Wed, 05 Feb 2025 09:46:54 GMT
- Title: FuXi-$\alpha$: Scaling Recommendation Model with Feature Interaction Enhanced Transformer
- Authors: Yufei Ye, Wei Guo, Jin Yao Chin, Hao Wang, Hong Zhu, Xi Lin, Yuyang Ye, Yong Liu, Ruiming Tang, Defu Lian, Enhong Chen
- Abstract summary: Recent advancements have shown that expanding sequential recommendation models to large-scale recommendation models can be an effective strategy.
We propose a new model called FuXi-$\alpha$ to address these issues.
Our model outperforms existing models, with its performance continuously improving as the model size increases.
- Abstract: Inspired by scaling laws and large language models, research on large-scale recommendation models has gained significant attention. Recent advancements have shown that expanding sequential recommendation models to large-scale recommendation models can be an effective strategy. Current state-of-the-art sequential recommendation models primarily use self-attention mechanisms for explicit feature interactions among items, while implicit interactions are managed through Feed-Forward Networks (FFNs). However, these models often inadequately integrate temporal and positional information, either by adding them to attention weights or by blending them with latent representations, which limits their expressive power. A recent model, HSTU, further reduces the focus on implicit feature interactions, constraining its performance. We propose a new model called FuXi-$\alpha$ to address these issues. This model introduces an Adaptive Multi-channel Self-attention mechanism that distinctly models temporal, positional, and semantic features, along with a Multi-stage FFN to enhance implicit feature interactions. Our offline experiments demonstrate that our model outperforms existing models, with its performance continuously improving as the model size increases. Additionally, we conducted an online A/B test within the Huawei Music app, which showed a $4.76\%$ increase in the average number of songs played per user and a $5.10\%$ increase in the average listening duration per user. Our code has been released at https://github.com/USTC-StarTeam/FuXi-alpha.
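The abstract names two concrete components: an Adaptive Multi-channel Self-attention that models temporal, positional, and semantic signals distinctly rather than blending them, and a Multi-stage FFN that strengthens implicit feature interactions. The PyTorch sketch below is one minimal, hypothetical reading of that description; the per-channel attention distributions, the learned channel mixing, the bias parameterizations, and all class and argument names are my assumptions, not the released design (see the linked repository for the actual implementation).

```python
import torch
import torch.nn as nn

class MultiChannelSelfAttention(nn.Module):
    """Hypothetical sketch: one attention distribution per channel
    (semantic, positional, temporal), combined with learned weights,
    so no single signal is blended away before the softmax."""

    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # Assumed parameterizations of the non-semantic channels.
        self.pos_bias = nn.Parameter(torch.zeros(max_len, max_len))
        self.time_mlp = nn.Sequential(nn.Linear(1, 16), nn.SiLU(), nn.Linear(16, 1))
        self.channel_weight = nn.Parameter(torch.zeros(3))
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor, timestamps: torch.Tensor) -> torch.Tensor:
        # x: (B, L, d_model); timestamps: (B, L), e.g. seconds since epoch.
        B, L, _ = x.shape
        sem = (self.q(x) @ self.k(x).transpose(1, 2)) * self.scale   # semantic
        pos = self.pos_bias[:L, :L]                                  # positional
        dt = (timestamps[:, :, None] - timestamps[:, None, :]).float().abs()
        tem = self.time_mlp(dt.unsqueeze(-1)).squeeze(-1)            # temporal
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool, device=x.device), 1)
        v = self.v(x)
        outs = []
        for logits in (sem, pos, tem):  # one softmax per channel
            attn = torch.softmax(logits.masked_fill(causal, float("-inf")), dim=-1)
            outs.append(attn @ v)       # (L, L) logits broadcast over the batch
        # "Adaptive": learned mixing weights over the three channels.
        w = torch.softmax(self.channel_weight, dim=0)
        return sum(wi * oi for wi, oi in zip(w, outs))

class MultiStageFFN(nn.Module):
    """Hypothetical sketch: several residual FFN stages to deepen implicit
    feature interactions; the paper's exact staging may differ."""

    def __init__(self, d_model: int, hidden: int, stages: int = 2):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, hidden), nn.SiLU(), nn.Linear(hidden, d_model))
            for _ in range(stages)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for ffn in self.stages:
            x = x + ffn(x)  # one residual update per stage
        return x
```

Keeping a separate attention distribution per signal, rather than folding temporal and positional terms into one set of attention weights, is one way to read the expressiveness argument made in the abstract.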
Related papers
- AdaF^2M^2: Comprehensive Learning and Responsive Leveraging Features in Recommendation System [16.364341783911414]
We propose a model-agnostic framework AdaF^2M^2, short for Adaptive Feature Modeling with Feature Mask.
By arming base models with AdaF^2M^2, we conduct online A/B tests on multiple recommendation scenarios, obtaining +1.37% and +1.89% cumulative improvements on user active days and app duration respectively.
arXiv Detail & Related papers (2025-01-27T06:49:27Z)
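As a rough illustration of the feature-mask idea named in the AdaF^2M^2 entry above, the sketch below randomly zeroes whole feature-field embeddings during training so a recommender cannot over-rely on a few strong features. This is a generic construction under my own assumptions (class name, masking granularity, probability), not necessarily the paper's design.

```python
import torch
import torch.nn as nn

class FeatureMask(nn.Module):
    """Generic sketch: during training, zero out each feature field's
    embedding with probability p so the model cannot over-rely on a few
    strong features. Not necessarily AdaF^2M^2's actual masking scheme."""

    def __init__(self, p: float = 0.2):
        super().__init__()
        self.p = p

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, num_fields, d) stacked feature-field embeddings.
        if not self.training:
            return feats  # masking is a training-time augmentation only
        keep = torch.rand(feats.shape[:2], device=feats.device) > self.p
        return feats * keep.unsqueeze(-1).float()
```

- ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer [95.80384464922147]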
Continuous visual generation requires the full-sequence diffusion-based approach.
We present ACDiT, an Autoregressive blockwise Conditional Diffusion Transformer.
We demonstrate that ACDiT can be seamlessly used in visual understanding tasks despite being trained on the diffusion objective.
arXiv Detail & Related papers (2024-12-10T18:13:20Z)
- A Collaborative Ensemble Framework for CTR Prediction [73.59868761656317]
We propose a novel framework, Collaborative Ensemble Training Network (CETNet), to leverage multiple distinct models.
Unlike naive model scaling, our approach emphasizes diversity among the component models and trains them jointly through collaborative learning.
We validate our framework on three public datasets and a large-scale industrial dataset from Meta.
arXiv Detail & Related papers (2024-11-20T20:38:56Z)
- What Matters for Model Merging at Scale? [94.26607564817786]
Model merging aims to combine multiple expert models into a more capable single model.
Previous studies have primarily focused on merging a few small models.
This study systematically evaluates the utility of model merging at scale.
arXiv Detail & Related papers (2024-10-04T17:17:19Z)
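One widely studied family of merging operators in this literature is task arithmetic: add the scaled sum of "task vectors" (expert weights minus base weights) back onto the base model. The sketch below shows only that baseline operator, assuming all checkpoints share an architecture and key set; it is an illustration, not the study's full protocol.

```python
import torch

def merge_task_vectors(base_sd: dict, expert_sds: list, coef: float = 0.5) -> dict:
    """Task-arithmetic style merge: base + coef * sum(expert - base).
    Assumes every checkpoint shares the base architecture and key set."""
    merged = {}
    for key in base_sd:
        base = base_sd[key].float()
        task_vec = sum(sd[key].float() - base for sd in expert_sds)
        merged[key] = base + coef * task_vec
    return merged

# Usage sketch (names are placeholders):
# merged = merge_task_vectors(base.state_dict(),
#                             [expert_a.state_dict(), expert_b.state_dict()])
# base.load_state_dict(merged)
```

- SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]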
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction.
SMILE allows for the upscaling of source models into an MoE model without extra data or further training.
We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
arXiv Detail & Related papers (2024-08-19T17:32:15Z)
- Continuous Language Model Interpolation for Dynamic and Controllable Text Generation [7.535219325248997]
We focus on the challenging case where the model must dynamically adapt to diverse -- and often changing -- user preferences.
We leverage adaptation methods based on linear weight interpolation, casting them as continuous multi-domain interpolators.
We show that varying the weights yields predictable and consistent changes in the model outputs.
arXiv Detail & Related papers (2024-04-10T15:55:07Z)
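The mechanism named in the interpolation entry above is directly sketchable: linear interpolation between two fine-tuned checkpoints, with the coefficient acting as a continuous control knob. The function below is a minimal version, assuming both endpoints were fine-tuned from the same base model and share parameter keys.

```python
import torch

def interpolate_state_dicts(sd_a: dict, sd_b: dict, alpha: float) -> dict:
    """Linear weight interpolation: (1 - alpha) * A + alpha * B for every
    parameter. Sweeping alpha in [0, 1] traces a continuous path between
    the two endpoint models; assumes matching architectures and keys."""
    return {key: torch.lerp(sd_a[key].float(), sd_b[key].float(), alpha)
            for key in sd_a}
```

With more than two endpoints, the same idea extends to convex combinations of several fine-tuned checkpoints, which is what makes the "continuous multi-domain interpolator" reading possible.

- Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion [54.33764537135906]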
VideoQA Transformer models demonstrate competitive performance on standard benchmarks.
Do these models capture the rich multimodal structures and dynamics from video and text jointly?
Are they achieving high scores by exploiting biases and spurious features?
arXiv Detail & Related papers (2023-06-15T06:45:46Z)
- Scaling Local Self-Attention For Parameter Efficient Visual Backbones [29.396052798583234]
Self-attention has the promise of improving computer vision systems due to parameter-independent scaling of receptive fields and content-dependent interactions.
We develop a new self-attention model family, HaloNets, which reach state-of-the-art accuracies on the parameter-limited setting of the ImageNet classification benchmark.
arXiv Detail & Related papers (2021-03-23T17:56:06Z)
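The HaloNets entry above builds on blocked local self-attention: each block of pixels attends to its neighborhood extended by a small "halo". The sketch below shows only that windowing structure, deliberately simplified (single head, keys reused as values, zero padding, no learned projections or relative position embeddings); block and halo sizes and the function name are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def blocked_halo_attention(x: torch.Tensor, block: int = 4, halo: int = 1):
    """Simplified sketch of haloed local self-attention: each (block x block)
    query block attends to its zero-padded (block + 2*halo)^2 neighborhood.
    Omits projections, multiple heads, and relative position embeddings."""
    B, H, W, d = x.shape  # H and W assumed divisible by `block`
    nH, nW, bk = H // block, W // block, block + 2 * halo
    # Queries: group pixels into non-overlapping blocks.
    q = x.view(B, nH, block, nW, block, d)
    q = q.permute(0, 1, 3, 2, 4, 5).reshape(B, nH * nW, block * block, d)
    # Keys (reused as values): overlapping haloed windows via unfold.
    p = F.pad(x.permute(0, 3, 1, 2), (halo, halo, halo, halo))  # (B, d, H+2h, W+2h)
    k = p.unfold(2, bk, block).unfold(3, bk, block)             # (B, d, nH, nW, bk, bk)
    k = k.permute(0, 2, 3, 4, 5, 1).reshape(B, nH * nW, bk * bk, d)
    attn = torch.softmax(q @ k.transpose(-1, -2) * d ** -0.5, dim=-1)
    out = attn @ k                                              # (B, nH*nW, block^2, d)
    # Fold the blocks back into the spatial grid.
    out = out.view(B, nH, nW, block, block, d).permute(0, 1, 3, 2, 4, 5)
    return out.reshape(B, H, W, d)
```

- Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think! [26.215781778606168]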
Cross-modal modeling seems crucial in multimodal tasks, such as visual question answering.
We propose a new diagnostic tool, empirical multimodally-additive function projection (EMAP), for isolating whether or not cross-modal interactions improve performance for a given model on a given task.
For seven image+text classification tasks (on each of which we set new state-of-the-art benchmarks), we find that, in many cases, removing cross-modal interactions results in little to no performance degradation.
arXiv Detail & Related papers (2020-10-13T17:45:28Z)
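The EMAP entry above admits a compact description: replace the model's prediction on each matched text-image pair with its best additive (interaction-free) approximation, obtained by averaging logits over one modality at a time, then compare accuracies. A minimal NumPy sketch, assuming the crossed logit tensor has been precomputed (the tensor layout and function name are mine):

```python
import numpy as np

def emap_logits(pairwise_logits: np.ndarray) -> np.ndarray:
    """Empirical multimodally-additive function projection (EMAP).
    pairwise_logits[i, j, c] is the model's logit for class c on the
    crossed pair (text_i, image_j); computing it costs O(N^2) forward
    passes. Returns interaction-free logits for the matched pairs."""
    text_mean = pairwise_logits.mean(axis=1)        # (N, C) average over images
    image_mean = pairwise_logits.mean(axis=0)       # (N, C) average over texts
    grand_mean = pairwise_logits.mean(axis=(0, 1))  # (C,)
    return text_mean + image_mean - grand_mean      # (N, C)

# If argmax accuracy of emap_logits(...) is close to the original model's
# accuracy, cross-modal interactions contribute little on that task.
```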