OneTrans: Unified Feature Interaction and Sequence Modeling with One Transformer in Industrial Recommender
- URL: http://arxiv.org/abs/2510.26104v1
- Date: Thu, 30 Oct 2025 03:30:12 GMT
- Title: OneTrans: Unified Feature Interaction and Sequence Modeling with One Transformer in Industrial Recommender
- Authors: Zhaoqi Zhang, Haolei Pei, Jun Guo, Tianyu Wang, Yufei Feng, Hui Sun, Shaowei Liu, Aixin Sun
- Abstract summary: OneTrans is a unified Transformer backbone that simultaneously performs user-behavior sequence modeling and feature interaction. We show that OneTrans scales efficiently with increasing parameters, consistently outperforms strong baselines, and yields a 5.68% lift in per-user GMV in online A/B tests.
- Score: 32.265739328468584
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recommendation systems, scaling up feature-interaction modules (e.g., Wukong, RankMixer) or user-behavior sequence modules (e.g., LONGER) has achieved notable success. However, these efforts typically proceed on separate tracks, which not only hinders bidirectional information exchange but also prevents unified optimization and scaling. In this paper, we propose OneTrans, a unified Transformer backbone that simultaneously performs user-behavior sequence modeling and feature interaction. OneTrans employs a unified tokenizer to convert both sequential and non-sequential attributes into a single token sequence. The stacked OneTrans blocks share parameters across similar sequential tokens while assigning token-specific parameters to non-sequential tokens. Through causal attention and cross-request KV caching, OneTrans enables precomputation and caching of intermediate representations, significantly reducing computational costs during both training and inference. Experimental results on industrial-scale datasets demonstrate that OneTrans scales efficiently with increasing parameters, consistently outperforms strong baselines, and yields a 5.68% lift in per-user GMV in online A/B tests.
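To make the design described in the abstract concrete, the sketch below shows how it might be realized: a unified tokenizer turns the behavior sequence and the non-sequential attributes into one token sequence, and a OneTrans-style block applies causal self-attention over all tokens while using a shared feed-forward network for the sequential tokens and per-token feed-forward networks for the non-sequential ones. All names (UnifiedTokenizer, OneTransBlock), shapes, and hyperparameters are illustrative assumptions inferred from the abstract, not the authors' released implementation; cross-request KV caching is only noted in a comment.

```python
# Minimal PyTorch-style sketch of the OneTrans idea described in the abstract.
# Names, shapes, and the exact parameter-sharing scheme are assumptions, not
# the authors' code.
import torch
import torch.nn as nn


class UnifiedTokenizer(nn.Module):
    """Embed behavior-sequence item ids and non-sequential feature ids into
    one token sequence: [sequential tokens ... | non-sequential tokens ...].
    (A real system would use per-field vocabularies; a single one is used
    here to keep the sketch short.)"""

    def __init__(self, num_items, num_feature_values, num_ns_tokens, d_model):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, d_model)
        self.feat_emb = nn.Embedding(num_feature_values, d_model)
        self.ns_pos = nn.Parameter(torch.zeros(num_ns_tokens, d_model))

    def forward(self, behavior_ids, feature_ids):
        seq_tok = self.item_emb(behavior_ids)               # (B, L, d)
        ns_tok = self.feat_emb(feature_ids) + self.ns_pos   # (B, N, d)
        return torch.cat([seq_tok, ns_tok], dim=1)          # (B, L + N, d)


class OneTransBlock(nn.Module):
    """Self-attention over the whole token sequence; the FFN is shared across
    the sequential tokens and token-specific for the N non-sequential tokens."""

    def __init__(self, d_model, n_heads, num_ns_tokens):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.shared_ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model))
        self.token_ffns = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_ns_tokens))
        self.num_ns = num_ns_tokens

    def forward(self, x):
        T = x.size(1)
        L = T - self.num_ns
        # Causal mask: each token attends only to itself and earlier tokens.
        # This is what would let intermediate states of the long behavior
        # prefix be precomputed and reused across requests via a KV cache.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        q = self.norm1(x)
        h = x + self.attn(q, q, q, attn_mask=mask, need_weights=False)[0]
        seq_out = self.shared_ffn(self.norm2(h[:, :L]))      # shared parameters
        ns_out = torch.stack(
            [ffn(self.norm2(h[:, L + i])) for i, ffn in enumerate(self.token_ffns)],
            dim=1)                                           # per-token parameters
        return h + torch.cat([seq_out, ns_out], dim=1)


# Usage with toy sizes: 20 behavior tokens plus 4 non-sequential tokens.
tok = UnifiedTokenizer(num_items=10_000, num_feature_values=5_000,
                       num_ns_tokens=4, d_model=64)
block = OneTransBlock(d_model=64, n_heads=4, num_ns_tokens=4)
behavior = torch.randint(0, 10_000, (2, 20))   # (batch, behavior length)
features = torch.randint(0, 5_000, (2, 4))     # (batch, non-sequential fields)
out = block(tok(behavior, features))           # (2, 24, 64)
```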
Related papers
- Pooling Attention: Evaluating Pretrained Transformer Embeddings for Deception Classification [0.0]
BERT embeddings combined with logistic regression outperform neural baselines on LIAR dataset splits. This work positions attention-based token encoders as robust, architecture-centric foundations for veracity tasks.
arXiv Detail & Related papers (2025-11-28T08:32:49Z)
- Is Attention Required for Transformer Inference? Explore Function-preserving Attention Replacement [13.38679135071682]
We propose a Function-preserving Attention Replacement (FAR) framework that replaces all attention blocks in pretrained transformers with learnable sequence-to-sequence modules. We validate FAR on the DeiT vision transformer family and demonstrate that it matches the accuracy of the original models on ImageNet and multiple downstream tasks with reduced parameters and latency.
arXiv Detail & Related papers (2025-05-24T02:23:46Z)
- RingFormer: Rethinking Recurrent Transformer with Adaptive Level Signals [2.287772422489548]
We propose RingFormer, which employs one Transformer layer that processes input repeatedly in a circular, ring-like manner. This allows us to reduce the model parameters substantially while maintaining high performance in a variety of tasks such as translation and image classification.
arXiv Detail & Related papers (2025-02-18T09:34:31Z)
- ALF: Adaptive Label Finetuning for Scene Graph Generation [116.59868289196157]
Scene Graph Generation (SGG) endeavors to predict the relationships between subjects and objects in a given image.
The long-tail distribution of relations often leads to biased predictions on coarse labels, presenting a substantial hurdle in SGG.
We introduce a one-stage data transfer pipeline for SGG, termed Adaptive Label Finetuning (ALF), which eliminates the need for extra retraining sessions.
ALF achieves a 16% improvement in mR@100 compared to the typical SGG method Motif, with only a 6% increase in calculation costs compared to the state-of-the-art method IETrans.
arXiv Detail & Related papers (2023-12-29T01:37:27Z)
- Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction [126.34551436845133]
CNNs and Transformers have their own advantages, and both have been widely used for dense prediction in multi-task learning (MTL).
We present a novel MTL model that combines the merits of the deformable CNN and the query-based Transformer with shared gating for multi-task learning of dense prediction.
arXiv Detail & Related papers (2023-08-10T17:37:49Z)
- Transformers for End-to-End InfoSec Tasks: A Feasibility Study [6.847381178288385]
We implement transformer models for two distinct InfoSec data formats - specifically URLs and PE files.
We show that our URL transformer model requires a different training approach to reach high performance levels.
We demonstrate that this approach performs comparably to well-established malware detection models on benchmark PE file datasets.
arXiv Detail & Related papers (2022-12-05T23:50:46Z)
- Learning Spatial-Frequency Transformer for Visual Object Tracking [15.750739748843744]
Recent trackers adopt the Transformer to combine or replace the widely used ResNet as their new backbone network.
We believe these operations ignore the spatial prior of the target object which may lead to sub-optimal results.
We propose a unified Spatial-Frequency Transformer that models the Gaussian spatial Prior and High-frequency emphasis Attention (GPHA) simultaneously.
arXiv Detail & Related papers (2022-08-18T13:46:12Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- High-Performance Transformer Tracking [74.07751002861802]
We present a Transformer tracking method (named TransT) based on a Siamese-like feature extraction backbone, an attention-based fusion mechanism, and a classification and regression head.
Experiments show that our TransT and TransT-M methods achieve promising results on seven popular datasets.
arXiv Detail & Related papers (2022-03-25T09:33:29Z)
- TransCMD: Cross-Modal Decoder Equipped with Transformer for RGB-D Salient Object Detection [86.94578023985677]
In this work, we rethink this task from the perspective of global information alignment and transformation.
Specifically, the proposed method (TransCMD) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path.
Experimental results on seven RGB-D SOD benchmark datasets demonstrate that a simple two-stream encoder-decoder framework can surpass the state-of-the-art purely CNN-based methods.
arXiv Detail & Related papers (2021-12-04T15:45:34Z)
- IOT: Instance-wise Layer Reordering for Transformer Structures [173.39918590438245]
We break the assumption of the fixed layer order in the Transformer and introduce instance-wise layer reordering into the model structure.
Our method can also be applied to other architectures beyond Transformer.
arXiv Detail & Related papers (2021-03-05T03:44:42Z)
- Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding [90.77031668988661]
Cluster-Former is a novel clustering-based sparse Transformer to perform attention across chunked sequences.
The proposed framework is pivoted on two unique types of Transformer layer: Sliding-Window Layer and Cluster-Former Layer.
Experiments show that Cluster-Former achieves state-of-the-art performance on several major QA benchmarks.
arXiv Detail & Related papers (2020-09-13T22:09:30Z)