Make It Long, Keep It Fast: End-to-End 10k-Sequence Modeling at Billion Scale on Douyin
- URL: http://arxiv.org/abs/2511.06077v1
- Date: Sat, 08 Nov 2025 17:22:54 GMT
- Title: Make It Long, Keep It Fast: End-to-End 10k-Sequence Modeling at Billion Scale on Douyin
- Authors: Lin Guan, Jia-Qi Yang, Zhishan Zhao, Beichuan Zhang, Bo Sun, Xuanyuan Luo, Jinan Ni, Xiaowen Li, Yuhang Qi, Zhifang Fan, Hangyu Wang, Qiwei Chen, Yi Cheng, Feng Zhang, Xiao Yang,
- Abstract summary: Short-video recommenders such as Douyin must exploit extremely long user histories without breaking latency or cost budgets. We present an end-to-end system that scales long-sequence modeling to 10k-length histories in production.
- Score: 21.0248704845397
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Short-video recommenders such as Douyin must exploit extremely long user histories without breaking latency or cost budgets. We present an end-to-end system that scales long-sequence modeling to 10k-length histories in production. First, we introduce Stacked Target-to-History Cross Attention (STCA), which replaces history self-attention with stacked cross-attention from the target to the history, reducing complexity from quadratic to linear in sequence length and enabling efficient end-to-end training. Second, we propose Request Level Batching (RLB), a user-centric batching scheme that aggregates multiple targets for the same user/request to share the user-side encoding, substantially lowering sequence-related storage, communication, and compute without changing the learning objective. Third, we design a length-extrapolative training strategy -- train on shorter windows, infer on much longer ones -- so the model generalizes to 10k histories without additional training cost. Across offline and online experiments, we observe predictable, monotonic gains as we scale history length and model capacity, mirroring the scaling law behavior observed in large language models. Deployed at full traffic on Douyin, our system delivers significant improvements on key engagement metrics while meeting production latency, demonstrating a practical path to scaling end-to-end long-sequence recommendation to the 10k regime.
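The abstract names two mechanisms concrete enough to sketch. Below is a minimal PyTorch sketch of the STCA idea, assuming standard multi-head attention; the dimensions, layer layout, and module names are illustrative guesses, not the paper's actual architecture.

```python
# Minimal sketch of Stacked Target-to-History Cross Attention (STCA) as
# described in the abstract: the history is never self-attended; instead the
# target representation cross-attends to it in stacked blocks, so cost grows
# linearly in history length L rather than quadratically. Dimensions, layer
# layout, and the use of nn.MultiheadAttention are illustrative assumptions.
import torch
import torch.nn as nn

class STCABlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, target, history, history_mask=None):
        # target: (B, T, D) queries; history: (B, L, D) keys/values.
        # Only T target queries attend over L history items -> O(T * L).
        attn_out, _ = self.cross_attn(target, history, history,
                                      key_padding_mask=history_mask)
        target = self.norm1(target + attn_out)
        return self.norm2(target + self.ffn(target))

class STCA(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4, n_layers: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList(
            STCABlock(d_model, n_heads) for _ in range(n_layers))

    def forward(self, target, history, history_mask=None):
        for blk in self.blocks:              # the "stacked" part
            target = blk(target, history, history_mask)
        return target                        # (B, T, D) history-aware targets
```

Request Level Batching can be illustrated with the same module: since only the target side carries queries, all candidates for one request can query a single shared history tensor, which is a hedged reading of "share the user-side encoding".

```python
# Request Level Batching (RLB), schematically: all T candidates of one
# request query a single shared 10k-length history tensor instead of each
# candidate carrying its own copy. Shapes below are toy values.
B, T, L, D = 2, 8, 10_000, 64
model = STCA(d_model=D)
history = torch.randn(B, L, D)       # materialized once per user/request
targets = torch.randn(B, T, D)       # T candidates share the same history
scores = model(targets, history)     # (B, T, D); feed into a scoring head
```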
Related papers
- GEMs: Breaking the Long-Sequence Barrier in Generative Recommendation with a Multi-Stream Decoder [54.64137490632567]
We propose a novel and unified framework designed to capture users' sequences from long-term history. Generative Multi-streamers (GEMs) break user sequences into three streams. Extensive experiments on large-scale industrial datasets demonstrate that GEMs significantly outperforms state-of-the-art methods in recommendation accuracy.
arXiv Detail & Related papers (2026-02-14T06:42:56Z) - Length-Adaptive Interest Network for Balancing Long and Short Sequence Modeling in CTR Prediction [50.094751096858204]
LAIN is a plug-and-play framework that incorporates sequence length as a conditioning signal to balance long- and short-sequence modeling. Our work offers a general, efficient, and deployable solution to mitigate length-induced bias in sequential recommendation.
arXiv Detail & Related papers (2026-01-27T03:14:20Z) - Beat the long tail: Distribution-Aware Speculative Decoding for RL Training [75.75462952580796]
We propose a Distribution-Aware Speculative decoding (DAS) framework that accelerates RL rollouts without altering model outputs. Experiments on math and code reasoning tasks show that DAS reduces rollout time by up to 50% while preserving identical training curves.
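The summary gives no mechanism details, so the sketch below shows the generic greedy draft-and-verify loop behind speculative decoding, which is what makes "without altering model outputs" possible. DAS's distribution-aware scheduling is not reproduced here, and `draft_next` / `target_logits` are assumed toy interfaces.

```python
# Generic greedy speculative decoding: a cheap draft model proposes k tokens,
# the target model checks them in one parallel pass, and only the longest
# agreeing prefix (plus the target's correction) is kept -- so outputs match
# plain greedy decoding from the target model. This is the standard recipe,
# not DAS's distribution-aware variant; the two callables are toy interfaces.
from typing import Callable, List

def speculative_greedy(prompt: List[int],
                       draft_next: Callable[[List[int]], int],
                       target_logits: Callable[[List[int]], List[List[float]]],
                       k: int, max_new: int) -> List[int]:
    seq = list(prompt)                     # assumes a non-empty prompt
    while len(seq) < len(prompt) + max_new:
        # 1) Draft k tokens autoregressively with the cheap model.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(seq + proposal))
        # 2) One target pass over seq+proposal gives the target's greedy
        #    choice at every drafted position (row p predicts token p+1).
        logits = target_logits(seq + proposal)
        accepted = 0
        for i, tok in enumerate(proposal):
            pos = len(seq) + i - 1         # row predicting proposal[i]
            greedy = max(range(len(logits[pos])), key=logits[pos].__getitem__)
            accepted = i + 1
            if greedy != tok:              # mismatch: keep target's token, stop
                proposal[i] = greedy
                break
        seq.extend(proposal[:accepted])
    return seq[:len(prompt) + max_new]
```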
arXiv Detail & Related papers (2025-11-17T19:02:12Z) - Massive Memorization with Hundreds of Trillions of Parameters for Sequential Transducer Generative Recommenders [11.073761978382398]
We propose a novel two-stage modeling framework, namely VIrtual Sequential Target Attention (VISTA)<n>VISTA decomposes traditional target attention from a candidate item to user history items into two distinct stages.<n>Our approach achieves significant improvements in offline and online metrics and has been successfully deployed on an industry leading recommendation platform.
arXiv Detail & Related papers (2025-10-24T22:17:49Z) - Sliding Window Training -- Utilizing Historical Recommender Systems Data for Foundation Models [8.298236989162213]
Long-lived recommender systems (RecSys) often encounter lengthy user-item interaction histories that span many years.
To effectively learn long term user preferences, Large RecSys foundation models (FM) need to encode this information in pretraining.
We introduce a sliding window training technique to incorporate long user history sequences during training time without increasing the model input dimension.
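A minimal sketch of the sliding-window idea described above, assuming a fixed window and stride (both illustrative; the paper's sampling policy may differ):

```python
# Each long user history is cut into fixed-length windows, so every training
# example fits the model's input dimension while, across examples, the full
# history is covered. A tail shorter than `window` is dropped for simplicity.
from typing import Iterator, List, Sequence

def sliding_windows(history: Sequence[int], window: int, stride: int) -> Iterator[List[int]]:
    if len(history) <= window:
        yield list(history)
        return
    for start in range(0, len(history) - window + 1, stride):
        yield list(history[start:start + window])

# Usage: a 50k-event history becomes many 512-length training examples.
examples = list(sliding_windows(range(50_000), window=512, stride=256))
```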
arXiv Detail & Related papers (2024-08-21T18:59:52Z) - CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling [52.404072802235234]
We introduce Chunked Instruction-aware State Eviction (CItruS), a novel modeling technique that integrates the attention preferences useful for a downstream task into the eviction process of hidden states.
Our training-free method exhibits superior performance on long sequence comprehension and retrieval tasks over several strong baselines under the same memory budget.
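The summary describes scoring cached states by their usefulness to the downstream task and evicting the rest. Below is a toy illustration of attention-mass-based cache eviction under a fixed budget; the scoring rule is an assumption, not CItruS's exact criterion.

```python
# Keep only the cached hidden states that receive the most attention from the
# instruction/query tokens, evicting the rest to stay within a memory budget.
import torch

def evict_states(cache: torch.Tensor,          # (L, D) cached key states
                 instr: torch.Tensor,          # (T, D) instruction queries
                 budget: int) -> torch.Tensor:
    scores = (instr @ cache.T).softmax(dim=-1).mean(dim=0)       # (L,) mass
    keep = scores.topk(min(budget, cache.shape[0])).indices.sort().values
    return cache[keep]                         # (<=budget, D), order preserved

compact = evict_states(torch.randn(4096, 64), torch.randn(16, 64), budget=1024)
```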
arXiv Detail & Related papers (2024-06-17T18:34:58Z) - Training-Free Long-Context Scaling of Large Language Models [114.53296002607993]
We propose Dual Chunk Attention, which enables Llama2 70B to support context windows of more than 100k tokens without continual training.
By decomposing the attention for long sequences into chunk-based modules, DCA manages to effectively capture the relative positional information of tokens.
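As a rough illustration of the chunking idea, the toy function below remaps relative positions so that no query-key distance ever exceeds the pretrained window; the actual DCA scheme (separate intra-, inter-, and successive-chunk modules) is more involved, and this clamp rule is a simplification, not the paper's formula.

```python
# Toy chunked relative-position remap: intra-chunk pairs keep their true
# distance; inter-chunk distances are clamped into the trained range so a
# short-context model can still index a much longer sequence.
def chunked_rel_pos(i: int, j: int, chunk: int, max_dist: int) -> int:
    if i // chunk == j // chunk:       # same chunk: true relative distance
        return i - j
    return min(i - j, max_dist - 1)    # different chunks: clamp the distance
```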
arXiv Detail & Related papers (2024-02-27T12:39:23Z) - No Length Left Behind: Enhancing Knowledge Tracing for Modeling Sequences of Excessive or Insufficient Lengths [3.2687390531088414]
Knowledge tracing aims to predict students' responses to practices based on their historical question-answering behaviors.
As sequences get longer, computational costs will increase exponentially.
We propose a model called Sequence-Flexible Knowledge Tracing (SFKT).
arXiv Detail & Related papers (2023-08-07T11:30:58Z) - Sparse Attentive Memory Network for Click-through Rate Prediction with Long Sequences [10.233015715433602]
We propose a Sparse Attentive Memory network for long sequential user behavior modeling.
SAM supports efficient training and real-time inference for user behavior sequences with lengths on the scale of thousands.
SAM is successfully deployed on one of the largest international E-commerce platforms.
arXiv Detail & Related papers (2022-08-08T10:11:46Z) - Effective and Efficient Training for Sequential Recommendation using Recency Sampling [91.02268704681124]
We propose a novel Recency-based Sampling of Sequences training objective.
We show that models enhanced with our method can achieve performance exceeding or very close to that of the state-of-the-art BERT4Rec.
arXiv Detail & Related papers (2022-07-06T13:06:31Z) - Sequential Search with Off-Policy Reinforcement Learning [48.88165680363482]
We propose a highly scalable hybrid learning model that consists of an RNN learning framework and an attention model.
As a novel optimization step, we fit multiple short user sequences in a single RNN pass within a training batch, by solving a greedy knapsack problem on the fly.
We also explore the use of off-policy reinforcement learning in multi-session personalized search ranking.
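The "greedy knapsack" step above is concrete enough to sketch: variable-length short sequences are packed into fixed-capacity slots so one RNN pass covers several user sequences back to back. The first-fit-decreasing heuristic and the capacity value are illustrative assumptions.

```python
# Greedy first-fit-decreasing packing of sequence lengths into bins of fixed
# capacity; each bin becomes one RNN pass containing several short sequences.
from typing import List

def pack_sequences(lengths: List[int], capacity: int) -> List[List[int]]:
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    bins: List[List[int]] = []     # each bin holds indices of packed sequences
    space: List[int] = []          # remaining capacity per bin
    for i in order:
        for b, free in enumerate(space):
            if lengths[i] <= free:
                bins[b].append(i)
                space[b] -= lengths[i]
                break
        else:                      # no existing bin fits: open a new one
            bins.append([i])
            space.append(capacity - lengths[i])
    return bins

# pack_sequences([30, 80, 20, 60, 10], capacity=100) -> [[1, 2], [3, 0, 4]]
```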
arXiv Detail & Related papers (2022-02-01T06:52:40Z) - Dynamic Memory based Attention Network for Sequential Recommendation [79.5901228623551]
We propose a novel long sequential recommendation model called Dynamic Memory-based Attention Network (DMAN).
It segments the overall long behavior sequence into a series of sub-sequences, then trains the model and maintains a set of memory blocks to preserve long-term interests of users.
Based on the dynamic memory, the user's short-term and long-term interests can be explicitly extracted and combined for efficient joint recommendation.
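A toy illustration of the segment-then-memorize idea described above: the long sequence is cut into sub-sequences, each compressed into a memory slot, and the final user state combines the most recent sub-sequence (short-term) with the memory bank (long-term). Mean pooling and the concatenation rule are assumptions, not DMAN's actual architecture.

```python
# Segment a long behavior sequence, pool each segment into a memory slot,
# then join short-term (latest segment) and long-term (memory) interests.
import torch

def dman_like_state(seq: torch.Tensor, seg: int) -> torch.Tensor:
    # seq: (L, D) behavior embeddings; seg: sub-sequence length.
    chunks = seq.split(seg)                                  # sub-sequences
    memory = torch.stack([c.mean(dim=0) for c in chunks])    # (n_chunks, D)
    short_term = chunks[-1].mean(dim=0)     # most recent interests
    long_term = memory.mean(dim=0)          # abstracted older interests
    return torch.cat([short_term, long_term])   # (2D,) joint user state

state = dman_like_state(torch.randn(1_000, 32), seg=100)
```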
arXiv Detail & Related papers (2021-02-18T11:08:54Z) - Longformer: The Long-Document Transformer [40.18988262517733]
Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length.
We introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer.
Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention.
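The attention pattern described above is easy to make concrete. The sketch below builds only the boolean attention mask (a sliding local window plus a few global tokens); the window size and choice of global positions are illustrative assumptions.

```python
# Longformer-style mask: each token attends within a local window; designated
# global tokens attend everywhere and are attended to by everyone.
import numpy as np

def longformer_mask(seq_len: int, window: int, global_idx: list) -> np.ndarray:
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):                      # local banded attention
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True
    mask[global_idx, :] = True                    # global tokens see everything
    mask[:, global_idx] = True                    # and everything sees them
    return mask                                   # True = attention allowed

m = longformer_mask(seq_len=4096, window=256, global_idx=[0])
assert m.sum() < 4096 * 4096                      # far sparser than full attention
```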
arXiv Detail & Related papers (2020-04-10T17:54:09Z)