SARM: LLM-Augmented Semantic Anchor for End-to-End Live-Streaming Ranking
- URL: http://arxiv.org/abs/2602.09401v1
- Date: Tue, 10 Feb 2026 04:15:53 GMT
- Title: SARM: LLM-Augmented Semantic Anchor for End-to-End Live-Streaming Ranking
- Authors: Ruochen Yang, Yueyang Liu, Zijie Zhuang, Changxin Lao, Yuhui Zhang, Jiangxia Cao, Jia Xu, Xiang Chen, Haoke Xiao, Xiangyu Wu, Xiaoyou Zhou, Xiao Lv, Shuang Yang, Tingwen Liu, Zhaojie Liu, Han Li, Kun Gai
- Abstract summary: Large-scale live-streaming recommendation requires precise modeling of non-stationary content semantics under real-time serving constraints. We propose SARM, an end-to-end ranking architecture that integrates natural-language semantic anchors directly into ranking optimization. SARM is fully deployed and serves over 400 million users daily.
- Score: 49.109782956280064
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale live-streaming recommendation requires precise modeling of non-stationary content semantics under strict real-time serving constraints. In industrial deployment, two common approaches exhibit fundamental limitations: discrete semantic abstractions sacrifice descriptive precision through clustering, while dense multimodal embeddings are extracted independently and remain weakly aligned with ranking optimization, limiting fine-grained content-aware ranking. To address these limitations, we propose SARM, an end-to-end ranking architecture that integrates natural-language semantic anchors directly into ranking optimization, enabling fine-grained author representations conditioned on multimodal content. Each semantic anchor is represented as learnable text tokens jointly optimized with ranking features, allowing the model to adapt content descriptions to ranking objectives. A lightweight dual-token gated design captures domain-specific live-streaming semantics, while an asymmetric deployment strategy preserves low-latency online training and serving. Extensive offline evaluation and large-scale A/B tests show consistent improvements over production baselines. SARM is fully deployed and serves over 400 million users daily.
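The paper does not publish code, so as a hedged illustration only: the "lightweight dual-token gated design" could plausibly blend a general semantic-anchor token with a live-streaming-specific token via a learned elementwise gate. The sketch below is an assumption about the general shape of such a gate; all names, dimensions, and the gating form are hypothetical, not SARM's actual implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dual_token_gate(general_tok, domain_tok, gate_weights, gate_bias):
    """Hypothetical sketch: fuse a general semantic-anchor token with a
    domain-specific (live-streaming) token via a learned elementwise gate.
    Each output dimension is a convex combination of the two inputs."""
    fused = []
    for i, (g_i, d_i) in enumerate(zip(general_tok, domain_tok)):
        # gate input: the two token components under learned per-dim weights
        gate = sigmoid(gate_weights[i][0] * g_i
                       + gate_weights[i][1] * d_i
                       + gate_bias[i])
        fused.append(gate * g_i + (1.0 - gate) * d_i)
    return fused
```

Because the gate is a sigmoid, each fused dimension always lies between the corresponding general and domain token values, so the domain token can only reweight, never overwrite, the shared anchor.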
Related papers
- MacNet: An End-to-End Manifold-Constrained Adaptive Clustering Network for Interpretable Whole Slide Image Classification [9.952997875404634]
Clustering-based approaches can provide an explainable decision-making process but suffer from high-dimensional features and semantically ambiguous centroids. We propose an end-to-end MIL framework that integrates Grassmann re-embedding and manifold adaptive clustering. Experiments on multicentre WSI datasets demonstrate that: 1) our cluster-incorporated model achieves superior performance in both grading accuracy and interpretability; 2) end-to-end learning refines better feature representations and requires acceptable resources.
arXiv Detail & Related papers (2026-02-16T06:43:36Z) - Query as Anchor: Scenario-Adaptive User Representation via Large Language Model [28.30329175937291]
We propose Query-as-Anchor, a framework shifting user modeling from static encoding to dynamic, query-aware synthesis. We first construct UserU, an industrial-scale pre-training dataset that aligns behavioral sequences with user-understanding semantics. We introduce Cluster-based Soft Prompt Tuning to enforce discriminative latent structures. For deployment, anchoring queries at sequence termini enables KV-cache-accelerated inference with negligible incremental latency.
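The deployment trick above rests on a simple observation: if the query token is appended at the end of the behavior sequence, the expensive encoding of the shared prefix can be computed once and reused across queries. The toy class below illustrates only that caching pattern (the essence of KV-cache reuse); it is a hypothetical stand-in, not the paper's system.

```python
class CachedSequenceEncoder:
    """Toy illustration of anchoring queries at sequence termini: the shared
    behavior prefix is encoded once and cached, so each new query only adds
    incremental work. Purely hypothetical; not the paper's implementation."""

    def __init__(self):
        self._prefix_cache = {}
        self.prefix_passes = 0  # how often the expensive prefix encoding ran

    def _encode_prefix(self, behavior):
        self.prefix_passes += 1
        # stand-in for an expensive transformer pass over the behavior sequence
        return sum(hash(item) % 1000 for item in behavior)

    def score(self, behavior, query):
        key = tuple(behavior)
        if key not in self._prefix_cache:
            self._prefix_cache[key] = self._encode_prefix(key)
        # only the query anchored at the sequence terminus needs fresh work
        return self._prefix_cache[key] + (hash(query) % 1000)
```

Issuing several queries against the same behavior sequence triggers the prefix encoding exactly once, which is why incremental latency per query stays negligible.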
arXiv Detail & Related papers (2026-02-16T06:09:31Z) - Bootstrapping MLLM for Weakly-Supervised Class-Agnostic Object Counting [59.37613121962146]
We propose WS-COC, the first MLLM-driven weakly-supervised framework for class-agnostic object counting. WS-COC matches or even surpasses many state-of-the-art fully-supervised methods while significantly reducing annotation costs.
arXiv Detail & Related papers (2026-02-13T09:58:35Z) - Generalizable Prompt Tuning for Audio-Language Models via Semantic Expansion [32.60365302637783]
We propose Semantically Expanded Prompt Tuning (SEPT) for prompt tuning in audio-language models (ALMs). SEPT regularizes the prompt embedding space by incorporating semantic neighbors generated by large language models. Extensive experiments demonstrate that SEPT consistently improves generalization performance across multiple prompt tuning baselines.
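One common way to regularize an embedding toward LLM-generated semantic neighbors is to penalize its distance from the neighbors' centroid. The abstract does not specify SEPT's exact loss, so the following is only a hedged sketch of that generic pattern; the function name and the squared-distance form are assumptions.

```python
def semantic_expansion_penalty(prompt_emb, neighbor_embs):
    """Hypothetical sketch: squared distance between a learnable prompt
    embedding and the centroid of its LLM-generated semantic neighbors,
    pulling the prompt toward its semantic neighborhood during tuning."""
    dim = len(prompt_emb)
    centroid = [sum(n[i] for n in neighbor_embs) / len(neighbor_embs)
                for i in range(dim)]
    return sum((p - c) ** 2 for p, c in zip(prompt_emb, centroid))
```

The penalty is zero when the prompt sits exactly at the neighbor centroid and grows as the prompt drifts away, which is the usual mechanism by which such a term curbs overfitting to the seen classes.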
arXiv Detail & Related papers (2026-01-06T12:47:32Z) - Connecting the Dots: Training-Free Visual Grounding via Agentic Reasoning [63.109585527799005]
GroundingAgent is a visual grounding framework that operates without task-specific fine-tuning. It achieves an average zero-shot grounding accuracy of 65.1% on widely used benchmarks. It also offers strong interpretability, transparently illustrating each reasoning step.
arXiv Detail & Related papers (2025-11-24T03:11:08Z) - Few-Shot Remote Sensing Image Scene Classification with CLIP and Prompt Learning [0.9558392439655014]
We explore prompt learning as a lightweight and efficient adaptation strategy for few-shot remote sensing image scene classification. We benchmark these prompt-learning methods against two standard baselines: zero-shot CLIP with hand-crafted prompts and a linear probe trained on frozen CLIP features. Our findings underscore prompt learning as a scalable and efficient solution for bridging the domain gap in satellite and aerial imagery.
arXiv Detail & Related papers (2025-10-28T11:39:22Z) - Joint Learning using Mixture-of-Expert-Based Representation for Enhanced Speech Generation and Robust Emotion Recognition [54.44798086835314]
Speech emotion recognition (SER) plays a critical role in building emotion-aware speech systems, but its performance degrades significantly under noisy conditions. We propose the Sparse Mixture-of-Experts Representation Integration Technique (Sparse MERIT), a flexible MTL framework that applies frame-wise expert routing over self-supervised speech representations. Experiments on the MSP-Podcast corpus show that Sparse MERIT consistently outperforms baseline models on both SER and SE tasks.
arXiv Detail & Related papers (2025-09-10T10:18:56Z) - EGFormer: Towards Efficient and Generalizable Multimodal Semantic Segmentation [6.314084134346798]
EGFormer is an efficient multimodal semantic segmentation framework. It flexibly integrates an arbitrary number of modalities while significantly reducing model parameters and inference time. It achieves competitive performance with up to 88 percent reduction in parameters and 50 percent fewer GFLOPs.
arXiv Detail & Related papers (2025-05-20T07:08:49Z) - Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation [72.28364940168092]
Open-vocabulary semantic segmentation models associate vision and text to label pixels from an undefined set of classes using textual queries. We introduce Semantic Library Adaptation (SemLA), a novel framework for training-free, test-time domain adaptation.
arXiv Detail & Related papers (2025-03-27T17:59:58Z) - Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation [36.46163240168576]
Open-vocabulary semantic segmentation (OVSS) is an open-world task that aims to assign each pixel within an image to a specific class defined by arbitrary text descriptions. Recent advancements in large-scale vision-language models have demonstrated their open-vocabulary understanding capabilities. This study introduces ERR-Seg, a novel framework that effectively reduces redundancy to balance accuracy and efficiency.
arXiv Detail & Related papers (2025-01-29T13:24:53Z) - MAGIC++: Efficient and Resilient Modality-Agnostic Semantic Segmentation via Hierarchical Modality Selection [20.584588303521496]
We introduce the MAGIC++ framework, which comprises two key plug-and-play modules for effective multi-modal fusion and hierarchical modality selection. Our method achieves state-of-the-art performance on both real-world and synthetic benchmarks. It is also superior in the novel modality-agnostic setting, where it outperforms prior art by a large margin.
arXiv Detail & Related papers (2024-12-22T06:12:03Z) - Preserving Modality Structure Improves Multi-Modal Learning [64.10085674834252]
Self-supervised learning on large-scale multi-modal datasets allows learning semantically meaningful embeddings without relying on human annotations.
These methods often struggle to generalize well on out-of-domain data as they ignore the semantic structure present in modality-specific embeddings.
We propose a novel Semantic-Structure-Preserving Consistency approach to improve generalizability by preserving the modality-specific relationships in the joint embedding space.
arXiv Detail & Related papers (2023-08-24T20:46:48Z) - USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.