Exploiting ID-Text Complementarity via Ensembling for Sequential Recommendation
- URL: http://arxiv.org/abs/2512.17820v1
- Date: Fri, 19 Dec 2025 17:24:12 GMT
- Title: Exploiting ID-Text Complementarity via Ensembling for Sequential Recommendation
- Authors: Liam Collins, Bhuvesh Kumar, Clark Mingxuan Ju, Tong Zhao, Donald Loveland, Leonardo Neves, Neil Shah,
- Abstract summary: We study the complementarity of ID and modality features in Sequential Recommendation models. We propose a new SR method that preserves ID-text complementarity through independent model training, then harnesses it through a simple ensembling strategy.
- Score: 29.942561497927116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern Sequential Recommendation (SR) models commonly utilize modality features to represent items, motivated in large part by recent advancements in language and vision modeling. To do so, several works completely replace ID embeddings with modality embeddings, claiming that modality embeddings render ID embeddings unnecessary because they can match or even exceed ID embedding performance. On the other hand, many works jointly utilize ID and modality features, but posit that complex fusion strategies, such as multi-stage training and/or intricate alignment architectures, are necessary for this joint utilization. However, underlying both these lines of work is a lack of understanding of the complementarity of ID and modality features. In this work, we address this gap by studying the complementarity of ID- and text-based SR models. We show that these models do learn complementary signals, meaning that either should provide performance gain when used properly alongside the other. Motivated by this, we propose a new SR method that preserves ID-text complementarity through independent model training, then harnesses it through a simple ensembling strategy. Despite this method's simplicity, we show it outperforms several competitive SR baselines, implying that both ID and text features are necessary to achieve state-of-the-art SR performance but complex fusion architectures are not.
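The abstract describes training ID-based and text-based SR models independently and then combining them with a simple ensemble. The paper's exact ensembling rule is not given here, so the following is only a minimal illustrative sketch: a convex combination of min-max normalized score vectors from two hypothetical models, with the weight `alpha` as an assumed free parameter.

```python
import numpy as np

def ensemble_scores(id_scores, text_scores, alpha=0.5):
    """Combine candidate-item scores from an independently trained
    ID-based model and a text-based model.

    Each score vector is min-max normalized so the two models'
    scales are comparable, then mixed with weight `alpha`.
    This is an illustrative sketch, not the paper's exact method.
    """
    def normalize(s):
        s = np.asarray(s, dtype=float)
        lo, hi = s.min(), s.max()
        # Degenerate case: all scores equal -> contribute nothing
        return (s - lo) / (hi - lo) if hi > lo else np.zeros_like(s)

    return alpha * normalize(id_scores) + (1 - alpha) * normalize(text_scores)

# Example: scores over 4 candidate items from each model
id_scores = [2.0, 0.5, 1.0, 0.1]    # ID model favors item 0
text_scores = [0.2, 1.5, 1.4, 0.1]  # text model favors item 1
combined = ensemble_scores(id_scores, text_scores)
top_item = int(np.argmax(combined))  # item ranked highly by both wins
```

Because the two models are trained independently, their complementary signals survive intact; the ensemble only has to reconcile their final rankings rather than align their internal representations.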
Related papers
- The Best of the Two Worlds: Harmonizing Semantic and Hash IDs for Sequential Recommendation [51.62815306481903]
We propose name, a novel framework that harmonizes the SID and HID. Specifically, we devise a dual-branch modeling architecture that enables the model to capture the multi-granular semantics within SID while preserving the unique collaborative identity of HID. Experiments on three real-world datasets show that name balances recommendation quality for both head and tail items while surpassing the existing baselines.
arXiv Detail & Related papers (2025-12-11T07:50:53Z) - Contextualized Multimodal Lifelong Person Re-Identification in Hybrid Clothing States [2.6399783378460158]
Person Re-Identification (ReID) faces several challenges in real-world surveillance systems due to clothing changes (CCReID). Previous methods either develop models specifically for one application or treat CCReID as its own sub-problem. We introduce the LReID-Hybrid task with the goal of developing a model that handles both same-clothing (SC) and clothing-change (CC) scenarios while learning in a continual setting.
Person Re-Identification (ReID) has several challenges in real-world surveillance systems due to clothing changes (CCReID)<n>Previous existing methods either develop models specifically for one application or treat CCReID as its own sub-problem.<n>We will introduce the LReID-Hybrid task with the goal of developing a model to achieve both SC and CC while learning in a continual setting.
arXiv Detail & Related papers (2025-09-14T12:46:39Z) - SemCORE: A Semantic-Enhanced Generative Cross-Modal Retrieval Framework with MLLMs [70.79124435220695]
We propose a novel unified Semantic-enhanced generative Cross-mOdal REtrieval framework (SemCORE). We first construct a Structured natural language IDentifier (SID) that effectively aligns target identifiers with generative models optimized for natural language comprehension and generation. We then introduce a Generative Semantic Verification (GSV) strategy enabling fine-grained target discrimination.
arXiv Detail & Related papers (2025-04-17T17:59:27Z) - ChatReID: Open-ended Interactive Person Retrieval via Hierarchical Progressive Tuning for Vision Language Models [49.09606704563898]
Person re-identification is a crucial task in computer vision, aiming to recognize individuals across non-overlapping camera views. We propose a novel framework, ChatReID, that shifts the focus towards a text-side-dominated retrieval paradigm, enabling flexible and interactive re-identification. We introduce a hierarchical progressive tuning strategy, which endows Re-ID ability through three stages of tuning, i.e., from person attribute understanding to fine-grained image retrieval and to multi-modal task reasoning.
arXiv Detail & Related papers (2025-02-27T10:34:14Z) - All in One Framework for Multimodal Re-identification in the Wild [58.380708329455466]
A multimodal learning paradigm for ReID is introduced, referred to as All-in-One (AIO).
AIO harnesses a frozen pre-trained big model as an encoder, enabling effective multimodal retrieval without additional fine-tuning.
Experiments on cross-modal and multimodal ReID reveal that AIO not only adeptly handles various modal data but also excels in challenging contexts.
arXiv Detail & Related papers (2024-05-08T01:04:36Z) - ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning [57.91881829308395]
Identity-preserving text-to-image generation (ID-T2I) has received significant attention due to its wide range of application scenarios like AI portrait and advertising.
We present ID-Aligner, a general feedback learning framework to enhance ID-T2I performance.
arXiv Detail & Related papers (2024-04-23T18:41:56Z) - Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm [31.06269858216316]
We propose Infinite-ID, an ID-semantics decoupling paradigm for identity-preserved personalization.
We introduce an identity-enhanced training, incorporating an additional image cross-attention module to capture sufficient ID information.
We also introduce a feature interaction mechanism that combines a mixed attention module with an AdaIN-mean operation to seamlessly merge the two streams.
arXiv Detail & Related papers (2024-03-18T13:39:53Z) - Multi-Prompts Learning with Cross-Modal Alignment for Attribute-based Person Re-Identification [18.01407937934588]
We present a new framework called Multi-Prompts ReID (MP-ReID) based on prompt learning and language models.
MP-ReID learns to hallucinate diverse, informative, and promptable sentences for describing the query images.
Explicit prompts are obtained by ensembling generation models, such as ChatGPT and VQA models.
arXiv Detail & Related papers (2023-12-28T03:00:19Z) - VILLS -- Video-Image Learning to Learn Semantics for Person Re-Identification [51.89551385538251]
We propose VILLS (Video-Image Learning to Learn Semantics), a self-supervised method that jointly learns spatial and temporal features from images and videos.
VILLS first designs a local semantic extraction module that adaptively extracts semantically consistent and robust spatial features.
Then, VILLS designs a unified feature learning and adaptation module to represent image and video modalities in a consistent feature space.
arXiv Detail & Related papers (2023-11-27T19:30:30Z) - Semantic-aware Video Representation for Few-shot Action Recognition [1.6486717871944268]
We propose a simple yet effective Semantic-Aware Few-Shot Action Recognition (SAFSAR) model to address these issues.
We show that directly leveraging a 3D feature extractor combined with an effective feature-fusion scheme, and a simple cosine similarity for classification can yield better performance.
Experiments on five challenging few-shot action recognition benchmarks under various settings demonstrate that the proposed SAFSAR model significantly improves the state-of-the-art performance.
arXiv Detail & Related papers (2023-11-10T18:13:24Z) - Self-Sufficient Framework for Continuous Sign Language Recognition [75.60327502570242]
The goal of this work is to develop a self-sufficient framework for Continuous Sign Language Recognition. Challenges include the need for complex multi-scale features, such as hands, face, and mouth, for understanding, and the absence of frame-level annotations.
We propose Divide and Focus Convolution (DFConv) which extracts both manual and non-manual features without the need for additional networks or annotations.
DPLR propagates non-spiky frame-level pseudo-labels by combining the ground truth gloss sequence labels with the predicted sequence.
arXiv Detail & Related papers (2023-03-21T11:42:57Z)