Related papers: Cross-attention Secretly Performs Orthogonal Alignment in Recommendation Models


URL: http://arxiv.org/abs/2510.09435v1
Date: Fri, 10 Oct 2025 14:45:39 GMT
Title: Cross-attention Secretly Performs Orthogonal Alignment in Recommendation Models
Authors: Hyunin Lee, Yong Zhang, Hoang Vu Nguyen, Xiaoyi Liu, Namyong Park, Christopher Jung, Rong Jin, Yang Wang, Zhigang Wang, Somayeh Sojoudi, Xue Feng
Abstract summary: Cross-domain sequential recommendation aims to align heterogeneous user behavior sequences collected from different domains. Most researchers interpret cross-attention as residual alignment, where the output is generated by removing redundant and preserving non-redundant information. We introduce Orthogonal Alignment, a phenomenon in which cross-attention discovers novel information that is not present in the query input.
Score: 32.476422580370375
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Cross-domain sequential recommendation (CDSR) aims to align heterogeneous user behavior sequences collected from different domains. While cross-attention is widely used to enhance alignment and improve recommendation performance, its underlying mechanism is not fully understood. Most researchers interpret cross-attention as residual alignment, where the output is generated by removing redundant information from the query input and preserving non-redundant information, with reference to data from another domain supplied as the key and value inputs. Beyond the prevailing view, we introduce Orthogonal Alignment, a phenomenon in which cross-attention discovers novel information that is not present in the query input, and further argue that these two contrasting alignment mechanisms can co-exist in recommendation models. We find, over 300 experiments, that model performance improves when the query input and output of cross-attention are orthogonal. Notably, Orthogonal Alignment emerges naturally, without any explicit orthogonality constraints. Our key insight is that Orthogonal Alignment emerges naturally because it improves the scaling law. We show that baselines that additionally incorporate a cross-attention module outperform parameter-matched baselines, achieving superior accuracy per model parameter. We hope these findings offer new directions for parameter-efficient scaling in multi-modal research.
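Illustrative sketch (not taken from the paper): the abstract describes orthogonality between the query input and the output of cross-attention. One way such a relationship could be measured is the mean absolute cosine similarity between each query token and its cross-attention output, where values near 0 indicate near-orthogonal directions. The single-head layout, the random weights, and the names `cross_attention` and `mean_abs_cosine` are assumptions made for illustration; the paper's actual metric and architecture are not specified on this page.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(x_query, x_context, w_q, w_k, w_v):
    # Hypothetical single-head cross-attention: queries come from one
    # domain, keys and values come from the other domain.
    q = x_query @ w_q
    k = x_context @ w_k
    v = x_context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def mean_abs_cosine(a, b, eps=1e-12):
    # Mean absolute cosine similarity between matching rows of a and b.
    # Near 0: rows are close to orthogonal. Near 1: rows are aligned.
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return float(np.mean(np.abs(num / den)))

rng = np.random.default_rng(0)
d = 16
x_a = rng.normal(size=(5, d))  # assumed domain-A sequence (query input)
x_b = rng.normal(size=(7, d))  # assumed domain-B sequence (key/value input)
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

out = cross_attention(x_a, x_b, w_q, w_k, w_v)
score = mean_abs_cosine(x_a, out)  # lower means closer to orthogonal
```

With random, untrained weights the score only demonstrates the measurement; it says nothing about whether a trained recommendation model exhibits the phenomenon.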

Related papers

Inference-time Alignment via Sparse Junction Steering [25.464612964225484]
Token-level steering has emerged as a pivotal approach for inference-time alignment. Existing methods rely on dense intervention at every decoding step. We show that dense intervention is unnecessary and propose sparse junction steering.
arXiv Detail & Related papers (2026-01-30T08:40:47Z)
OrthAlign: Orthogonal Subspace Decomposition for Non-Interfering Multi-Objective Alignment [61.02595549125661]
Large language model (LLM) alignment faces a critical dilemma when addressing multiple human preferences. We present OrthAlign, an innovative approach to resolve gradient-level conflicts in preference alignment. We show that OrthAlign achieves maximum single-preference improvements ranging from 34.61% to 50.89% after multiple-objective alignment.
arXiv Detail & Related papers (2025-09-29T11:16:30Z)
ERIS: An Energy-Guided Feature Disentanglement Framework for Out-of-Distribution Time Series Classification [51.07970070817353]
An ideal time series classification (TSC) model should be able to capture invariant representations. Current methods are largely unguided, lacking the semantic direction required to isolate truly universal features. We propose an end-to-end Energy-Regularized Information for Shift-Robustness framework to enable guided and reliable feature disentanglement.
arXiv Detail & Related papers (2025-08-19T12:13:41Z)
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free [81.65559031466452]
We conduct experiments to investigate gating-augmented softmax attention variants. We find that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance.
arXiv Detail & Related papers (2025-05-10T17:15:49Z)
RAU: Towards Regularized Alignment and Uniformity for Representation Learning in Recommendation [7.193305599721105]
We propose Regularized Alignment and Uniformity (RAU) to cope with sparse alignment and uneven uniformity issues. RAU consists of two novel regularization methods for alignment and uniformity to learn better user/item representation.
arXiv Detail & Related papers (2025-03-24T03:03:21Z)
Breaking Determinism: Fuzzy Modeling of Sequential Recommendation Using Discrete State Space Diffusion Model [66.91323540178739]
Sequential recommendation (SR) aims to predict items that users may be interested in based on their historical behavior. We revisit SR from a novel information-theoretic perspective and find that sequential modeling methods fail to adequately capture randomness and unpredictability of user behavior. Inspired by fuzzy information processing theory, this paper introduces the fuzzy sets of interaction sequences to overcome the limitations and better capture the evolution of users' real interests.
arXiv Detail & Related papers (2024-10-31T14:52:01Z)
Long-Sequence Recommendation Models Need Decoupled Embeddings [49.410906935283585]
We identify and characterize a neglected deficiency in existing long-sequence recommendation models. A single set of embeddings struggles with learning both attention and representation, leading to interference between these two processes. We propose the Decoupled Attention and Representation Embeddings (DARE) model, where two distinct embedding tables are learned separately to fully decouple attention and representation.
arXiv Detail & Related papers (2024-10-03T15:45:15Z)
Sequential Recommendation via Adaptive Robust Attention with Multi-dimensional Embeddings [7.207685588038045]
Sequential recommendation models have achieved state-of-the-art performance using the self-attention mechanism. Moving beyond only using item ID and positional embeddings leads to a significant accuracy boost when predicting the next item. We introduce a mix-attention mechanism with a layer-wise noise injection (LNI) regularization to improve the model's robustness and generalization.
arXiv Detail & Related papers (2024-09-08T08:27:22Z)
Visual Prompt Tuning in Null Space for Continual Learning [51.96411454304625]
Existing prompt-tuning methods have demonstrated impressive performance in continual learning (CL). This paper aims to learn each task by tuning the prompts in the direction orthogonal to the subspace spanned by previous tasks' features. In practice, an effective null-space-based approximation solution has been proposed to implement the prompt gradient projection.
arXiv Detail & Related papers (2024-06-09T05:57:40Z)
Focus the Discrepancy: Intra- and Inter-Correlation Learning for Image Anomaly Detection [13.801572236048601]
In this paper, we propose a novel AD framework: FOcus-the-Discrepancy (FOD), which can simultaneously spot the patch-wise, intra- and inter-discrepancies of anomalies.
arXiv Detail & Related papers (2023-08-06T01:30:26Z)
Attention that does not Explain Away [54.42960937271612]
Models based on the Transformer architecture have achieved better accuracy than the ones based on competing architectures for a large set of tasks. A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances. We propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the "explaining away" effect.
arXiv Detail & Related papers (2020-09-29T21:05:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.