Related papers: CM$^3$: Calibrating Multimodal Recommendation

URL: http://arxiv.org/abs/2508.01226v1
Date: Sat, 02 Aug 2025 06:44:59 GMT
Title: CM$^3$: Calibrating Multimodal Recommendation
Authors: Xin Zhou, Yongjie Wang, Zhiqi Shen
Abstract summary: This study revisits the alignment and uniformity properties within the context of multimodal recommender systems. We propose a more nuanced approach wherein items with similar multimodal attributes converge toward proximal representations within the hyperspheric manifold. We also introduce a Spherical Bézier method designed to integrate an arbitrary number of modalities while ensuring that the resulting fused features are constrained to the same hyperspherical manifold.
Score: 10.09576389984858
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Alignment and uniformity are fundamental principles within the domain of contrastive learning. In recommender systems, prior work has established that optimizing the Bayesian Personalized Ranking (BPR) loss contributes to the objectives of alignment and uniformity. Specifically, alignment aims to draw together the representations of interacting users and items, while uniformity mandates a uniform distribution of user and item embeddings across a unit hypersphere. This study revisits the alignment and uniformity properties within the context of multimodal recommender systems, revealing a proclivity among extant models to prioritize uniformity to the detriment of alignment. Our hypothesis challenges the conventional assumption of equitable item treatment through a uniformity loss, proposing a more nuanced approach wherein items with similar multimodal attributes converge toward proximal representations within the hyperspheric manifold. Specifically, we leverage the inherent similarity between items' multimodal data to calibrate their uniformity distribution, thereby inducing a more pronounced repulsive force between dissimilar entities within the embedding space. A theoretical analysis elucidates the relationship between this calibrated uniformity loss and the conventional uniformity function. Moreover, to enhance the fusion of multimodal features, we introduce a Spherical Bézier method designed to integrate an arbitrary number of modalities while ensuring that the resulting fused features are constrained to the same hyperspherical manifold. Empirical evaluations conducted on five real-world datasets substantiate the superiority of our approach over competing baselines. We also show that the proposed methods can achieve up to a 5.4% increase in NDCG@20 performance via the integration of MLLM-extracted features. Source code is available at: https://github.com/enoche/CM3.
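
The quantities named in the abstract can be illustrated with a minimal sketch. It is not taken from the CM3 repository. It assumes the commonly used forms of the alignment loss (mean squared distance between paired unit embeddings) and the uniformity loss (log of the mean Gaussian potential over pairs, with an assumed temperature `t=2.0`). It also approximates a spherical Bézier curve by running the de Casteljau recursion with spherical linear interpolation (slerp). All function names and the single blending weight `w` are hypothetical, and the paper's calibrated uniformity loss is not reproduced here.

```python
import math
from itertools import combinations


def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment(users, items):
    """Mean squared distance between paired (interacting) user/item embeddings."""
    return sum(sq_dist(u, i) for u, i in zip(users, items)) / len(users)


def uniformity(points, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs (t is assumed)."""
    pairs = list(combinations(points, 2))
    total = sum(math.exp(-t * sq_dist(a, b)) for a, b in pairs)
    return math.log(total / len(pairs))


def slerp(a, b, w):
    """Spherical linear interpolation between unit vectors a and b.

    Assumes a and b are not antipodal; that case is not handled.
    """
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)
    if omega < 1e-8:  # nearly identical directions
        return list(a)
    s = math.sin(omega)
    ca = math.sin((1.0 - w) * omega) / s
    cb = math.sin(w * omega) / s
    return [ca * x + cb * y for x, y in zip(a, b)]


def spherical_bezier(controls, w):
    """De Casteljau recursion with slerp in place of linear interpolation.

    Each control point stands for one modality's feature vector (assumption).
    """
    pts = [normalize(c) for c in controls]
    while len(pts) > 1:
        pts = [slerp(p, q, w) for p, q in zip(pts, pts[1:])]
    return pts[0]


# Three hypothetical modality features fused into one unit-norm vector.
fused = spherical_bezier([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], 0.5)
```

Because every slerp step returns a unit vector, the fused feature stays on the same hypersphere regardless of how many modality vectors are supplied, which is the property the abstract attributes to the Spherical Bézier method.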

Related papers

Functionality-Oriented LLM Merging on the Fisher--Rao Manifold [14.349284217707575]
Weight-space merging aims to combine multiple fine-tuned LLMs into a single model without retraining. We derive a practical fixed-point algorithm using a lightweight spherical proxy that preserves norms and generalizes directly to multi-expert merging.
arXiv Detail & Related papers (2026-03-05T09:08:38Z)
Universal NP-Hardness of Clustering under General Utilities [11.62669179647184]
We formalise the common optimisation core motivating a diverse-time computable partition utility over a finite metric space. By mapping ten major paradigms -- including k-means, GMMs, DBSCAN, spectral clustering, and affinity propagation -- to the UCP framework, we demonstrate that each inherits this fundamental inability.
arXiv Detail & Related papers (2026-02-27T13:08:15Z)
Stability and Generalization of Push-Sum Based Decentralized Optimization over Directed Graphs [55.77845440440496]
Push-based decentralized communication enables optimization over communication networks, where information exchange may be asymmetric. We develop a unified uniform-stability framework for the Stochastic Gradient Push (SGP) algorithm. A key technical ingredient is an imbalance-aware generalization bound through two quantities.
arXiv Detail & Related papers (2026-02-24T05:32:03Z)
Explainable Multimodal Regression via Information Decomposition [27.157278306251772]
We propose a novel multimodal regression framework grounded in Partial Information Decomposition (PID). Our framework outperforms state-of-the-art methods in both predictive accuracy and interpretability, while also enabling informed modality selection for efficient inference.
arXiv Detail & Related papers (2025-12-26T18:07:18Z)
Amplifying Prominent Representations in Multimodal Learning via Variational Dirichlet Process [55.91649771370862]
The Dirichlet process (DP) mixture model is a powerful non-parametric method that can amplify the most prominent features. We propose a new DP-driven multimodal learning framework that automatically achieves an optimal balance between prominent intra-modal representation learning and cross-modal alignment.
arXiv Detail & Related papers (2025-10-23T16:53:24Z)
Principled Multimodal Representation Learning [70.60542106731813]
Multimodal representation learning seeks to create a unified representation space by integrating diverse data modalities. Recent advances have investigated the simultaneous alignment of multiple modalities, yet several challenges remain. We propose Principled Multimodal Representation Learning (PMRL), a novel framework that achieves simultaneous alignment of multiple modalities.
arXiv Detail & Related papers (2025-07-23T09:12:25Z)
RAU: Towards Regularized Alignment and Uniformity for Representation Learning in Recommendation [7.193305599721105]
We propose Regularized Alignment and Uniformity (RAU) to cope with sparse alignment and uneven uniformity issues. RAU consists of two novel regularization methods for alignment and uniformity to learn better user/item representation.
arXiv Detail & Related papers (2025-03-24T03:03:21Z)
DecAlign: Hierarchical Cross-Modal Alignment for Decoupled Multimodal Representation Learning [7.947217265041953]
Multimodal representation learning aims to capture both shared and complementary semantic information across multiple modalities. We introduce DecAlign, a novel hierarchical cross-modal alignment framework designed to decouple multimodal representations into modality-unique (heterogeneous) and modality-common (homogeneous) features. Our experiments on four widely used multimodal benchmarks demonstrate that DecAlign consistently outperforms existing state-of-the-art methods.
arXiv Detail & Related papers (2025-03-14T21:47:48Z)
Disentangled Interleaving Variational Encoding [1.132458063021286]
We propose a principled approach to disentangle the original input into marginal and conditional probability distributions in the latent space of a variational autoencoder. Our proposed model, Deep Disentangled Interleaving Variational Encoder (DeepDIVE), learns disentangled features from the original input to form clusters in the embedding space. Experiments on two public datasets show that DeepDIVE disentangles the original input and yields forecast accuracies better than the original VAE.
arXiv Detail & Related papers (2025-01-15T10:50:54Z)
Generalization Bounds of Surrogate Policies for Combinatorial Optimization Problems [53.03951222945921]
We analyze smoothed (perturbed) policies, adding controlled random perturbations to the direction used by the linear oracle. Our main contribution is a generalization bound that decomposes the excess risk into perturbation bias, statistical estimation error, and optimization error. We illustrate the scope of the results on applications such as vehicle scheduling, highlighting how smoothing enables both tractable training and controlled generalization.
arXiv Detail & Related papers (2024-07-24T12:00:30Z)
Collaborative Heterogeneous Causal Inference Beyond Meta-analysis [68.4474531911361]
We propose a collaborative inverse propensity score estimator for causal inference with heterogeneous data. Our method shows significant improvements over the methods based on meta-analysis when heterogeneity increases.
arXiv Detail & Related papers (2024-04-24T09:04:36Z)
Synergistic eigenanalysis of covariance and Hessian matrices for enhanced binary classification [72.77513633290056]
We present a novel approach that combines the eigenanalysis of a covariance matrix evaluated on a training set with a Hessian matrix evaluated on a deep learning model. Our method captures intricate patterns and relationships, enhancing classification performance.
arXiv Detail & Related papers (2024-02-14T16:10:42Z)
Deep Diversity-Enhanced Feature Representation of Hyperspectral Images [87.47202258194719]
We rectify 3D convolution by modifying its topology to enhance the rank upper-bound. We also propose a novel diversity-aware regularization (DA-Reg) term that acts on the feature maps to maximize independence among elements. To demonstrate the superiority of the proposed Re$3$-ConvSet and DA-Reg, we apply them to various HS image processing and analysis tasks.
arXiv Detail & Related papers (2023-01-15T16:19:18Z)
Geodesic Multi-Modal Mixup for Robust Fine-Tuning [21.298732743643168]
We show that CLIP retains poor uniformity and alignment even after fine-tuning. We propose a Geodesic Multi-Modal Mixup that mixes the embeddings of image and text to generate hard negative samples. Our method provides transferable representations, enabling robust model adaptation on diverse tasks.
arXiv Detail & Related papers (2022-03-08T07:34:52Z)
A Unified Framework for Multi-distribution Density Ratio Estimation [101.67420298343512]
Binary density ratio estimation (DRE) provides the foundation for many state-of-the-art machine learning algorithms. We develop a general framework from the perspective of Bregman divergence minimization. We show that our framework leads to methods that strictly generalize their counterparts in binary DRE.
arXiv Detail & Related papers (2021-12-07T01:23:20Z)
Trustworthy Multimodal Regression with Mixture of Normal-inverse Gamma Distributions [91.63716984911278]
We introduce a novel Mixture of Normal-Inverse Gamma distributions (MoNIG) algorithm, which efficiently estimates uncertainty in principle for adaptive integration of different modalities and produces a trustworthy regression result. Experimental results on both synthetic and different real-world data demonstrate the effectiveness and trustworthiness of our method on various multimodal regression tasks.
arXiv Detail & Related papers (2021-11-11T14:28:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.