On Softmax Direct Preference Optimization for Recommendation
- URL: http://arxiv.org/abs/2406.09215v3
- Date: Thu, 07 Nov 2024 18:30:53 GMT
- Title: On Softmax Direct Preference Optimization for Recommendation
- Authors: Yuxin Chen, Junfei Tan, An Zhang, Zhengyi Yang, Leheng Sheng, Enzhi Zhang, Xiang Wang, Tat-Seng Chua
- Abstract summary: We propose Softmax-DPO (S-DPO) to instill ranking information into the LM to help LM-based recommenders distinguish preferred items from negatives.
Specifically, we incorporate multiple negatives in user preference data and devise an alternative version of DPO loss tailored for LM-based recommenders.
- Score: 50.896117978746
- Abstract: Recommender systems aim to predict personalized rankings based on user preference data. With the rise of Language Models (LMs), LM-based recommenders have been widely explored due to their extensive world knowledge and powerful reasoning abilities. Most LM-based recommenders convert historical interactions into language prompts, pair them with a positive item as the target response, and fine-tune the LM with a language modeling loss. However, this objective fails to fully leverage preference data and is not optimized for personalized ranking tasks, which hinders the performance of LM-based recommenders. Inspired by recent advances in Direct Preference Optimization (DPO) for human preference alignment and the success of softmax loss in recommendation, we propose Softmax-DPO (S-DPO) to instill ranking information into the LM and help LM-based recommenders distinguish preferred items from negatives, rather than focusing solely on positives. Specifically, we incorporate multiple negatives in user preference data and devise an alternative version of the DPO loss tailored for LM-based recommenders, which is extended from the traditional full-ranking Plackett-Luce (PL) model to partial rankings and connected to softmax sampling strategies. Theoretically, we bridge S-DPO with the softmax loss over negative sampling and find that it has an inherent benefit of mining hard negatives, which assures its exceptional capabilities in recommendation tasks. Empirically, extensive experiments on three real-world datasets demonstrate the superiority of S-DPO in modeling user preference and boosting recommendation performance while providing better rewards for preferred items. Our code is available at https://github.com/chenyuxin1999/S-DPO.
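To make the objective concrete, below is a minimal PyTorch-style sketch of a multi-negative, DPO-style loss consistent with the abstract's description: implicit rewards are policy-to-reference log-ratios, and one preferred item is contrasted against several negatives through a softmax-like log-sum-exp over reward margins. The function name, tensor shapes, and exact form are illustrative assumptions rather than the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

def multi_negative_dpo_loss(pos_logp, pos_logp_ref, neg_logps, neg_logps_ref, beta=1.0):
    """Hedged sketch of a multi-negative, DPO-style ranking loss.

    Assumed inputs (names are illustrative):
      pos_logp, pos_logp_ref: (B,) sequence log-likelihoods of the preferred item's
          response under the policy and a frozen reference model.
      neg_logps, neg_logps_ref: (B, N) log-likelihoods of N negative items.
    """
    # Implicit DPO-style rewards: beta * log(pi_theta / pi_ref).
    pos_reward = beta * (pos_logp - pos_logp_ref)        # (B,)
    neg_rewards = beta * (neg_logps - neg_logps_ref)     # (B, N)

    # Softmax-style aggregation over negatives: log-sum-exp of the
    # (negative - positive) reward margins, passed through a log-sigmoid
    # as in pairwise DPO.
    margins = neg_rewards - pos_reward.unsqueeze(-1)     # (B, N)
    return -F.logsigmoid(-torch.logsumexp(margins, dim=-1)).mean()
```

With a single negative this reduces to the standard pairwise DPO loss; with several negatives the log-sum-exp is dominated by the highest-reward (hardest) negatives, which matches the hard-negative-mining interpretation stated in the abstract.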
Related papers
- Preference Diffusion for Recommendation [50.8692409346126]
We propose PreferDiff, a tailored optimization objective for DM-based recommenders.
PreferDiff transforms BPR into a log-likelihood ranking objective to better capture user preferences.
It is the first personalized ranking loss designed specifically for DM-based recommenders.
arXiv Detail & Related papers (2024-10-17T01:02:04Z)
- Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [63.32585910975191]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.
We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset.
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
- TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization With Estimated Weights [73.9088920210495]
We propose a token-level importance sampling DPO objective named TIS-DPO that assigns importance weights to each token based on its reward.
TIS-DPO significantly outperforms various baseline methods on harmlessness and helpfulness alignment and summarization tasks.
arXiv Detail & Related papers (2024-10-06T04:03:00Z)
- ASFT: Aligned Supervised Fine-Tuning through Absolute Likelihood [14.512464277772194]
Aligned Supervised Fine-Tuning (ASFT) is an effective approach that better aligns Large Language Models with pair-wise preference data.
ASFT mitigates the issue where the DPO loss function decreases the probability of generating human-dispreferred data.
Extensive experiments demonstrate that ASFT is an effective alignment approach, consistently outperforming existing methods.
arXiv Detail & Related papers (2024-09-14T11:39:13Z)
- Review-driven Personalized Preference Reasoning with Large Language Models for Recommendation [21.769969074938142]
EXP3RT is a novel LLM-based recommender designed to leverage rich preference information contained in user and item reviews.
It generates detailed step-by-step reasoning followed by a predicted rating.
Experiments show that EXP3RT outperforms existing methods on both rating prediction and candidate item reranking for top-k recommendation.
arXiv Detail & Related papers (2024-08-12T16:39:03Z)
- Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z)
- Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models.
The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models.
Our experiments demonstrate that LLMs fine-tuned with MRPO generalize better across various preference data, regardless of data scarcity or abundance.
arXiv Detail & Related papers (2024-05-26T00:29:04Z)
- Statistical Rejection Sampling Improves Preference Optimization [42.57245965632205]
We introduce a novel approach to source preference data from the target optimal policy using rejection sampling.
We also propose a unified framework that enhances the loss functions used in both Sequence Likelihood (SLiC) and Direct Preference Optimization (DPO) from a preference modeling standpoint.
arXiv Detail & Related papers (2023-09-13T01:07:25Z)