Enhancing Rating Prediction with Off-the-Shelf LLMs Using In-Context User Reviews
- URL: http://arxiv.org/abs/2510.00449v1
- Date: Wed, 01 Oct 2025 03:04:20 GMT
- Title: Enhancing Rating Prediction with Off-the-Shelf LLMs Using In-Context User Reviews
- Authors: Koki Ryu, Hitomi Yanaka,
- Abstract summary: Likert-scale rating prediction is a regression task that requires both language and mathematical reasoning to be solved effectively.<n>This study investigates the performance of off-the-shelf LLMs on rating prediction, providing different in-context information.<n>We find that user-written reviews significantly improve the rating prediction performance of LLMs.
- Score: 16.394933051332657
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Personalizing the outputs of large language models (LLMs) to align with individual user preferences is an active research area. However, previous studies have mainly focused on classification or ranking tasks and have not considered Likert-scale rating prediction, a regression task that requires both language and mathematical reasoning to be solved effectively. This task has significant industrial applications, but the utilization of LLMs remains underexplored, particularly regarding the capabilities of off-the-shelf LLMs. This study investigates the performance of off-the-shelf LLMs on rating prediction, providing different in-context information. Through comprehensive experiments with eight models across three datasets, we demonstrate that user-written reviews significantly improve the rating prediction performance of LLMs. This result is comparable to traditional methods like matrix factorization, highlighting the potential of LLMs as a promising solution for the cold-start problem. We also find that the reviews for concrete items are more effective than general preference descriptions that are not based on any specific item. Furthermore, we discover that prompting LLMs to first generate a hypothetical review enhances the rating prediction performance. Our code is available at https://github.com/ynklab/rating-prediction-with-reviews.
Related papers
- Nonparametric LLM Evaluation from Preference Data [86.96268870461472]
We propose a nonparametric statistical framework, DMLEval, for comparing and ranking large language models (LLMs) from preference data.<n>Our framework provides practitioners with powerful, state-of-the-art methods for comparing or ranking LLMs.
arXiv Detail & Related papers (2026-01-29T15:00:07Z) - Positional Biases Shift as Inputs Approach Context Window Limits [57.00239097102958]
The LiM effect is strongest when inputs occupy up to 50% of a model's context window.<n>We observe a distance-based bias, where model performance is better when relevant information is closer to the end of the input.
arXiv Detail & Related papers (2025-08-10T20:40:24Z) - Latent Factor Models Meets Instructions: Goal-conditioned Latent Factor Discovery without Task Supervision [50.45597801390757]
Instruct-LF is a goal-oriented latent factor discovery system.<n>It integrates instruction-following ability with statistical models to handle noisy datasets.
arXiv Detail & Related papers (2025-02-21T02:03:08Z) - Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge [84.34545223897578]
Despite their excellence in many domains, potential issues are under-explored, undermining their reliability and the scope of their utility.
We identify 12 key potential biases and propose a new automated bias quantification framework-CALM- which quantifies and analyzes each type of bias in LLM-as-a-Judge.
Our work highlights the need for stakeholders to address these issues and remind users to exercise caution in LLM-as-a-Judge applications.
arXiv Detail & Related papers (2024-10-03T17:53:30Z) - FineSurE: Fine-grained Summarization Evaluation using LLMs [22.62504593575933]
FineSurE is a fine-grained evaluator specifically tailored for the summarization task using large language models (LLMs)
It also employs completeness and conciseness criteria, in addition to faithfulness, enabling multi-dimensional assessment.
arXiv Detail & Related papers (2024-07-01T02:20:28Z) - Bayesian Statistical Modeling with Predictors from LLMs [5.5711773076846365]
State of the art large language models (LLMs) have shown impressive performance on a variety of benchmark tasks.
This raises questions about the human-likeness of LLM-derived information.
arXiv Detail & Related papers (2024-06-13T11:33:30Z) - DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.<n>The question of how reliable these evaluators are has emerged as a crucial research question.<n>We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z) - Large Language Models are Inconsistent and Biased Evaluators [2.136983452580014]
We show that Large Language Models (LLMs) are biased evaluators as they exhibit familiarity bias and show skewed distributions of ratings.
We also found that LLMs are inconsistent evaluators, showing low "inter-sample" agreement and sensitivity to prompt differences that are insignificant to human understanding of text quality.
arXiv Detail & Related papers (2024-05-02T20:42:28Z) - PRE: A Peer Review Based Large Language Model Evaluator [14.585292530642603]
Existing paradigms rely on either human annotators or model-based evaluators to evaluate the performance of LLMs.
We propose a novel framework that can automatically evaluate LLMs through a peer-review process.
arXiv Detail & Related papers (2024-01-28T12:33:14Z) - On Learning to Summarize with Large Language Models as References [101.79795027550959]
Large language models (LLMs) are favored by human annotators over the original reference summaries in commonly used summarization datasets.
We study an LLM-as-reference learning setting for smaller text summarization models to investigate whether their performance can be substantially improved.
arXiv Detail & Related papers (2023-05-23T16:56:04Z) - Do LLMs Understand User Preferences? Evaluating LLMs On User Rating
Prediction [15.793007223588672]
Large Language Models (LLMs) have demonstrated exceptional capabilities in generalizing to new tasks in a zero-shot or few-shot manner.
We investigate various LLMs in different sizes, ranging from 250M to 540B parameters and evaluate their performance in zero-shot, few-shot, and fine-tuning scenarios.
arXiv Detail & Related papers (2023-05-10T21:43:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.