Less is More: Benchmarking LLM Based Recommendation Agents
- URL: http://arxiv.org/abs/2601.20316v1
- Date: Wed, 28 Jan 2026 07:08:51 GMT
- Title: Less is More: Benchmarking LLM Based Recommendation Agents
- Authors: Kargi Chauhan, Mahalakshmi Venkateswarlu
- Abstract summary: Large Language Models (LLMs) are increasingly deployed for personalized product recommendations. We challenge this assumption through a systematic benchmark of four state-of-the-art LLMs. Experiments with 50 users in a within-subject design reveal no significant quality improvement with increased context length.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are increasingly deployed for personalized product recommendations, with practitioners commonly assuming that longer user purchase histories lead to better predictions. We challenge this assumption through a systematic benchmark of four state-of-the-art LLMs (GPT-4o-mini, DeepSeek-V3, Qwen2.5-72B, and Gemini 2.5 Flash) across context lengths ranging from 5 to 50 items using the REGEN dataset. Surprisingly, our experiments with 50 users in a within-subject design reveal no significant quality improvement with increased context length. Quality scores remain flat across all conditions (0.17–0.23). Our findings have significant practical implications: practitioners can reduce inference costs by approximately 88% by using short context (5–10 items) instead of longer histories (50 items), without sacrificing recommendation quality. We also analyze latency patterns across providers and find model-specific behaviors that inform deployment decisions. This work challenges the existing "more context is better" paradigm and provides actionable guidelines for cost-effective LLM-based recommendation systems.
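The cost claim in the abstract can be made concrete with a small back-of-the-envelope calculation. The sketch below is not the paper's cost model: it assumes that prompt cost scales linearly with the number of history items plus a fixed prompt overhead, and the function name and the token counts (`tokens_per_item`, `overhead_tokens`) are hypothetical values chosen only for illustration.

```python
def input_cost_reduction(short_items, long_items, tokens_per_item, overhead_tokens=0):
    """Fractional reduction in prompt tokens when a history of `long_items`
    is truncated to `short_items`.

    Assumption (not from the paper): prompt size is a fixed overhead plus a
    constant number of tokens per history item, and cost is proportional to
    prompt size.
    """
    if tokens_per_item <= 0 or overhead_tokens < 0:
        raise ValueError("token counts must be positive (overhead may be zero)")
    if not 0 <= short_items <= long_items or long_items == 0:
        raise ValueError("expected 0 <= short_items <= long_items and long_items > 0")

    short_tokens = overhead_tokens + short_items * tokens_per_item
    long_tokens = overhead_tokens + long_items * tokens_per_item
    return 1.0 - short_tokens / long_tokens


if __name__ == "__main__":
    # Hypothetical figures: 40 tokens per purchased item, 100 tokens of instructions.
    for k in (5, 10):
        no_overhead = input_cost_reduction(k, 50, tokens_per_item=40)
        with_overhead = input_cost_reduction(k, 50, tokens_per_item=40, overhead_tokens=100)
        print(f"{k} of 50 items: {no_overhead:.1%} (no overhead), "
              f"{with_overhead:.1%} (100-token overhead)")
```

Under the purely proportional assumption (no overhead), keeping 5 of 50 items removes 90% of the prompt tokens and keeping 10 of 50 removes 80%, which brackets the roughly 88% figure reported in the abstract. Any fixed overhead, output tokens, or provider-specific pricing would shift these numbers, so measured costs should take precedence over this estimate.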
Related papers
- Benchmarking and Improving LLM Robustness for Personalized Generation [42.26075952121524]
We define a model as robust if its responses are both factually accurate and align with the user preferences. Our work highlights critical gaps in current evaluation practices and introduces tools and metrics to support more reliable, user-aligned deployments.
arXiv Detail & Related papers (2025-09-18T13:56:14Z)
- Can LLMs Outshine Conventional Recommenders? A Comparative Evaluation [33.031903907256606]
We introduce RecBench, which evaluates two primary recommendation tasks, i.e., click-through rate prediction (CTR) and sequential recommendation (SeqRec). Our experiments cover up to 17 large models and are conducted across five diverse datasets from fashion, news, video, books, and music domains. Our findings indicate that LLM-based recommenders outperform conventional recommenders, achieving up to a 5% AUC improvement in the CTR scenario and up to a 170% NDCG@10 improvement in the SeqRec scenario.
arXiv Detail & Related papers (2025-03-07T15:05:23Z)
- EACO: Enhancing Alignment in Multimodal LLMs via Critical Observation [58.546205554954454]
We propose Enhancing Alignment in MLLMs via Critical Observation (EACO). EACO aligns MLLMs by self-generated preference data using only 5k images economically. EACO reduces the overall hallucinations by 65.6% on HallusionBench and improves the reasoning ability by 21.8% on MME-Cognition.
arXiv Detail & Related papers (2024-12-06T09:59:47Z)
- Large Language Models Can Self-Improve in Long-context Reasoning [100.52886241070907]
Large language models (LLMs) have achieved substantial progress in processing long contexts but still struggle with long-context reasoning.
We propose an approach specifically designed for this purpose.
This approach achieves superior performance compared to prior approaches that depend on data produced by human experts or advanced models.
arXiv Detail & Related papers (2024-11-12T19:53:00Z)
- Beyond Utility: Evaluating LLM as Recommender [47.97889161958022]
We explore four new evaluation dimensions and propose a multidimensional evaluation framework.
New evaluation dimensions include: 1) history length sensitivity, 2) candidate position bias, 3) generation-involved performance, and 4) hallucinations.
Using this multidimensional evaluation framework, along with traditional aspects, we evaluate the performance of seven LLM-based recommenders.
arXiv Detail & Related papers (2024-11-01T03:09:28Z)
- GIVE: Structured Reasoning of Large Language Models with Knowledge Graph Inspired Veracity Extrapolation [108.2008975785364]
Graph Inspired Veracity Extrapolation (GIVE) is a novel reasoning method that merges parametric and non-parametric memories to improve accurate reasoning with minimal external input. GIVE guides the LLM agent to select the most pertinent expert data (observe), engage in query-specific divergent thinking (reflect), and then synthesize this information to produce the final output (speak).
arXiv Detail & Related papers (2024-10-11T03:05:06Z)
- How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts [54.07541591018305]
We present MAD-Bench, a benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, count of objects, and spatial relationship.
We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4v, Reka, Gemini-Pro, to open-sourced models, such as LLaVA-NeXT and MiniCPM-Llama3.
While GPT-4o achieves 82.82% accuracy on MAD-Bench, the accuracy of any other model in our experiments ranges from 9% to 50%.
arXiv Detail & Related papers (2024-02-20T18:31:27Z)
- Uncertainty-Aware Explainable Recommendation with Large Language Models [15.229417987212631]
We develop a model that utilizes the ID vectors of user and item inputs as prompts for GPT-2.
We employ a joint training mechanism within a multi-task learning framework to optimize both the recommendation task and explanation task.
Our method achieves 1.59 DIV, 0.57 USR and 0.41 FCR on the Yelp, TripAdvisor and Amazon datasets, respectively.
arXiv Detail & Related papers (2024-01-31T14:06:26Z)
- What Are We Optimizing For? A Human-centric Evaluation of Deep Learning-based Movie Recommenders [12.132920692489911]
We conduct a human-centric evaluation case study of four leading DL-RecSys models in the movie domain.
We test how different DL-RecSys models perform in personalized recommendation generation by conducting a survey study with 445 real active users.
We find some DL-RecSys models to be superior in recommending novel and unexpected items and weaker in diversity, trustworthiness, transparency, accuracy, and overall user satisfaction.
arXiv Detail & Related papers (2024-01-21T23:56:57Z)
- L-Eval: Instituting Standardized Evaluation for Long Context Language Models [91.05820785008527]
We propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs).
We build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs.
Results show that popular n-gram matching metrics generally do not correlate well with human judgment.
arXiv Detail & Related papers (2023-07-20T17:59:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.