Prompting is Not All You Need! Evaluating LLM Agent Simulation Methodologies with Real-World Online Customer Behavior Data
- URL: http://arxiv.org/abs/2503.20749v6
- Date: Thu, 05 Jun 2025 21:33:15 GMT
- Title: Prompting is Not All You Need! Evaluating LLM Agent Simulation Methodologies with Real-World Online Customer Behavior Data
- Authors: Yuxuan Lu, Jing Huang, Yan Han, Bingsheng Yao, Sisong Bei, Jiri Gesi, Yaochen Xie, Zheshen, Wang, Qi He, Dakuo Wang,
- Abstract summary: We focus on evaluating LLM's objective accuracy'' rather than the subjective believability'' in simulating human behavior.<n>We present the first comprehensive evaluation of state-of-the-art LLMs on the task of web shopping action generation.
- Score: 62.61900377170456
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research shows that LLMs can simulate ``believable'' human behaviors to power LLM agents via prompt-only methods. In this work, we focus on evaluating LLM's objective ``accuracy'' rather than the subjective ``believability'' in simulating human behavior, leveraging a large-scale, real-world dataset collected from customers' online shopping actions. We present the first comprehensive evaluation of state-of-the-art LLMs (e.g., DeepSeek-R1, Llama, and Claude) on the task of web shopping action generation. Our results show that out-of-the-box LLM-generated actions are often misaligned with actual human behavior, whereas fine-tuning LLMs on real-world behavioral data substantially improves their ability to generate accurate actions compared to prompt-only methods. Furthermore, incorporating synthesized reasonings into model training leads to additional performance gains, demonstrating the value of explicit rationale in behavior modeling. This work evaluates state-of-the-art LLMs in behavior simulation and provides actionable insights into how real-world action data can enhance the fidelity of LLM agents.
Related papers
- Improving LLM Agent Planning with In-Context Learning via Atomic Fact Augmentation and Lookahead Search [48.348209577994865]
Large Language Models (LLMs) are increasingly capable but often require significant guidance or extensive interaction history to perform effectively in complex, interactive environments.<n>We introduce a novel LLM agent framework that enhances planning capabilities through in-context learning.<n>Our agent learns to extract task-critical atomic facts'' from its interaction trajectories.
arXiv Detail & Related papers (2025-06-10T18:36:31Z) - Can LLMs Simulate Personas with Reversed Performance? A Benchmark for Counterfactual Instruction Following [12.668201122427101]
Large Language Models (LLMs) are increasingly widely used to simulate personas in virtual environments.<n>We show that even state-of-the-art LLMs cannot simulate personas with reversed performance.
arXiv Detail & Related papers (2025-04-08T22:00:32Z) - Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks.
However, they still struggle with problems requiring multi-step decision-making and environmental feedback.
We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z) - Can LLMs Simulate Social Media Engagement? A Study on Action-Guided Response Generation [51.44040615856536]
This paper analyzes large language models' ability to simulate social media engagement through action guided response generation.
We benchmark GPT-4o-mini, O1-mini, and DeepSeek-R1 in social media engagement simulation regarding a major societal event.
arXiv Detail & Related papers (2025-02-17T17:43:08Z) - Clear Minds Think Alike: What Makes LLM Fine-tuning Robust? A Study of Token Perplexity [61.48338027901318]
We show that fine-tuning with LLM-generated data improves target task performance and reduces out-of-domain degradation.
This is the first mechanistic explanation for the superior OOD robustness conferred by LLM-generated training data.
arXiv Detail & Related papers (2025-01-24T08:18:56Z) - Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse RL [7.988692259455583]
Large language models (LLMs) trained with Reinforcement Learning from Human Feedback have demonstrated remarkable capabilities, but their underlying reward functions and decision-making processes remain opaque.
This paper introduces a novel approach to interpreting LLMs by applying inverse reinforcement learning (IRL) to recover their implicit reward functions.
We conduct experiments on toxicity-aligned LLMs of varying sizes, extracting reward models that achieve up to 80.40% accuracy in predicting human preferences.
arXiv Detail & Related papers (2024-10-16T12:14:25Z) - Can LLMs Reliably Simulate Human Learner Actions? A Simulation Authoring Framework for Open-Ended Learning Environments [1.4999444543328293]
Simulating learner actions helps stress-test open-ended interactive learning environments and prototype new adaptations before deployment.
We propose Hyp-Mix, a simulation authoring framework that allows experts to develop and evaluate simulations by combining testable hypotheses about learner behavior.
arXiv Detail & Related papers (2024-10-03T00:25:40Z) - On the limits of agency in agent-based models [13.130587222524305]
Agent-based modeling offers powerful insights into complex systems, but its practical utility has been limited by computational constraints.
Recent advancements in large language models (LLMs) could enhance ABMs with adaptive agents, but their integration into large-scale simulations remains challenging.
We present LLM archetypes, a technique that balances behavioral complexity with computational efficiency, allowing for nuanced agent behavior in large-scale simulations.
arXiv Detail & Related papers (2024-09-14T04:17:24Z) - Knowledgeable Agents by Offline Reinforcement Learning from Large Language Model Rollouts [10.929547354171723]
This paper introduces Knowledgeable Agents from Language Model Rollouts (KALM)
It extracts knowledge from large language models (LLMs) in the form of imaginary rollouts that can be easily learned by the agent through offline reinforcement learning methods.
It achieves a success rate of 46% in executing tasks with unseen goals, substantially surpassing the 26% success rate achieved by baseline methods.
arXiv Detail & Related papers (2024-04-14T13:19:40Z) - Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach [64.42462708687921]
Evaluations have revealed that factors such as scaling, training types, architectures and other factors profoundly impact the performance of LLMs.
Our study embarks on a thorough re-examination of these LLMs, targeting the inadequacies in current evaluation methods.
This includes the application of ANOVA, Tukey HSD tests, GAMM, and clustering technique.
arXiv Detail & Related papers (2024-03-22T14:47:35Z) - How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z) - Systematic Biases in LLM Simulations of Debates [12.933509143906141]
We study the limitations of Large Language Models in simulating human interactions.<n>Our findings indicate a tendency for LLM agents to conform to the model's inherent social biases.<n>These results underscore the need for further research to develop methods that help agents overcome these biases.
arXiv Detail & Related papers (2024-02-06T14:51:55Z) - Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN)
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z) - How Far Are LLMs from Believable AI? A Benchmark for Evaluating the Believability of Human Behavior Simulation [46.42384207122049]
We design SimulateBench to evaluate the believability of large language models (LLMs) when simulating human behaviors.
Based on SimulateBench, we evaluate the performances of 10 widely used LLMs when simulating characters.
arXiv Detail & Related papers (2023-12-28T16:51:11Z) - On Learning to Summarize with Large Language Models as References [101.79795027550959]
Large language models (LLMs) are favored by human annotators over the original reference summaries in commonly used summarization datasets.
We study an LLM-as-reference learning setting for smaller text summarization models to investigate whether their performance can be substantially improved.
arXiv Detail & Related papers (2023-05-23T16:56:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.