A User-Centric Benchmark for Evaluating Large Language Models
- URL: http://arxiv.org/abs/2404.13940v2
- Date: Tue, 23 Apr 2024 01:58:24 GMT
- Title: A User-Centric Benchmark for Evaluating Large Language Models
- Authors: Jiayin Wang, Fengran Mo, Weizhi Ma, Peijie Sun, Min Zhang, Jian-Yun Nie
- Abstract summary: Large Language Models (LLMs) are essential tools for collaborating with users on a wide range of tasks.
Few benchmarks have covered the intended utilization of LLMs by real users.
We benchmark 10 LLM services on their efficacy in satisfying user needs.
- Score: 39.538411572029005
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are essential tools for collaborating with users on a wide range of tasks, so evaluating how well they serve users' needs in real-world scenarios is important. While many benchmarks have been created, they mainly focus on specific, predefined model abilities; few cover how real users actually intend to use LLMs. To address this oversight, we propose benchmarking LLMs from a user perspective in both dataset construction and evaluation design. First, we collect 1846 real-world use cases involving 15 LLMs from a user study with 712 participants from 23 countries. These self-reported cases form the User Reported Scenarios (URS) dataset, categorized into 7 user intents. Second, on this authentic multi-cultural dataset, we benchmark 10 LLM services on their efficacy in satisfying user needs. Third, we show that our benchmark scores align well with user-reported experience in LLM interactions across diverse intents; both highlight that subjective scenarios are often overlooked. In conclusion, our study proposes benchmarking LLMs from a user-centric perspective, aiming to facilitate evaluations that better reflect real user needs. The benchmark dataset and code are available at https://github.com/Alice1998/URS.
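The URS repository itself is not described in this listing, so the following is only a minimal sketch of the kind of per-intent aggregation the abstract describes: averaging judged response quality for one LLM service over the 7 user-intent categories. The JSONL layout, the field names ("intent", "score"), and the file name are assumptions made for illustration, not the actual schema of https://github.com/Alice1998/URS.

```python
# Hypothetical sketch: aggregate judged response scores per user intent.
# Field names and file layout are assumed, not taken from the URS repository.
import json
from collections import defaultdict

def aggregate_by_intent(path: str) -> dict[str, float]:
    """Average judge scores per user-intent category for one LLM service."""
    totals, counts = defaultdict(float), defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            intent = record["intent"]          # one of the 7 intent categories (assumed field)
            totals[intent] += record["score"]  # judge score for this response (assumed field)
            counts[intent] += 1
    return {intent: totals[intent] / counts[intent] for intent in totals}

if __name__ == "__main__":
    # Hypothetical file holding one service's scored responses.
    print(aggregate_by_intent("urs_scored_responses.jsonl"))
```

Averaging per intent rather than over the whole dataset keeps less common or more subjective intents from being washed out by the dominant ones, in line with the abstract's point that subjective scenarios are often overlooked.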
Related papers
- Aligning LLMs with Individual Preferences via Interaction [51.72200436159636]
We train large language models (LLMs) that can "interact to align" with individual user preferences.
We develop a multi-turn preference dataset containing 3K+ multi-turn conversations in tree structures.
For evaluation, we establish the ALOE benchmark, consisting of 100 carefully selected examples and well-designed metrics to measure the customized alignment performance during conversations.
arXiv Detail & Related papers (2024-10-04T17:48:29Z) - Are Large Language Models In-Context Personalized Summarizers? Get an iCOPERNICUS Test Done! [14.231110627461]
Large Language Models (LLMs) have succeeded considerably in In-Context-Learning (ICL) based summarization.
We propose a novel In-COntext PERsonalization learNIng sCrUtiny of Summarization capability in LLMs that uses EGISES as a comparative measure.
We evaluate 17 state-of-the-art LLMs based on their reported ICL performances and observe that 15 models' ICPL degrades when probed with richer prompts.
arXiv Detail & Related papers (2024-09-30T18:45:00Z) - PersonalLLM: Tailoring LLMs to Individual Preferences [11.717169516971856]
We present a public benchmark, PersonalLLM, focusing on adapting LLMs to provide maximal benefits for a particular user.
We curate open-ended prompts paired with many high-quality answers over which users would be expected to display heterogeneous latent preferences.
Our dataset and generated personalities offer an innovative testbed for developing personalization algorithms.
arXiv Detail & Related papers (2024-09-30T13:55:42Z) - TALEC: Teach Your LLM to Evaluate in Specific Domain with In-house Criteria by Criteria Division and Zero-shot Plus Few-shot [2.186726107112913]
We propose a model-based evaluation method: TALEC.
It allows users to flexibly set their own evaluation criteria and uses in-context learning (ICL) to teach the judge model these in-house criteria.
TALEC accurately reflects human preferences, achieving a correlation of over 80% with human judgments (a minimal sketch of this judge-vs-human correlation check follows the related-papers list).
arXiv Detail & Related papers (2024-06-25T10:02:42Z) - METAL: Towards Multilingual Meta-Evaluation [12.852595634767901]
This study proposes a framework for an end-to-end assessment of Large Language Models (LLMs) as evaluators in multilingual scenarios.
We create a dataset covering 10 languages containing native speaker judgments for the task of summarization.
We compare the performance of LLM-based evaluators created using GPT-3.5-Turbo, GPT-4, and PaLM2.
arXiv Detail & Related papers (2024-04-02T06:14:54Z) - Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate [74.06294042304415]
We propose ScaleEval, an agent-debate-assisted meta-evaluation framework.
We release the code for our framework, which is publicly available on GitHub.
arXiv Detail & Related papers (2024-01-30T07:03:32Z) - MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria [49.500322937449326]
Multimodal large language models (MLLMs) have broadened the scope of AI applications.
Existing automatic evaluation methodologies for MLLMs are mainly limited to evaluating queries without considering user experience.
We propose a new evaluation paradigm for MLLMs: evaluating MLLMs with per-sample criteria, using a potent MLLM as the judge.
arXiv Detail & Related papers (2023-11-23T12:04:25Z) - MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback [78.60644407028022]
We introduce MINT, a benchmark that evaluates large language models' ability to solve tasks with multi-turn interactions.
LLMs generally benefit from tools and language feedback, with performance gains of 1-8% for each turn of tool use.
In the LLMs evaluated, supervised instruction-finetuning (SIFT) and reinforcement learning from human feedback (RLHF) generally hurt multi-turn capabilities.
arXiv Detail & Related papers (2023-09-19T15:25:42Z) - L-Eval: Instituting Standardized Evaluation for Long Context Language Models [91.05820785008527]
We propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs).
We build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs.
Results show that popular n-gram matching metrics generally cannot correlate well with human judgment.
arXiv Detail & Related papers (2023-07-20T17:59:41Z) - Do LLMs Understand User Preferences? Evaluating LLMs On User Rating Prediction [15.793007223588672]
Large Language Models (LLMs) have demonstrated exceptional capabilities in generalizing to new tasks in a zero-shot or few-shot manner.
We investigate various LLMs of different sizes, ranging from 250M to 540B parameters, and evaluate their performance in zero-shot, few-shot, and fine-tuning scenarios.
arXiv Detail & Related papers (2023-05-10T21:43:42Z)
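Several of the entries above (e.g. TALEC and METAL) rest on the same meta-evaluation step: correlating an LLM judge's scores with human judgments. The sketch below illustrates that check with placeholder numbers; the scores and the plain Pearson formulation are illustrative assumptions, not figures or code from any of the papers.

```python
# Hypothetical sketch: how well does an LLM judge track human annotators?
from math import sqrt

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

judge_scores = [4.0, 3.5, 2.0, 5.0, 1.5]  # placeholder LLM-judge scores
human_scores = [4.5, 3.0, 2.5, 5.0, 2.0]  # placeholder human-annotator scores
print(f"judge-human correlation: {pearson(judge_scores, human_scores):.2f}")
```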
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences.