Towards Personalized Evaluation of Large Language Models with An
Anonymous Crowd-Sourcing Platform
- URL: http://arxiv.org/abs/2403.08305v1
- Date: Wed, 13 Mar 2024 07:31:20 GMT
- Title: Towards Personalized Evaluation of Large Language Models with An
Anonymous Crowd-Sourcing Platform
- Authors: Mingyue Cheng, Hao Zhang, Jiqian Yang, Qi Liu, Li Li, Xin Huang, Liwei
Song, Zhi Li, Zhenya Huang, Enhong Chen
- Abstract summary: We propose a novel anonymous crowd-sourcing evaluation platform, BingJian, for large language models.
Through this platform, users have the opportunity to submit their questions, testing the models on a personalized and potentially broader range of capabilities.
- Score: 64.76104135495576
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluation plays a pivotal role in enhancing the capabilities of
large language models, and numerous evaluation methods have been proposed in
this area. Despite their effectiveness, existing works mainly focus on
assessing objective questions, overlooking the evaluation of subjective
questions, which are extremely common for large language models. Additionally,
these methods predominantly rely on centralized datasets, with question banks
concentrated within the evaluation platforms themselves. Moreover, the
evaluation processes employed by these platforms often overlook personalized
factors, neglecting the individual characteristics of both the evaluators and
the models being evaluated. To address these limitations, we propose BingJian,
a novel anonymous crowd-sourcing evaluation platform for large language models
that employs a competitive scoring mechanism in which users rank models based
on their performance. The platform not only supports centralized evaluations
that assess the general capabilities of models but also offers an open
evaluation gateway through which users can submit their own questions, testing
the models on a personalized and potentially broader range of capabilities.
Furthermore, the platform introduces personalized evaluation scenarios,
leveraging various forms of human-computer interaction to assess large
language models in a manner that accounts for individual user preferences and
contexts. A demonstration of BingJian can be accessed at
https://github.com/Mingyue-Cheng/Bingjian.
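The abstract describes a competitive scoring mechanism in which crowd-sourced
users compare models head-to-head, but it does not spell out the rating rule.
Below is a minimal sketch assuming an Elo-style pairwise update of the kind
commonly used in crowd-sourced LLM arenas; the K factor, the 1000-point
starting rating, and the model names are illustrative assumptions rather than
details from the paper.

```python
# Minimal sketch of a crowd-sourced pairwise scoring loop.
# Assumes an Elo-style update; BingJian's actual rating rule is not
# specified in the abstract, so constants and names here are illustrative.
from collections import defaultdict

K = 32  # step size of each rating update (assumed)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that the model rated r_a beats the model rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def record_vote(ratings: dict, model_a: str, model_b: str, outcome: float) -> None:
    """Update both ratings after one user vote.

    outcome: 1.0 if the user preferred model_a, 0.0 if model_b, 0.5 for a tie.
    """
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (outcome - e_a)
    ratings[model_b] += K * ((1.0 - outcome) - (1.0 - e_a))

# Example: three anonymous votes on user-submitted questions.
ratings = defaultdict(lambda: 1000.0)  # every model starts at 1000 (assumed)
for model_a, model_b, outcome in [
    ("model_x", "model_y", 1.0),
    ("model_y", "model_z", 0.5),
    ("model_x", "model_z", 1.0),
]:
    record_vote(ratings, model_a, model_b, outcome)

# Leaderboard, highest rating first.
print(sorted(ratings.items(), key=lambda item: -item[1]))
```

In a scheme like this, each vote on a question moves the two compared models'
ratings toward the observed outcome, so the leaderboard aggregates many
anonymous pairwise judgments into a single ranking.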
Related papers
- PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation [0.0]
We introduce a novel benchmark for evaluating the role-playing capabilities of language models.
The framework consists of three main components: a player model assuming a specific character role, an interrogator model simulating user behavior, and a judge model evaluating conversation quality.
arXiv Detail & Related papers (2024-09-10T19:00:44Z)
- Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy [27.454549324141087]
We propose a novel VQA benchmark based on well-known visual classification datasets.
We also suggest using the semantic hierarchy of the label space to ask automatically generated follow-up questions about the ground-truth category.
Our contributions aim to lay the foundation for more precise and meaningful assessments.
arXiv Detail & Related papers (2024-02-11T18:26:18Z)
- F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods [102.98899881389211]
We propose F-Eval, a bilingual evaluation benchmark for assessing fundamental abilities, including expression, commonsense, and logic.
For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models.
arXiv Detail & Related papers (2024-01-26T13:55:32Z)
- Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation [9.452326973655445]
We find that metric-based methods enhance the efficiency of human evaluations by minimizing the number of required annotations.
We show that our method is effective across widely used model families, reducing instances of indecisive (or "tie") outcomes by up to 54%.
This potential reduction in required human effort positions our approach as a valuable strategy in future large language model evaluations.
arXiv Detail & Related papers (2023-10-22T21:48:51Z)
- EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [70.19437817951673]
We argue that large conditional generative models are hard to judge with simple metrics, since these models are often trained on very large datasets and have multi-aspect abilities.
Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation.
Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmark, in terms of visual qualities, content qualities, motion qualities, and text-video alignment with 17 well-selected objective metrics.
arXiv Detail & Related papers (2023-10-17T17:50:46Z)
- Advancing the Evaluation of Traditional Chinese Language Models: Towards a Comprehensive Benchmark Suite [17.764840326809797]
We propose a novel set of benchmarks that leverage existing English datasets and are tailored to evaluate language models in Traditional Chinese.
These benchmarks encompass a wide range of tasks, including contextual question-answering, summarization, classification, and table understanding.
In this paper, we evaluate the performance of GPT-3.5, Taiwan-LLaMa-v1.0, and Model 7-C, our proprietary model, on these benchmarks.
arXiv Detail & Related papers (2023-09-15T14:52:23Z)
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)
- MMBench: Is Your Multi-modal Model an All-around Player? [114.45702807380415]
We propose MMBench, a benchmark for assessing the multi-modal capabilities of vision-language models.
MMBench is meticulously curated with well-designed quality control schemes.
MMBench incorporates multiple-choice questions in both English and Chinese versions.
arXiv Detail & Related papers (2023-07-12T16:23:09Z)
- SimOAP: Improve Coherence and Consistency in Persona-based Dialogue Generation via Over-sampling and Post-evaluation [54.66399120084227]
Language models trained on large-scale corpora can generate remarkably fluent results in open-domain dialogue.
For the persona-based dialogue generation task, consistency and coherence are great challenges for language models.
A two-stage SimOAP strategy is proposed, i.e., over-sampling and post-evaluation.
arXiv Detail & Related papers (2023-05-18T17:23:00Z)
- Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs [10.453404263936335]
We explore an alternative dialectical evaluation of language models for commonsense reasoning.
The goal of this kind of evaluation is not to obtain an aggregate performance value but to find failures and map the boundaries of the system.
In this paper we conduct some qualitative investigations of this kind of evaluation for the particular case of spatial reasoning.
arXiv Detail & Related papers (2023-04-22T06:28:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.