Rethinking Model Evaluation as Narrowing the Socio-Technical Gap
- URL: http://arxiv.org/abs/2306.03100v3
- Date: Thu, 29 Jun 2023 02:33:53 GMT
- Title: Rethinking Model Evaluation as Narrowing the Socio-Technical Gap
- Authors: Q. Vera Liao, Ziang Xiao
- Abstract summary: We argue that model evaluation practices must take on a critical task to cope with the challenges and responsibilities brought by this homogenization.
We urge the community to develop evaluation methods based on real-world socio-requirements.
- Score: 34.08410116336628
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent development of generative and large language models (LLMs) poses
new challenges for model evaluation that the research community and industry
are grappling with. While the versatile capabilities of these models ignite
excitement, they also inevitably make a leap toward homogenization: powering a
wide range of applications with a single model, often referred to as
``general-purpose''. In this position paper, we argue that model
evaluation practices must take on a critical task to cope with the challenges
and responsibilities brought by this homogenization: providing valid
assessments for whether and how much human needs in downstream use cases can be
satisfied by the given model (socio-technical gap). By drawing on lessons from
the social sciences, human-computer interaction (HCI), and the
interdisciplinary field of explainable AI (XAI), we urge the community to
develop evaluation methods based on real-world socio-requirements and embrace
diverse evaluation methods, acknowledging the trade-offs between realism with
respect to socio-requirements and the pragmatic costs of conducting the evaluation. By mapping
HCI and current NLG evaluation methods, we identify opportunities for
evaluation methods for LLMs to narrow the socio-technical gap and pose open
questions.
Related papers
- Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)
- The Potential and Challenges of Evaluating Attitudes, Opinions, and Values in Large Language Models [28.743404185915697]
This paper provides a comprehensive overview of recent work on the evaluation of Attitudes, Opinions, and Values (AOVs) in Large Language Models (LLMs).
By doing so, we address the potential and challenges with respect to understanding the model, human-AI alignment, and downstream applications in the social sciences.
arXiv Detail & Related papers (2024-06-16T22:59:18Z)
- Collective Constitutional AI: Aligning a Language Model with Public Input [20.95333081841239]
There is growing consensus that language model (LM) developers should not be the sole deciders of LM behavior.
We present Collective Constitutional AI (CCAI): a multi-stage process for sourcing and integrating public input into LMs.
We demonstrate the real-world practicality of this approach by creating what is, to our knowledge, the first LM fine-tuned with collectively sourced public input.
arXiv Detail & Related papers (2024-06-12T02:20:46Z)
- Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence [5.147767778946168]
We critically assess 23 state-of-the-art Large Language Model (LLM) benchmarks.
Our research uncovered significant limitations, including biases, difficulties in measuring genuine reasoning, adaptability, implementation inconsistencies, prompt engineering complexity, diversity, and the overlooking of cultural and ideological norms.
arXiv Detail & Related papers (2024-02-15T11:08:10Z)
- AntEval: Evaluation of Social Interaction Competencies in LLM-Driven Agents [65.16893197330589]
Large Language Models (LLMs) have demonstrated their ability to replicate human behaviors across a wide range of scenarios.
However, their capability in handling complex, multi-character social interactions has yet to be fully explored.
We introduce the Multi-Agent Interaction Evaluation Framework (AntEval), encompassing a novel interaction framework and evaluation methods.
arXiv Detail & Related papers (2024-01-12T11:18:00Z)
- Post Turing: Mapping the landscape of LLM Evaluation [22.517544562890663]
This paper traces the historical trajectory of Large Language Models (LLMs) evaluations, from the foundational questions posed by Alan Turing to the modern era of AI research.
We emphasize the pressing need for a unified evaluation system, given the broader societal implications of these models.
This work serves as a call for the AI community to collaboratively address the challenges of LLM evaluation, ensuring their reliability, fairness, and societal benefit.
arXiv Detail & Related papers (2023-11-03T17:24:50Z)
- Collaborative Evaluation: Exploring the Synergy of Large Language Models and Humans for Open-ended Generation Evaluation [71.76872586182981]
Large language models (LLMs) have emerged as a scalable and cost-effective alternative to human evaluations.
We propose a Collaborative Evaluation pipeline, CoEval, involving the design of a checklist of task-specific criteria and the detailed evaluation of texts; a generic sketch of such a checklist-driven loop follows this entry.
arXiv Detail & Related papers (2023-10-30T17:04:35Z)
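As a rough illustration of the checklist idea (not the authors' CoEval implementation), the Python sketch below scores one generated text against a small set of hypothetical criteria: an LLM scorer proposes a score per criterion and a human may override any of them. The criterion names, the llm_score callable, and the aggregation are assumptions for exposition.

```python
# A minimal sketch, assuming a checklist-driven LLM + human evaluation loop.
# The criteria, the llm_score stub, and the human-override step are illustrative
# assumptions; this is not the CoEval implementation from the paper.
from typing import Callable, Dict, Optional

CHECKLIST = ["fluency", "factual consistency", "coverage of the prompt"]  # hypothetical criteria


def evaluate_text(
    text: str,
    llm_score: Callable[[str, str], float],               # maps (text, criterion) to a 1-5 score
    human_overrides: Optional[Dict[str, float]] = None,   # sparse human corrections per criterion
) -> Dict[str, float]:
    """Score one generated text against each checklist criterion.

    The LLM proposes a score per criterion; a human reviewer may override any
    of them. The result keeps per-criterion scores plus a simple mean.
    """
    human_overrides = human_overrides or {}
    scores = {c: human_overrides.get(c, llm_score(text, c)) for c in CHECKLIST}
    scores["mean"] = sum(scores.values()) / len(CHECKLIST)
    return scores


if __name__ == "__main__":
    # Toy LLM scorer; in practice this would prompt a model with the criterion.
    toy_llm = lambda text, criterion: 4.0
    print(evaluate_text("An example generated summary.", toy_llm, {"fluency": 3.0}))
```

In a real pipeline the overrides would come from a detailed human review step, and the checklist itself would be designed per task rather than fixed in code.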
- Survey of Social Bias in Vision-Language Models [65.44579542312489]
This survey aims to provide researchers with a high-level insight into the similarities and differences of social bias studies in pre-trained models across NLP, CV, and VL.
The findings and recommendations presented here can benefit the ML community, fostering the development of fairer and less biased AI models.
arXiv Detail & Related papers (2023-09-24T15:34:56Z)
- Training Socially Aligned Language Models on Simulated Social Interactions [99.39979111807388]
Social alignment in AI systems aims to ensure that these models behave according to established societal values.
Current language models (LMs) are trained to rigidly replicate their training corpus in isolation.
This work presents a novel training paradigm that permits LMs to learn from simulated social interactions.
arXiv Detail & Related papers (2023-05-26T14:17:36Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning; a generic sketch of off-policy evaluation follows this entry.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
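For readers unfamiliar with off-policy evaluation (OPE), the sketch below shows the textbook inverse-propensity-scoring estimator that this family of methods builds on: interactions logged under a behavior policy are reweighted to estimate the expected human score of a target policy, with no new human interaction. It is a generic illustration under simplifying assumptions (known behavior-policy probabilities, per-turn human scores), not the ENIGMA estimator itself.

```python
# A minimal sketch of generic off-policy evaluation via inverse propensity
# scoring (IPS). Assumptions for exposition: per-turn human scores and known
# behavior-policy probabilities. This is NOT the ENIGMA estimator.
from dataclasses import dataclass
from typing import List


@dataclass
class LoggedTurn:
    """One pre-collected dialog turn with probabilities under both policies."""
    p_behavior: float  # probability the logging (behavior) policy produced this response
    p_target: float    # probability the evaluated (target) policy would produce it
    score: float       # human evaluation score recorded for this turn


def ips_estimate(log: List[LoggedTurn]) -> float:
    """Estimate the target policy's mean human score from logged data only."""
    weighted = [(t.p_target / t.p_behavior) * t.score for t in log]
    return sum(weighted) / len(log)


if __name__ == "__main__":
    # Toy log of three turns collected from an older dialog system.
    log = [
        LoggedTurn(p_behavior=0.5, p_target=0.4, score=4.0),
        LoggedTurn(p_behavior=0.2, p_target=0.3, score=2.0),
        LoggedTurn(p_behavior=0.3, p_target=0.3, score=5.0),
    ]
    print(f"Estimated mean human score under the target policy: {ips_estimate(log):.2f}")
```

IPS is the simplest OPE estimator and becomes high-variance when the two policies diverge; practical frameworks in this space rely on more sophisticated estimators.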