Rethinking Model Evaluation as Narrowing the Socio-Technical Gap
- URL: http://arxiv.org/abs/2306.03100v4
- Date: Fri, 31 Jan 2025 14:59:17 GMT
- Title: Rethinking Model Evaluation as Narrowing the Socio-Technical Gap
- Authors: Q. Vera Liao, Ziang Xiao
- Abstract summary: We argue that model evaluation practices must take on a critical task to cope with the challenges and responsibilities brought by this homogenization.
We urge the community to develop evaluation methods based on real-world contexts and human requirements.
- Abstract: The recent development of generative large language models (LLMs) poses new challenges for model evaluation that the research community and industry have been grappling with. While the versatile capabilities of these models ignite much excitement, they also inevitably make a leap toward homogenization: powering a wide range of applications with a single, often referred to as "general-purpose", model. In this position paper, we argue that model evaluation practices must take on a critical task to cope with the challenges and responsibilities brought by this homogenization: providing valid assessments for whether and how much human needs in diverse downstream use cases can be satisfied by the given model (the socio-technical gap). By drawing on lessons about improving research realism from the social sciences, human-computer interaction (HCI), and the interdisciplinary field of explainable AI (XAI), we urge the community to develop evaluation methods based on real-world contexts and human requirements, and embrace diverse evaluation methods with an acknowledgment of trade-offs between realisms and pragmatic costs to conduct the evaluation. By mapping HCI and current NLG evaluation methods, we identify opportunities for evaluation methods for LLMs to narrow the socio-technical gap and pose open questions.
Related papers
- Benchmarks as Microscopes: A Call for Model Metrology
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv (2024-07-22)
- The Potential and Challenges of Evaluating Attitudes, Opinions, and Values in Large Language Models
This paper provides a comprehensive overview of recent works on the evaluation of Attitudes, Opinions, and Values (AOVs) in Large Language Models (LLMs).
By doing so, we address the potential and challenges with respect to understanding the model, human-AI alignment, and downstream application in social sciences.
arXiv (2024-06-16)
- Collective Constitutional AI: Aligning a Language Model with Public Input
There is growing consensus that language model (LM) developers should not be the sole deciders of LM behavior.
We present Collective Constitutional AI (CCAI): a multi-stage process for sourcing and integrating public input into LMs.
We demonstrate the real-world practicality of this approach by creating what is, to our knowledge, the first LM fine-tuned with collectively sourced public input.
arXiv (2024-06-12)
- Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence
We critically assess 23 state-of-the-art Large Language Models (LLMs) benchmarks.
Our research uncovered significant limitations, including biases, difficulties in measuring genuine reasoning, adaptability, implementation inconsistencies, prompt engineering complexity, diversity, and the overlooking of cultural and ideological norms.
arXiv (2024-02-15)
- Post Turing: Mapping the landscape of LLM Evaluation
This paper traces the historical trajectory of Large Language Models (LLMs) evaluations, from the foundational questions posed by Alan Turing to the modern era of AI research.
We emphasize the pressing need for a unified evaluation system, given the broader societal implications of these models.
This work serves as a call for the AI community to collaboratively address the challenges of LLM evaluation, ensuring their reliability, fairness, and societal benefit.
arXiv (2023-11-03)
- Survey of Social Bias in Vision-Language Models
This survey aims to provide researchers with a high-level insight into the similarities and differences of social bias studies in pre-trained models across NLP, CV, and VL.
The findings and recommendations presented here can benefit the ML community, fostering the development of fairer and non-biased AI models.
arXiv (2023-09-24)
- ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
We build a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models.
Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments (a minimal debate-style evaluation sketch appears after this list).
arXiv (2023-08-14)
- Training Socially Aligned Language Models on Simulated Social Interactions
Social alignment in AI systems aims to ensure that these models behave according to established societal values.
Current language models (LMs) are trained to rigidly replicate their training corpus in isolation.
This work presents a novel training paradigm that permits LMs to learn from simulated social interactions.
arXiv (2023-05-26)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation (see the off-policy estimation sketch after this list).
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv (2021-02-20)
- Human Factors in Model Interpretability: Industry Practices, Challenges, and Needs
We conduct interviews with industry practitioners to understand how they conceive of and design for interpretability while they plan, build, and use their models.
Based on our results, we differentiate interpretability roles, processes, goals and strategies as they exist within organizations making heavy use of ML models.
The characterization of interpretability work that emerges from our analysis suggests that model interpretability frequently involves cooperation and mental model comparison between people in different roles.
arXiv (2020-04-23)
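To make the multi-agent debate idea in the ChatEval entry above more concrete, below is a minimal Python sketch of a referee panel that debates two candidate responses and then votes. It is an illustration only, not ChatEval's implementation: the `llm` callable, the personas, the prompt wording, and the majority-vote aggregation are all assumptions.

```python
"""Minimal sketch of multi-agent-debate evaluation (illustrative; not ChatEval's code).

The `llm` argument is a hypothetical stand-in for any chat-model API call that maps
a prompt string to a reply string.
"""
from collections import Counter
from typing import Callable, List, Sequence


def debate_evaluate(
    llm: Callable[[str], str],
    question: str,
    response_a: str,
    response_b: str,
    personas: Sequence[str] = ("strict grader", "helpful generalist", "domain expert"),
    rounds: int = 2,
) -> str:
    """Return 'A' or 'B' after a short referee debate followed by a majority vote."""
    transcript: List[str] = []   # arguments exchanged across rounds
    verdicts: List[str] = []     # verdicts from the most recent round
    for _ in range(rounds):
        verdicts = []
        for persona in personas:
            prompt = (
                f"You are a {persona} judging two answers to: {question}\n"
                f"Answer A: {response_a}\nAnswer B: {response_b}\n"
                f"Arguments from the other referees so far: {transcript}\n"
                "Argue briefly, then finish with a line 'VERDICT: A' or 'VERDICT: B'."
            )
            reply = llm(prompt)
            transcript.append(f"{persona}: {reply}")
            verdicts.append("A" if "VERDICT: A" in reply else "B")
    # Aggregate the final round of verdicts by simple majority vote.
    return Counter(verdicts).most_common(1)[0][0]
```

Running multiple rounds lets each referee see the others' arguments before giving its final verdict, which is the part that goes beyond single-judge textual scoring.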
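The off-policy evaluation idea in the ENIGMA entry above can be illustrated with a generic self-normalized importance-sampling estimator over pre-collected dialog logs. This is a sketch of the general technique under assumed data structures, not ENIGMA's actual estimator; `Trajectory` and `target_prob` are hypothetical placeholders.

```python
"""Generic off-policy evaluation sketch (illustrative; not ENIGMA's estimator)."""
from typing import Callable, List, Tuple

# One logged dialog: a list of (state, action, behavior_prob) steps plus the final human score.
Trajectory = Tuple[List[Tuple[str, str, float]], float]


def estimate_human_score(
    logs: List[Trajectory],
    target_prob: Callable[[str, str], float],  # hypothetical: P(action | state) under the policy being evaluated
) -> float:
    """Estimate the target policy's expected human score from logged trajectories only."""
    weighted_scores: List[float] = []
    weights: List[float] = []
    for steps, human_score in logs:
        weight = 1.0
        for state, action, behavior_prob in steps:
            # Reweight each logged step by how much more (or less) likely the target
            # policy is to take that action than the logging policy was.
            weight *= target_prob(state, action) / max(behavior_prob, 1e-8)
        weighted_scores.append(weight * human_score)
        weights.append(weight)
    # Self-normalized importance sampling: dividing by the weight sum keeps the
    # estimate bounded when a few trajectories dominate.
    return sum(weighted_scores) / max(sum(weights), 1e-8)
```

Because everything is computed from the logs, no new human interaction with the target policy is needed, which is the property the ENIGMA abstract highlights.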