Key Considerations for Domain Expert Involvement in LLM Design and Evaluation: An Ethnographic Study
- URL: http://arxiv.org/abs/2602.14357v1
- Date: Mon, 16 Feb 2026 00:21:58 GMT
- Title: Key Considerations for Domain Expert Involvement in LLM Design and Evaluation: An Ethnographic Study
- Authors: Annalisa Szymanski, Oghenemaro Anuyah, Toby Jia-Jun Li, Ronald A. Metoyer
- Abstract summary: Large Language Models (LLMs) are increasingly developed for use in complex professional domains. This paper examines the challenges and trade-offs in LLM development through a 12-week ethnographic study.
- Score: 28.306813921648224
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are increasingly developed for use in complex professional domains, yet little is known about how teams design and evaluate these systems in practice. This paper examines the challenges and trade-offs in LLM development through a 12-week ethnographic study of a team building a pedagogical chatbot. The researcher observed design and evaluation activities and conducted interviews with both developers and domain experts. Analysis revealed four key practices: creating workarounds for data collection, turning to augmentation when expert input was limited, co-developing evaluation criteria with experts, and adopting hybrid expert-developer-LLM evaluation strategies. These practices show how teams made strategic decisions under constraints and demonstrate the central role of domain expertise in shaping the system. Challenges included expert motivation and trust, difficulties structuring participatory design, and questions around ownership and integration of expert knowledge. We propose design opportunities for future LLM development workflows that emphasize AI literacy, transparent consent, and frameworks recognizing evolving expert roles.
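The "hybrid expert-developer-LLM evaluation strategies" noted in the abstract can be pictured as a scoring loop in which an LLM judge rates chatbot responses against expert-authored criteria and routes disagreements back to the domain experts. The sketch below illustrates that general pattern only; it is not the study team's actual pipeline, and the rubric criteria, weights, threshold, and the `llm_judge` stand-in are all illustrative assumptions.

```python
from dataclasses import dataclass, field

# Expert-authored rubric. Criteria and weights are illustrative placeholders,
# not the criteria the study's team co-developed with its experts.
RUBRIC = {
    "pedagogical_accuracy": 0.5,
    "age_appropriate_tone": 0.3,
    "actionable_guidance": 0.2,
}

def llm_judge(response: str, criterion: str) -> float:
    """Stand-in for an LLM-as-judge call. A real pipeline would prompt a
    model to rate `response` against `criterion`; here we return a neutral
    score so the sketch stays runnable."""
    return 0.5

@dataclass
class Evaluation:
    response_id: str
    llm_scores: dict[str, float]
    expert_scores: dict[str, float] = field(default_factory=dict)
    needs_expert_review: bool = False

def hybrid_evaluate(response_id: str, response: str,
                    expert_scores: dict[str, float],
                    disagreement_threshold: float = 0.3) -> Evaluation:
    """Score a response with the LLM judge and escalate any criterion where
    the judge and an expert disagree by more than the threshold."""
    llm_scores = {c: llm_judge(response, c) for c in RUBRIC}
    ev = Evaluation(response_id, llm_scores, dict(expert_scores))
    for criterion in RUBRIC:
        if criterion in expert_scores:
            gap = abs(llm_scores[criterion] - expert_scores[criterion])
            if gap > disagreement_threshold:
                ev.needs_expert_review = True  # route back to the experts
    return ev

# Example: the judge and an expert diverge sharply on one criterion,
# so the response is flagged for human review rather than auto-scored.
result = hybrid_evaluate("resp-001", "Try breaking the task into steps...",
                         {"pedagogical_accuracy": 0.95})
print(result.needs_expert_review)  # True, since |0.5 - 0.95| > 0.3
```

The division of labor mirrors the abstract's framing: experts own the criteria, the LLM judge provides scale, and disagreement routes work back to the humans.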
Related papers
- Expert Preference-based Evaluation of Automated Related Work Generation [54.29459509574242]
We propose GREP, a multi-turn evaluation framework that integrates classical related work evaluation criteria with expert-specific preferences. For better accessibility, we design two variants of GREP: a more precise variant with proprietary LLMs as evaluators, and a cheaper alternative with open-weight LLMs.
arXiv Detail & Related papers (2025-08-11T13:08:07Z) - Two Experts Are All You Need for Steering Thinking: Reinforcing Cognitive Effort in MoE Reasoning Models Without Additional Training [86.70255651945602]
We introduce a novel inference-time steering methodology called Reinforcing Cognitive Experts (RICE). RICE aims to improve reasoning performance without additional training or complex heuristics. Empirical evaluations with leading MoE-based LRMs demonstrate noticeable and consistent improvements in reasoning accuracy, cognitive efficiency, and cross-domain generalization. (A toy illustration of this style of expert steering appears after this list.)
arXiv Detail & Related papers (2025-05-20T17:59:16Z) - Evaluating Machine Expertise: How Graduate Students Develop Frameworks for Assessing GenAI Content [1.967444231154626]
This paper examines how graduate students develop frameworks for evaluating machine-generated expertise in web-based interactions with large language models (LLMs). Our findings reveal that students construct evaluation frameworks shaped by three main factors: professional identity, verification capabilities, and system navigation experience.
arXiv Detail & Related papers (2025-04-24T22:24:14Z) - A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems [93.8285345915925]
Reasoning is a fundamental cognitive process that enables logical inference, problem-solving, and decision-making. With the rapid advancement of large language models (LLMs), reasoning has emerged as a key capability that distinguishes advanced AI systems. We categorize existing methods along two dimensions: (1) Regimes, which define the stage at which reasoning is achieved; and (2) Architectures, which determine the components involved in the reasoning process.
arXiv Detail & Related papers (2025-04-12T01:27:49Z) - What Makes An Expert? Reviewing How ML Researchers Define "Expert" [4.6346970187885885]
We review 112 academic publications that explicitly reference 'expert' and 'expertise'.
We find that expertise is often undefined and forms of knowledge outside of formal education are rarely sought.
We discuss the ways experts are engaged in ML development in relation to deskilling, the social construction of expertise, and implications for responsible AI development.
arXiv Detail & Related papers (2024-10-31T19:51:28Z) - A Novel Psychometrics-Based Approach to Developing Professional Competency Benchmark for Large Language Models [0.0]
We propose a comprehensive approach to benchmark development based on rigorous psychometric principles.
We make the first attempt to illustrate this approach by creating a new benchmark in the field of pedagogy and education.
We construct a novel benchmark guided by Bloom's taxonomy and rigorously designed by a consortium of education experts trained in test development.
arXiv Detail & Related papers (2024-10-29T19:32:43Z) - Unveiling and Consulting Core Experts in Retrieval-Augmented MoE-based LLMs [64.9693406713216]
Internal mechanisms that contribute to the effectiveness of RAG systems remain underexplored.
Our experiments reveal that several core groups of experts are primarily responsible for RAG-related behaviors.
We propose several strategies to enhance RAG's efficiency and effectiveness through expert activation.
arXiv Detail & Related papers (2024-10-20T16:08:54Z) - PersonaFlow: Designing LLM-Simulated Expert Perspectives for Enhanced Research Ideation [12.593617990325528]
PersonaFlow is a system designed to provide multiple perspectives by using LLMs to simulate domain-specific experts. Our user studies showed that the new design increased the perceived relevance and creativity of ideated research directions. Users' ability to customize expert profiles significantly improved their sense of agency, which can potentially mitigate their over-reliance on AI.
arXiv Detail & Related papers (2024-09-19T07:54:29Z) - ProSwitch: Knowledge-Guided Instruction Tuning to Switch Between Professional and Non-Professional Responses [56.949741308866535]
Large Language Models (LLMs) have demonstrated efficacy in various linguistic applications. This study introduces a novel approach, named ProSwitch, which enables a language model to switch between professional and non-professional answers.
arXiv Detail & Related papers (2024-03-14T06:49:16Z) - Exploring the Cognitive Knowledge Structure of Large Language Models: An Educational Diagnostic Assessment Approach [50.125704610228254]
Large Language Models (LLMs) have not only exhibited exceptional performance across various tasks, but also demonstrated sparks of intelligence.
Recent studies have focused on assessing their capabilities on human exams and revealed their impressive competence in different domains.
We conduct an evaluation using MoocRadar, a meticulously annotated human test dataset based on Bloom's taxonomy.
arXiv Detail & Related papers (2023-10-12T09:55:45Z) - (Re)Defining Expertise in Machine Learning Development [3.096615629099617]
We conduct a systematic literature review of machine learning research to understand 1) the bases on which expertise is defined and recognized and 2) the roles experts play in ML development.
Our goal is to produce a high-level taxonomy to highlight limits and opportunities in how experts are identified and engaged in ML research.
arXiv Detail & Related papers (2023-02-08T21:10:20Z)
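Two entries above, RICE and the retrieval-augmented MoE study, both steer model behavior by strengthening the activation of particular experts at inference time. The toy sketch below shows the general mechanism on a single token's routing step; the boost value, the top-k scheme, and the offline identification of "core" experts are assumptions for illustration, not either paper's published procedure.

```python
import numpy as np

def route_with_boost(router_logits: np.ndarray, core_experts: list[int],
                     boost: float = 1.5, top_k: int = 2):
    """Toy top-k MoE routing with reinforced experts.

    router_logits: (num_experts,) raw gating logits for one token.
    core_experts:  indices of experts identified offline as driving the
                   desired behavior, e.g. via activation-frequency analysis.
    boost:         multiplicative bonus on the core experts' gate weights;
                   the value here is illustrative only.
    """
    logits = router_logits.copy()
    # Adding log(boost) to a logit multiplies that expert's softmax weight
    # by `boost`, nudging routing toward the core experts.
    logits[core_experts] += np.log(boost)
    top = np.argsort(logits)[-top_k:][::-1]  # k highest-scoring experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                 # renormalize gates over top-k
    return top, weights

# Example: expert 3 is treated as a "core" expert for one random token.
rng = np.random.default_rng(0)
experts, gates = route_with_boost(rng.normal(size=8), core_experts=[3])
print(experts, gates)
```

The appeal of this family of methods, as both abstracts note, is that the steering happens purely at inference time: no additional training is required, only a way to identify which experts to reinforce.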