Bridging HCI and AI Research for the Evaluation of Conversational SE Assistants
- URL: http://arxiv.org/abs/2502.07956v1
- Date: Tue, 11 Feb 2025 21:09:24 GMT
- Title: Bridging HCI and AI Research for the Evaluation of Conversational SE Assistants
- Authors: Jonan Richards, Mairieli Wessel
- Abstract summary: Large Language Models (LLMs) are increasingly adopted in software engineering, recently in the form of conversational assistants.
We advocate combining insights from human-computer interaction (HCI) and artificial intelligence (AI) research to enable human-centered automatic evaluation.
- Abstract: As Large Language Models (LLMs) are increasingly adopted in software engineering, recently in the form of conversational assistants, ensuring these technologies align with developers' needs is essential. The limitations of traditional human-centered methods for evaluating LLM-based tools at scale raise the need for automatic evaluation. In this paper, we advocate combining insights from human-computer interaction (HCI) and artificial intelligence (AI) research to enable human-centered automatic evaluation of LLM-based conversational SE assistants. We identify requirements for such evaluation and challenges down the road, working towards a framework that ensures these assistants are designed and deployed in line with user needs.
Related papers
- Constraining Participation: Affordances of Feedback Features in Interfaces to Large Language Models [49.74265453289855]
Large language models (LLMs) are now accessible to anyone with a computer, a web browser, and an internet connection via browser-based interfaces.
This paper examines the affordances of interactive feedback features in ChatGPT's interface, analysing how they shape user input and participation in iteration.
arXiv Detail & Related papers (2024-08-27T13:50:37Z)
- AI-Based IVR [0.0]
This article examines the application of artificial intelligence (AI) technologies to enhance the efficiency of systems in call centers.
A proposed approach is based on the integration of speech-to-text conversion, text query classification using large language models (LLM), and speech synthesis.
Special attention is given to adapting these technologies to work with the Kazakh language.
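The entry above describes a three-stage pipeline: speech-to-text conversion, LLM-based classification of the transcribed query, and speech synthesis of the reply. A minimal sketch of that flow is below; every component here is a hypothetical stand-in (simple string handling in place of real ASR, LLM, and TTS models), not the paper's actual system.

```python
# Sketch of the described IVR pipeline: STT -> intent classification -> TTS.
# All three stages are placeholder stubs for illustration only.

def speech_to_text(audio: bytes) -> str:
    """Stand-in for an ASR model; a real system would invoke one here."""
    return audio.decode("utf-8")  # pretend the "audio" already carries text

def classify_intent(query: str) -> str:
    """Stand-in for LLM-based query classification (keyword matching here)."""
    keywords = {"balance": "account_balance", "tariff": "plan_info"}
    for word, intent in keywords.items():
        if word in query.lower():
            return intent
    return "fallback_to_operator"

def synthesize_speech(text: str) -> bytes:
    """Stand-in for TTS; a real system would return synthesized audio."""
    return text.encode("utf-8")

def handle_call(audio: bytes) -> bytes:
    """Run one caller utterance through the full pipeline."""
    intent = classify_intent(speech_to_text(audio))
    responses = {
        "account_balance": "Your balance is available in the app.",
        "plan_info": "Here is your current plan.",
        "fallback_to_operator": "Transferring you to an operator.",
    }
    return synthesize_speech(responses[intent])
```

In a production system, each stub would be replaced by a dedicated model, with the language-adaptation work (here, for Kazakh) concentrated in the ASR and TTS stages.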
arXiv Detail & Related papers (2024-08-20T05:04:40Z)
- Towards Scalable Automated Alignment of LLMs: A Survey [54.820256625544225]
This paper systematically reviews the recently emerging methods of automated alignment.
We categorize existing automated alignment methods into 4 major categories based on the sources of alignment signals.
We discuss, starting from the fundamental role of alignment, the essential factors that make automated alignment technologies feasible and effective.
arXiv Detail & Related papers (2024-06-03T12:10:26Z)
- Human-Centered Automation [0.3626013617212666]
The paper argues for the emerging area of Human-Centered Automation (HCA), which prioritizes user needs and preferences in the design and development of automation systems.
The paper discusses the limitations of existing automation approaches, the challenges in integrating AI and RPA, and the benefits of human-centered automation for productivity, innovation, and democratizing access to these technologies.
arXiv Detail & Related papers (2024-05-24T22:12:28Z)
- Beyond Static Evaluation: A Dynamic Approach to Assessing AI Assistants' API Invocation Capabilities [48.922660354417204]
We propose Automated Dynamic Evaluation (AutoDE) to assess an assistant's API call capability without human involvement.
In our framework, we endeavor to closely mirror genuine human conversation patterns in human-machine interactions.
arXiv Detail & Related papers (2024-03-17T07:34:12Z)
- ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate [57.71597869337909]
We build a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models.
Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments.
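The multi-agent debate idea above can be illustrated with a toy loop: several "referee" agents each score a candidate response, see the group consensus, and revise before a final verdict is aggregated. The agents below are simple deterministic heuristics standing in for the LLM referees ChatEval actually uses; this is an illustration of the debate structure only, not the paper's method.

```python
# Toy multi-agent debate for response evaluation: agents score, observe
# the consensus, and revise over a fixed number of rounds.

def debate(response: str, agents, rounds: int = 2) -> float:
    """Aggregate agents' scores for one response after debate rounds."""
    scores = [agent(response, None) for agent in agents]  # independent first pass
    for _ in range(rounds):
        consensus = sum(scores) / len(scores)
        # Each agent revises its score toward/away from the group view.
        scores = [agent(response, consensus) for agent in agents]
    return sum(scores) / len(scores)

def strict_agent(response, consensus):
    """Hypothetical harsh referee: rewards longer responses, drifts to consensus."""
    base = 0.5 if len(response) > 10 else 0.2
    return base if consensus is None else (base + consensus) / 2

def lenient_agent(response, consensus):
    """Hypothetical generous referee with a high default score."""
    base = 0.9
    return base if consensus is None else (base + consensus) / 2
```

The debate structure lets disagreeing evaluators moderate one another, which is the behavior ChatEval reports as going beyond a single model's textual scoring.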
arXiv Detail & Related papers (2023-08-14T15:13:04Z)
- Requirements Engineering Framework for Human-centered Artificial Intelligence Software Systems [9.642259026572175]
We present a new framework developed based on human-centered AI guidelines and a user survey to aid in collecting requirements for human-centered AI-based software.
The framework is applied to a case study to elicit and model requirements for enhancing the quality of 360-degree videos intended for virtual reality (VR) users.
arXiv Detail & Related papers (2023-03-06T06:37:50Z)
- The Roles and Modes of Human Interactions with Automated Machine Learning Systems [7.670270099306412]
Automated machine learning (AutoML) systems continue to progress in both sophistication and performance.
It becomes important to understand the 'how' and 'why' of human-computer interaction (HCI) within these frameworks.
This review serves to identify key research directions aimed at better facilitating the roles and modes of human interactions with both current and future AutoML systems.
arXiv Detail & Related papers (2022-05-09T09:28:43Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
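The core idea above, estimating a target policy's expected human score from pre-collected logs alone, can be sketched with a generic importance-sampling estimator. Note this is a standard off-policy evaluation illustration under assumed log and policy shapes, not ENIGMA's actual (model-free) method.

```python
# Generic importance-sampling estimator for off-policy evaluation:
# estimate the target policy's expected human score using only logged
# interactions, with no new human involvement.

def ope_estimate(logs, target_prob):
    """logs: list of (response, behavior_prob, human_score) tuples,
    where behavior_prob is the logging policy's probability of the response.
    target_prob: callable giving the target policy's probability of
    producing each logged response."""
    total = 0.0
    for response, behavior_prob, score in logs:
        weight = target_prob(response) / behavior_prob  # importance weight
        total += weight * score  # reweight the logged human score
    return total / len(logs)
```

Because the estimator reweights existing human judgments rather than collecting new ones, evaluation of a new dialog policy needs no further human interaction, which is the property the abstract highlights.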
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
- Watch-And-Help: A Challenge for Social Perception and Human-AI Collaboration [116.28433607265573]
We introduce Watch-And-Help (WAH), a challenge for testing social intelligence in AI agents.
In WAH, an AI agent needs to help a human-like agent perform a complex household task efficiently.
We build VirtualHome-Social, a multi-agent household environment, and provide a benchmark including both planning and learning based baselines.
arXiv Detail & Related papers (2020-10-19T21:48:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.