Can ChatGPT and Bard Generate Aligned Assessment Items? A Reliability
Analysis against Human Performance
- URL: http://arxiv.org/abs/2304.05372v1
- Date: Sun, 9 Apr 2023 04:53:15 GMT
- Title: Can ChatGPT and Bard Generate Aligned Assessment Items? A Reliability
Analysis against Human Performance
- Authors: Abdolvahab Khademi
- Abstract summary: ChatGPT and Bard are AI chatbots based on Large Language Models (LLMs).
In education, these AI technologies have been tested for applications in assessment and teaching.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: ChatGPT and Bard are AI chatbots based on Large Language Models (LLMs) that
promise applications in diverse areas. In education, these AI technologies have
been tested for applications in assessment and teaching. In assessment, AI has
long been used in automated essay scoring and automated item generation. One
psychometric property that these tools must have to assist or replace humans in
assessment is high reliability in terms of agreement between AI scores and
human raters. In this paper, we measure the reliability of the OpenAI ChatGPT
and Google Bard LLM tools against experienced and trained humans in perceiving
and rating the complexity of writing prompts. Using intraclass correlation
(ICC) as the performance metric, we found that the inter-rater reliability of
both OpenAI ChatGPT and Google Bard against the gold standard of human ratings
was low.
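As a rough illustration of the metric used in the abstract, the sketch below computes an intraclass correlation between human and AI ratings using the pingouin Python library; the prompts, raters, and rating values are hypothetical and are not data from the paper.

```python
# Minimal sketch: ICC between human and AI ratings of prompt complexity.
# All ratings below are hypothetical, for illustration only.
import pandas as pd
import pingouin as pg

# Long format: one row per (prompt, rater, rating) triple.
data = pd.DataFrame({
    "prompt": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "rater":  ["human", "chatgpt", "bard"] * 4,
    "rating": [4, 3, 2, 5, 5, 3, 2, 4, 4, 3, 2, 5],
})

icc = pg.intraclass_corr(data=data, targets="prompt",
                         raters="rater", ratings="rating")
# ICC2 (two-way random effects, absolute agreement) is a common choice
# when raters are treated as a sample from a larger population of raters.
print(icc.set_index("Type").loc["ICC2", ["ICC", "CI95%"]])
```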
Related papers
- Distributed agency in second language learning and teaching through generative AI [0.0]
ChatGPT can provide informal second language practice through chats in written or voice forms.
Instructors can use AI to build learning and assessment materials in a variety of media.
arXiv Detail & Related papers (2024-03-29T14:55:40Z) - Developing generative AI chatbots conceptual framework for higher education [0.0]
This study aims to understand the implications of generative AI chatbots for higher education and to pinpoint critical elements for their effective implementation.
The results demonstrate how much AI chatbots can do to improve student engagement, streamline the educational process, and support administrative and research duties.
There are also clear challenges, however, such as negative student sentiment, doubts about the accuracy of AI-generated material, and discomfort with new technologies.
arXiv Detail & Related papers (2024-03-28T10:40:26Z) - Beyond Static Evaluation: A Dynamic Approach to Assessing AI Assistants' API Invocation Capabilities [48.922660354417204]
We propose Automated Dynamic Evaluation (AutoDE) to assess an assistant's API call capability without human involvement.
In our framework, we endeavor to closely mirror genuine human conversation patterns in human-machine interactions.
arXiv Detail & Related papers (2024-03-17T07:34:12Z) - ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate [57.71597869337909]
We build a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models.
Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments.
arXiv Detail & Related papers (2023-08-14T15:13:04Z) - Perception, performance, and detectability of conversational artificial
intelligence across 32 university courses [15.642614735026106]
We compare the performance of ChatGPT against students on 32 university-level courses.
We find that ChatGPT's performance is comparable, if not superior, to that of students in many courses.
We find an emerging consensus among students to use the tool, and among educators to treat this as plagiarism.
arXiv Detail & Related papers (2023-05-07T10:37:51Z) - AI, write an essay for me: A large-scale comparison of human-written
versus ChatGPT-generated essays [66.36541161082856]
ChatGPT and similar generative AI models have attracted hundreds of millions of users.
This study compares human-written versus ChatGPT-generated argumentative student essays.
arXiv Detail & Related papers (2023-04-24T12:58:28Z) - Evaluating Human-Language Model Interaction [79.33022878034627]
We develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems.
We design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation.
We find that better non-interactive performance does not always translate to better human-LM interaction.
arXiv Detail & Related papers (2022-12-19T18:59:45Z) - The Role of AI in Drug Discovery: Challenges, Opportunities, and
Strategies [97.5153823429076]
The benefits, challenges and drawbacks of AI in this field are reviewed.
The use of data augmentation, explainable AI, and the integration of AI with traditional experimental methods are also discussed.
arXiv Detail & Related papers (2022-12-08T23:23:39Z) - Can Machines Imitate Humans? Integrative Turing Tests for Vision and Language Demonstrate a Narrowing Gap [45.6806234490428]
We benchmark current AIs in their abilities to imitate humans in three language tasks and three vision tasks.
Experiments involved 549 human agents plus 26 AI agents for dataset creation, and 1,126 human judges plus 10 AI judges.
Results reveal that current AIs are not far from being able to impersonate humans in complex language and vision challenges.
arXiv Detail & Related papers (2022-11-23T16:16:52Z) - Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring
Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable. Even heavy modifications (as much as 25%) with content unrelated to the topic of the question do not decrease the scores produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)
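The adversarial scheme in the last entry can be pictured as a simple perturb-and-compare loop. The sketch below is a minimal illustration of that idea under stated assumptions, not the toolkit's actual code; score_essay is a hypothetical stand-in for a trained AES model.

```python
# Minimal sketch of perturb-and-compare robustness testing for AES models.
# `score_essay` is a hypothetical stand-in for a real scoring model.
import random
from typing import Callable

OFF_TOPIC = [
    "The mitochondria is the powerhouse of the cell.",
    "Stock prices fluctuated sharply last quarter.",
    "The recipe calls for two cups of flour.",
]

def perturb(essay: str, fraction: float = 0.25) -> str:
    """Append off-topic sentences amounting to ~fraction of the essay."""
    sentences = essay.split(". ")
    n_extra = max(1, int(len(sentences) * fraction))
    return essay + " " + " ".join(random.choices(OFF_TOPIC, k=n_extra))

def overstability_gap(essay: str,
                      score_essay: Callable[[str], float]) -> float:
    """Score change after heavy off-topic modification; a gap near zero
    signals the 'overstability' the toolkit reports."""
    return score_essay(essay) - score_essay(perturb(essay))
```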