Assessing Personalized AI Mentoring with Large Language Models in the Computing Field
- URL: http://arxiv.org/abs/2412.08430v1
- Date: Wed, 11 Dec 2024 14:51:13 GMT
- Title: Assessing Personalized AI Mentoring with Large Language Models in the Computing Field
- Authors: Xiao Luo, Sean O'Connell, Shamima Mithun
- Abstract summary: GPT-4, LLaMA 3, and PaLM 2 were evaluated using a zero-shot learning approach without human intervention.
The analysis of frequently used words in the responses indicates that GPT-4 offers more personalized mentoring.
- Score: 3.855858854481047
- Abstract: This paper provides an in-depth evaluation of three state-of-the-art Large Language Models (LLMs) for personalized career mentoring in the computing field, using three distinct student profiles that consider gender, race, and professional level. We evaluated the performance of GPT-4, LLaMA 3, and PaLM 2 using a zero-shot learning approach without human intervention. A quantitative evaluation was conducted through a custom natural language processing analytics pipeline to highlight the uniqueness of the responses and to identify words reflecting each student's profile, including race, gender, or professional level. The analysis of frequently used words in the responses indicates that GPT-4 offers more personalized mentoring than the other two LLMs. Additionally, a qualitative evaluation was performed to determine whether human experts reached similar conclusions. The analysis of survey responses shows that GPT-4 outperformed the other two LLMs in delivering more accurate and useful mentoring while addressing students' specific challenges with encouraging language. Our work establishes a foundation for developing LLM-based personalized mentoring tools that incorporate human mentors in the process to deliver a more impactful and tailored mentoring experience.
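The abstract does not include the analytics pipeline itself; the sketch below is a minimal illustration, assuming hypothetical profiles and a placeholder response (no actual API calls), of how a zero-shot mentoring prompt could be built from a student profile and how a simple frequent-word analysis could check which profile attributes a model's response echoes. All function names and example data are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (hypothetical names): build a zero-shot mentoring prompt from a
# student profile and run a simple frequent-word / profile-term check on a response,
# loosely mirroring the analytics pipeline described in the abstract.
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "for", "on",
             "with", "your", "you", "is", "are", "as", "that", "this"}

def build_prompt(profile: dict) -> str:
    """Compose a zero-shot mentoring prompt from a student profile."""
    return (
        f"You are a career mentor in computing. The student is a "
        f"{profile['gender']} {profile['race']} {profile['level']} student. "
        f"Provide personalized career mentoring advice."
    )

def frequent_words(text: str, top_k: int = 20) -> list[tuple[str, int]]:
    """Return the most frequent non-stopword tokens in a response."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(top_k)

def profile_terms_used(text: str, profile: dict) -> set[str]:
    """Which profile attributes (gender, race, level) are echoed in the response?"""
    lowered = text.lower()
    return {v for v in profile.values() if v.lower() in lowered}

if __name__ == "__main__":
    profile = {"gender": "female", "race": "Hispanic", "level": "undergraduate"}
    print(build_prompt(profile))
    # Responses would come from GPT-4 / LLaMA 3 / PaLM 2; a placeholder is used here.
    response = ("As a Hispanic undergraduate woman in computing, consider joining "
                "student chapters, seeking research internships, and finding faculty mentors.")
    print(frequent_words(response, top_k=10))
    print(profile_terms_used(response, profile))
```

In the paper, comparing frequent-word profiles across the three models' responses is what indicated that GPT-4 produced the most profile-specific language; the sketch above only illustrates the counting step, not the full uniqueness analysis or the expert survey.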
Related papers
- Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs [50.0874045899661]
We introduce CharacterBot, a model designed to replicate both the linguistic patterns and distinctive thought processes of a character.
Using Lu Xun as a case study, we propose four training tasks derived from his 17 essay collections.
These include a pre-training task focused on mastering external linguistic structures and knowledge, as well as three fine-tuning tasks.
We evaluate CharacterBot on three tasks for linguistic accuracy and opinion comprehension, demonstrating that it significantly outperforms the baselines on our adapted metrics.
arXiv Detail & Related papers (2025-02-18T16:11:54Z) - HREF: Human Response-Guided Evaluation of Instruction Following in Language Models [61.273153125847166]
We develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF).
In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination.
We study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template.
arXiv Detail & Related papers (2024-12-20T03:26:47Z) - Can GPT-4 do L2 analytic assessment? [34.445391091278786]
Automated essay scoring (AES) to evaluate second language (L2) proficiency has been a firmly established technology used in educational contexts for decades.
In this paper, we perform a series of experiments using GPT-4 in a zero-shot fashion on a publicly available dataset annotated with holistic scores.
We observe significant correlations between the automatically predicted analytic scores and multiple features associated with the individual proficiency components.
arXiv Detail & Related papers (2024-04-29T10:00:00Z) - GPT-4 Surpassing Human Performance in Linguistic Pragmatics [0.0]
This study investigates the ability of Large Language Models (LLMs) to comprehend and interpret linguistic pragmatics.
Using Grice's communication principles, both LLMs and human subjects were evaluated on their responses to various dialogue-based tasks.
The findings revealed the superior performance and speed of LLMs, particularly GPT-4, over human subjects in interpreting pragmatics.
arXiv Detail & Related papers (2023-12-15T05:40:15Z) - From Voices to Validity: Leveraging Large Language Models (LLMs) for Textual Analysis of Policy Stakeholder Interviews [14.135107583299277]
This study explores the integration of Large Language Models (LLMs) with human expertise to enhance text analysis of stakeholder interviews regarding K-12 education policy within one U.S. state.
Using a mixed-methods approach, human experts developed a codebook and coding processes as informed by domain knowledge and unsupervised topic modeling results.
Results reveal that GPT-4 thematic coding agreed with human coding on 77.89% of specific themes; when expanded to broader themes, agreement rose to 96.02%, surpassing traditional Natural Language Processing (NLP) methods by over 25%.
arXiv Detail & Related papers (2023-12-02T18:55:14Z) - PsyCoT: Psychological Questionnaire as Powerful Chain-of-Thought for Personality Detection [50.66968526809069]
We propose a novel personality detection method, called PsyCoT, which mimics the way individuals complete psychological questionnaires in a multi-turn dialogue manner.
Our experiments demonstrate that PsyCoT significantly improves the performance and robustness of GPT-3.5 in personality detection.
arXiv Detail & Related papers (2023-10-31T08:23:33Z) - A Large Language Model Approach to Educational Survey Feedback Analysis [0.0]
This paper assesses the potential for the large language models (LLMs) GPT-4 and GPT-3.5 to aid in deriving insight from education feedback surveys.
arXiv Detail & Related papers (2023-09-29T17:57:23Z) - Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation? [20.476500441734427]
Large Language Models (LLMs) excel in various Natural Language Processing (NLP) tasks.
Their evaluation, particularly in languages beyond the top 20, remains inadequate due to the limitations of existing benchmarks and metrics.
arXiv Detail & Related papers (2023-09-14T06:41:58Z) - Revisiting the Reliability of Psychological Scales on Large Language Models [62.57981196992073]
This study aims to determine the reliability of applying personality assessments to Large Language Models.
Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory.
arXiv Detail & Related papers (2023-05-31T15:03:28Z) - Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples.
With human evaluation, we obtain converted total scores for LLMs including GPT-4, ChatGPT, and ERNIE-Bot.
We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
arXiv Detail & Related papers (2023-05-21T14:39:28Z) - ElitePLM: An Empirical Study on General Language Ability Evaluation of Pretrained Language Models [78.08792285698853]
We present a large-scale empirical study on general language ability evaluation of pretrained language models (ElitePLM).
Our empirical results demonstrate that: (1) PLMs with varying training objectives and strategies are good at different ability tests; (2) fine-tuning PLMs in downstream tasks is usually sensitive to the data size and distribution; and (3) PLMs have excellent transferability between similar tasks.
arXiv Detail & Related papers (2022-05-03T14:18:10Z)