Performance of Large Language Models in a Computer Science Degree Program
- URL: http://arxiv.org/abs/2308.02432v1
- Date: Mon, 24 Jul 2023 14:17:00 GMT
- Title: Performance of Large Language Models in a Computer Science Degree Program
- Authors: Tim Krüger, Michael Gref
- Abstract summary: This paper presents findings on the performance of different large language models in a university of applied sciences' undergraduate computer science degree program.
By prompting the models with lecture material, exercise tasks, and past exams, we aim to evaluate their proficiency across different computer science domains.
We found that ChatGPT-3.5 averaged 79.9% of the total score in 10 tested modules, BingAI achieved 68.4%, and LLaMa, in the 65 billion parameter variant, 20%.
- Score: 0.5330240017302619
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models such as ChatGPT-3.5 and GPT-4.0 are ubiquitous and
dominate the current discourse. Their transformative capabilities have led to a
paradigm shift in how we interact with and utilize (text-based) information.
Each day, new possibilities to leverage the capabilities of these models
emerge. This paper presents findings on the performance of different large
language models in a university of applied sciences' undergraduate computer
science degree program. Our primary objective is to assess the effectiveness of
these models within the curriculum by employing them as educational aids. By
prompting the models with lecture material, exercise tasks, and past exams, we
aim to evaluate their proficiency across different computer science domains. We
showcase the strong performance of current large language models while
highlighting limitations and constraints within the context of such a degree
program. We found that ChatGPT-3.5 averaged 79.9% of the total score in 10
tested modules, BingAI achieved 68.4%, and LLaMa, in the 65 billion parameter
variant, 20%. Despite these convincing results, even GPT-4.0 would not pass the
degree program, owing to its limitations in mathematical calculations.
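
For concreteness, here is a minimal sketch of how such an exam-style evaluation can be scored, assuming hypothetical `query_model` and `grade_answer` placeholders and an illustrative data layout; this is not the authors' actual pipeline.

```python
# Minimal sketch of an exam-style evaluation: prompt a model with each task,
# grade the answer, and report the share of the achievable score per module.
# `query_model`, `grade_answer`, and the data layout are illustrative
# assumptions, not the pipeline used in the paper.

def query_model(prompt: str) -> str:
    """Placeholder for a call to the model under test (e.g., ChatGPT-3.5)."""
    raise NotImplementedError("wire up the model under test here")

def grade_answer(answer: str, task: dict) -> float:
    """Placeholder for rubric-based grading; returns points achieved."""
    raise NotImplementedError("apply the module's grading rubric here")

def evaluate_module(tasks: list[dict]) -> float:
    """Percentage of the module's total score achieved by the model."""
    achieved = sum(grade_answer(query_model(t["prompt"]), t) for t in tasks)
    total = sum(t["max_points"] for t in tasks)
    return 100.0 * achieved / total

def evaluate_program(modules: dict[str, list[dict]]) -> float:
    """Average of per-module percentages; one plausible way to arrive at
    an overall figure such as the reported 79.9%."""
    scores = [evaluate_module(tasks) for tasks in modules.values()]
    return sum(scores) / len(scores)
```

Averaging per-module percentages keeps point-heavy modules from dominating the overall figure; pooling raw points across all modules would be the natural alternative.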
Related papers
- Language Models for Code Completion: A Practical Evaluation [13.174471984950857]
This study provides both quantitative and qualitative assessments of three public code language models when completing real-world code.
We collected real auto-completion usage data for over a year from more than 1200 users, resulting in over 600K valid completions.
We found that 66.3% of failures were due to the models' limitations, 24.4% occurred due to inappropriate model usage in a development context, and 9.3% were valid requests that developers overwrote.
arXiv Detail & Related papers (2024-02-25T20:43:55Z)
- The potential of large language models for improving probability learning: A study on ChatGPT3.5 and first-year computer engineering students [0.565395466029518]
ChatGPT, a large-scale language model, is applied to probability problems of the kind typically presented in computer engineering exams.
Its ability to deliver high-quality explanations and illustrate solutions in any programming language suggests that large language models have the potential to serve as learning assistants.
arXiv Detail & Related papers (2023-10-09T12:54:58Z)
- L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
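
The confidence-calibration measurement mentioned above can be pictured with the standard expected calibration error (ECE); the sketch below is a generic textbook formulation, assumed for illustration rather than taken from L2CEval.

```python
# Generic expected calibration error (ECE) over equal-width confidence bins.
# This is a standard formulation, assumed here for illustration; it is not
# taken from the L2CEval codebase.
def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: model probabilities in [0, 1]; correct: 0/1 outcomes."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Include confidence 1.0 in the last bin.
        idx = [i for i, c in enumerate(confidences)
               if lo <= c < hi or (b == n_bins - 1 and c == 1.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(avg_conf - accuracy)
    return ece
```

A well-calibrated model, whose stated confidence matches its empirical accuracy in every bin, scores near zero; large gaps between confidence and accuracy drive the value up.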
- Evaluating ChatGPT and GPT-4 for Visual Programming [20.64766977405438]
We evaluate two models, ChatGPT (based on GPT-3.5) and GPT-4, in visual programming domains for various scenarios.
Our results show that these models perform poorly and struggle to combine spatial, logical, and programming skills crucial for visual programming.
arXiv Detail & Related papers (2023-07-30T22:13:20Z)
- Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors [21.227955181065948]
We systematically evaluate two models, ChatGPT (based on GPT-3.5) and GPT-4, and compare their performance with human tutors for a variety of scenarios.
Our results show that GPT-4 drastically outperforms ChatGPT (based on GPT-3.5) and comes close to human tutors' performance for several scenarios.
arXiv Detail & Related papers (2023-06-29T17:57:40Z)
- Thrilled by Your Progress! Large Language Models (GPT-4) No Longer Struggle to Pass Assessments in Higher Education Programming Courses [0.0]
GPT models evolved from completely failing the typical programming class's assessments to confidently passing the courses with no human involvement.
This study provides evidence that programming instructors need to prepare for a world in which there is an easy-to-use technology that can be utilized by learners to collect passing scores.
arXiv Detail & Related papers (2023-06-15T22:12:34Z)
- Large Language Models in the Workplace: A Case Study on Prompt Engineering for Job Type Classification [58.720142291102135]
This case study investigates the task of job classification in a real-world setting.
The goal is to determine whether an English-language job posting is appropriate for a graduate or entry-level position.
arXiv Detail & Related papers (2023-03-13T14:09:53Z)
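
As a rough illustration of what prompt engineering for this task can look like, the sketch below builds a zero-shot classification prompt and parses a one-word answer; the prompt wording and the `complete` placeholder are assumptions, not the paper's actual prompts.

```python
# Illustrative zero-shot prompt for graduate/entry-level job classification.
# The prompt text and the `complete` placeholder are assumptions; the paper's
# actual prompt designs are not reproduced here.

def build_prompt(posting: str) -> str:
    return (
        "Decide whether the following English-language job posting is "
        "appropriate for a graduate or entry-level candidate.\n"
        f"Job posting:\n{posting}\n"
        "Answer with exactly one word: YES or NO.\nAnswer:"
    )

def complete(prompt: str) -> str:
    """Placeholder for a call to the LLM used for classification."""
    raise NotImplementedError("wire up the model here")

def classify(posting: str) -> bool:
    answer = complete(build_prompt(posting)).strip().upper()
    return answer.startswith("YES")
```

Constraining the model to a single-word answer keeps the output trivially parseable, a practical concern that prompt-engineering studies of this kind typically address.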
- Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models [648.3665819567409]
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale.
BIG-bench consists of 204 tasks, contributed by 450 authors across 132 institutions.
We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench.
arXiv Detail & Related papers (2022-06-09T17:05:34Z)
- PaLM: Scaling Language Modeling with Pathways [180.69584031908113]
We trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM).
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
arXiv Detail & Related papers (2022-04-05T16:11:45Z)
- Reasoning Like Program Executors [48.819113224699976]
POET empowers language models to harvest the reasoning knowledge possessed by program executors via a data-driven approach.
POET can significantly boost model performance on natural language reasoning.
POET opens a new avenue for reasoning-enhanced pre-training.
arXiv Detail & Related papers (2022-01-27T12:28:24Z)
- Few-shot Learning with Multilingual Language Models [66.49496434282564]
We train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages.
Our largest model sets a new state of the art in few-shot learning in more than 20 representative languages.
We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning.
arXiv Detail & Related papers (2021-12-20T16:52:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.