Evaluating the Performance of Large Language Models in Competitive Programming: A Multi-Year, Multi-Grade Analysis
- URL: http://arxiv.org/abs/2409.09054v1
- Date: Sat, 31 Aug 2024 10:39:54 GMT
- Title: Evaluating the Performance of Large Language Models in Competitive Programming: A Multi-Year, Multi-Grade Analysis
- Authors: Adrian Marius Dumitran, Adrian Catalin Badea, Stefan-Gabriel Muscalu
- Abstract summary: This study explores the performance of large language models (LLMs) in solving competitive programming problems from the Romanian Informatics Olympiad at the county level.
We collected and analyzed a dataset comprising 304 challenges from 2002 to 2023.
The analysis revealed significant variations in LLM performance across different grades and problem types.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This study explores the performance of large language models (LLMs) in solving competitive programming problems from the Romanian Informatics Olympiad at the county level. Romania, a leading nation in computer science competitions, provides an ideal environment for evaluating LLM capabilities due to its rich history and stringent competition standards. We collected and analyzed a dataset comprising 304 challenges from 2002 to 2023, focusing on solutions written by LLMs in C++ and Python for these problems. Our primary goal is to understand why LLMs perform well or poorly on different tasks. We evaluated various models, including closed-source models like GPT-4 and open-weight models such as CodeLlama and RoMistral, using a standardized process involving multiple attempts and feedback rounds. The analysis revealed significant variations in LLM performance across different grades and problem types. Notably, GPT-4 showed strong performance, indicating its potential use as an educational tool for middle school students. We also observed differences in code quality and style across various LLMs.
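The standardized process mentioned in the abstract — multiple attempts with feedback rounds against the official test data — can be pictured with the minimal sketch below. This is a hypothetical illustration, not the authors' actual harness: the model client, grader interface, attempt budget, and feedback format are all assumptions.

```python
# Hypothetical sketch of a multi-attempt evaluation loop with feedback rounds.
# The model client and grader are assumed callables, not the paper's released code.
from typing import Callable, Optional, Tuple

def evaluate_problem(
    statement: str,
    ask_model: Callable[[str, Optional[str]], str],   # (statement, feedback) -> source code
    grade: Callable[[str], Tuple[float, str]],        # source code -> (score 0..100, feedback)
    max_attempts: int = 3,
) -> float:
    """Give the model several attempts, feeding the grader's feedback back each round."""
    feedback: Optional[str] = None
    best = 0.0
    for _ in range(max_attempts):
        source = ask_model(statement, feedback)   # model returns a C++/Python solution
        score, feedback = grade(source)           # run on the official tests
        best = max(best, score)
        if best >= 100.0:                         # full score: no further rounds needed
            break
    return best
```

In the paper the grading step corresponds to the official county-level test data with partial scores per test group; the three-attempt budget above is only an assumed default.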
Related papers
- Benchmarking Large Language Models on Homework Assessment in Circuit Analysis [5.040027178589699]
Large language models (LLMs) have the potential to revolutionize various fields, including code development, robotics, finance, and education.
This paper investigates how LLMs can be leveraged in engineering education.
We benchmark the capabilities of different LLMs, including GPT-3.5 Turbo, GPT-4o, and Llama 3 70B.
arXiv Detail & Related papers (2025-06-05T15:16:30Z)
- Performance Review on LLM for solving leetcode problems [7.377558533352298]
This paper presents a comprehensive performance evaluation of Large Language Models (LLMs) in solving programming challenges from Leetcode.
We generated solutions with multiple LLMs, including GPT-4 and GPT-3.5-turbo.
Our results highlight the strengths and limitations of current LLMs in code generation and problem-solving tasks.
arXiv Detail & Related papers (2025-02-16T08:52:45Z)
- LLM-ProS: Analyzing Large Language Models' Performance in Competitive Problem Solving [1.5106583432923495]
This paper introduces a novel evaluation technique, LLM-ProS, to assess the performance of state-of-the-art LLMs.
Using a curated dataset of 166 World Finals problems from 2011 to 2024, we benchmark the models' reasoning, accuracy, and efficiency.
Our results reveal significant differences in the models' abilities to generalize, adapt, and solve novel problems.
arXiv Detail & Related papers (2025-02-04T18:55:14Z)
- Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation [55.21013307734612]
AoPS-Instruct is a dataset of more than 600,000 high-quality QA pairs.
LiveAoPSBench is an evolving evaluation set with timestamps, derived from the latest forum data.
Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning.
arXiv Detail & Related papers (2025-01-24T06:39:38Z)
- Empowering Large Language Models in Wireless Communication: A Novel Dataset and Fine-Tuning Framework [81.29965270493238]
We develop a specialized dataset aimed at enhancing the evaluation and fine-tuning of large language models (LLMs) for wireless communication applications.
The dataset includes a diverse set of multi-hop questions, including true/false and multiple-choice types, spanning varying difficulty levels from easy to hard.
We introduce a Pointwise V-Information (PVI) based fine-tuning method, providing a detailed theoretical analysis and justification for its use in quantifying the information content of training data.
arXiv Detail & Related papers (2025-01-16T16:19:53Z)
- A Real-World Benchmark for Evaluating Fine-Grained Issue Solving Capabilities of Large Language Models [11.087034068992653]
FAUN-Eval is a benchmark specifically designed to evaluate the Fine-grAined issUe solviNg capabilities of LLMs.
It is constructed using a dataset curated from 30 well-known GitHub repositories.
We evaluate ten LLMs with FAUN-Eval, including four closed-source and six open-source models.
arXiv Detail & Related papers (2024-11-27T03:25:44Z)
- SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated as compared to canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- Evaluating Language Models for Generating and Judging Programming Feedback [4.743413681603463]
Large language models (LLMs) have transformed research and practice across a wide range of domains.
We evaluate the efficiency of open-source LLMs in generating high-quality feedback for programming assignments.
arXiv Detail & Related papers (2024-07-05T21:44:11Z)
- Prompting Large Language Models to Tackle the Full Software Development Lifecycle: A Case Study [72.24266814625685]
We explore the performance of large language models (LLMs) across the entire software development lifecycle with DevEval.
DevEval features four programming languages, multiple domains, high-quality data collection, and carefully designed and verified metrics for each task.
Empirical studies show that current LLMs, including GPT-4, fail to solve the challenges presented within DevEval.
arXiv Detail & Related papers (2024-03-13T15:13:44Z)
- LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code [34.03774442237902]
Large Language Models applied to code-related applications have emerged as a prominent field.
Existing evaluation benchmarks (e.g., HumanEval, MBPP) are no longer sufficient for assessing their capabilities.
We propose LiveCodeBench, a comprehensive and contamination-free evaluation of LLMs for code.
arXiv Detail & Related papers (2024-03-12T17:58:04Z)
- Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
- Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks.
LLMs acquire their general-purpose language understanding and generation abilities by training billions of model parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z)
- L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
- CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models [43.655927559990616]
We propose CodeApex, a benchmark dataset focusing on the programming comprehension, code generation, and code correction abilities of LLMs.
We evaluate 12 widely used LLMs, including both general-purpose and specialized models.
GPT-4 exhibits the best programming capabilities, achieving approximate accuracies of 69%, 54%, and 66% on the three tasks, respectively.
arXiv Detail & Related papers (2023-09-05T04:12:01Z)
- CMMLU: Measuring massive multitask language understanding in Chinese [133.70911295934746]
This paper introduces a comprehensive Chinese benchmark that covers various subjects, including natural science, social sciences, engineering, and humanities.
CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.
arXiv Detail & Related papers (2023-06-15T15:49:51Z)
- Small Language Models Improve Giants by Rewriting Their Outputs [18.025736098795296]
We tackle the problem of leveraging training data to improve the performance of large language models (LLMs) without fine-tuning.
We create a pool of candidates from the LLM through few-shot prompting and employ a compact model, the LM-corrector (LMCor), trained specifically to merge these candidates into an enhanced output (a minimal sketch of this pipeline appears after this list).
Experiments on four natural language generation tasks demonstrate that even a small LMCor model (250M) substantially improves the few-shot performance of LLMs (62B), matching and even outperforming standard fine-tuning.
arXiv Detail & Related papers (2023-05-22T22:07:50Z)
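The candidate-pool-plus-corrector idea summarized in the last entry could look roughly like the sketch below. It is an assumed illustration, not the paper's released code: the sampling function, corrector interface, prompt layout, and candidate count are placeholders.

```python
# Hypothetical sketch of the LMCor pipeline: sample k candidates from a large LLM
# via few-shot prompting, then let a compact trained corrector merge them.
from typing import Callable, List

def correct_with_lmcor(
    task_input: str,
    sample_llm: Callable[[str, int], List[str]],   # few-shot-prompted large LLM -> k candidates
    corrector_generate: Callable[[str], str],      # compact trained LM-corrector
    num_candidates: int = 5,
) -> str:
    """Merge several LLM candidates into one improved output via a compact corrector."""
    candidates = sample_llm(task_input, num_candidates)
    # The corrector sees the task input together with every candidate and rewrites
    # them into a single enhanced output.
    corrector_prompt = task_input + "\n" + "\n".join(
        f"Candidate {i + 1}: {c}" for i, c in enumerate(candidates)
    )
    return corrector_generate(corrector_prompt)
```

Per the entry's description, the corrector is trained separately to perform this merging; that training step is omitted from the sketch.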
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.