Evaluating the Limitations of Local LLMs in Solving Complex Programming Challenges
- URL: http://arxiv.org/abs/2509.15283v1
- Date: Thu, 18 Sep 2025 14:13:30 GMT
- Title: Evaluating the Limitations of Local LLMs in Solving Complex Programming Challenges
- Authors: Kadin Matotek, Heather Cassel, Md Amiruzzaman, Linh B. Ngo,
- Abstract summary: This study examines the performance of open-source, locally hosted large-language models (LLMs) in handling complex programming tasks.<n>Building on the original Framework for AI-driven Code Generation Evaluation (FACE), the authors retrofit the pipeline to work entirely offline.<n>Results show that the overall pass@1 accuracy is modest for the local models, with the best models performing at approximately half the acceptance rate of the proprietary models.
- Score: 0.31498833540989407
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study examines the performance of today's open-source, locally hosted large-language models (LLMs) in handling complex competitive programming tasks with extended problem descriptions and contexts. Building on the original Framework for AI-driven Code Generation Evaluation (FACE), the authors retrofit the pipeline to work entirely offline through the Ollama runtime, collapsing FACE's sprawling per-problem directory tree into a handful of consolidated JSON files, and adding robust checkpointing so multi-day runs can resume after failures. The enhanced framework generates, submits, and records solutions for the full Kattis corpus of 3,589 problems across eight code-oriented models ranging from 6.7-9 billion parameters. The submission results show that the overall pass@1 accuracy is modest for the local models, with the best models performing at approximately half the acceptance rate of the proprietary models, Gemini 1.5 and ChatGPT-4. These findings expose a persistent gap between private, cost-controlled LLM deployments and state-of-the-art proprietary services, yet also highlight the rapid progress of open models and the practical benefits of an evaluation workflow that organizations can replicate on in-house hardware.
Related papers
- OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling [13.57588221678224]
Large Language Models (LLMs) have demonstrated impressive progress in optimization modeling.<n>The boundaries of their capabilities in automated formulation and problem solving remain poorly understood.<n>We propose OPT-ENGINE, a benchmark framework designed to evaluate LLMs on optimization modeling with controllable and scalable difficulty levels.
arXiv Detail & Related papers (2026-01-09T09:22:33Z) - An Agentic Framework with LLMs for Solving Complex Vehicle Routing Problems [66.60904891478687]
We propose an Agentic Framework with LLMs (AFL) for solving complex vehicle routing problems.<n>AFL directly extracts knowledge from raw inputs and enables self-contained code generation.<n>We show that AFL substantially outperforms existing LLM-based baselines in both code reliability and solution feasibility.
arXiv Detail & Related papers (2025-10-19T03:59:25Z) - Beyond Single LLMs: Enhanced Code Generation via Multi-Stage Performance-Guided LLM Orchestration [12.674888937998086]
Large Language Models (LLMs) have become the predominant paradigm for automated code generation.<n>This paper challenges the single-model convention by introducing a multi-stage, performance-guided orchestration framework.<n>Perch orchestrates top-performing LLMs for each task context through stage-wise validation and rollback mechanisms.
arXiv Detail & Related papers (2025-10-01T19:07:16Z) - MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers [86.00932417210477]
We introduce MCP-Universe, the first comprehensive benchmark specifically designed to evaluate LLMs in realistic and hard tasks through interaction with real-world MCP servers.<n>Our benchmark encompasses 6 core domains spanning 11 different MCP servers: Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching.<n>We find that even SOTA models such as GPT-5 (43.72%), Grok-4 (33.33%) and Claude-4.0-Sonnet (29.44%) exhibit significant performance limitations.
arXiv Detail & Related papers (2025-08-20T13:28:58Z) - Dynamic Benchmark Construction for Evaluating Large Language Models on Real-World Codes [33.80591142965565]
We present CODE2BENCH, a pipeline for dynamically constructing robust and contamination-resistant benchmarks from real-world GitHub repositories.<n>Specifically, CODE2BENCH introduces three key innovations: (1) Automated Dynamism, achieved through periodic ingestion of recent code to minimize training data contamination; (2) Scope Graph-based dependency analysis, which enables structured classification of functions into benchmark instances with controlled dependency levels; and (3) Property-Based Testing (PBT) for the automated synthesis of rigorous test suites.
arXiv Detail & Related papers (2025-08-10T05:06:36Z) - OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks [52.87238755666243]
We present OmniEAR, a framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks.<n>We model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains.<n>Our systematic evaluation reveals severe performance degradation when models must reason from constraints.
arXiv Detail & Related papers (2025-08-07T17:54:15Z) - Enterprise Large Language Model Evaluation Benchmark [3.8601502919298016]
Large Language Models (LLMs) have demonstrated promise in boosting productivity across AI-powered tools.<n>We propose a 14-task framework grounded in Bloom's taxonomy to holistically evaluate LLM capabilities in enterprise contexts.
arXiv Detail & Related papers (2025-06-25T09:34:25Z) - MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering [57.156093929365255]
Gym-style framework for systematically reinforcement learning, evaluating, and improving autonomous large language model (LLM) agents.<n>MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic engineering scenarios.<n>Its fully executable environment supports comprehensive agent training via both supervised fine-tuning and reinforcement learning.
arXiv Detail & Related papers (2025-05-12T17:35:43Z) - Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time instead of larger models.<n>Our framework incorporates two complementary strategies: internal TTC and external TTC.<n>We demonstrate our textbf32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z) - COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z) - DeepCode AI Fix: Fixing Security Vulnerabilities with Large Language
Models [3.1690235522182104]
Large language models (LLMs) are increasingly used to solve various programming tasks.
We show that the task is difficult as it requires the model to learn long-range code relationships.
We propose a technique to address these challenges with a new approach for querying and fine-tuning LLMs.
arXiv Detail & Related papers (2024-02-19T18:35:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.