David vs. Goliath: A comparative study of different-sized LLMs for code generation in the domain of automotive scenario generation
- URL: http://arxiv.org/abs/2510.14115v1
- Date: Wed, 15 Oct 2025 21:37:02 GMT
- Title: David vs. Goliath: A comparative study of different-sized LLMs for code generation in the domain of automotive scenario generation
- Authors: Philipp Bauerfeind, Amir Salarpour, David Fernandez, Pedram MohajerAnsari, Johannes Reschke, Mert D. Pesé
- Abstract summary: NL-to-Scenic generation with large language models (LLMs) suffers from scarce data, limited reproducibility, and inconsistent metrics. We introduce NL2Scenic, an open dataset and framework with 146 NL/Scenic pairs, a difficulty-stratified 30-case test split, and an Example Retriever. We evaluate 13 models: four proprietary (GPT-4o, GPT-5, Claude-Sonnet-4, Gemini-2.5-pro) and nine open-source code models (Qwen2.5Coder 0.5B-32B; CodeLlama 7B/13B/34B).
- Score: 1.6752458252726459
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scenario simulation is central to testing autonomous driving systems. Scenic, a domain-specific language (DSL) for CARLA, enables precise and reproducible scenarios, but NL-to-Scenic generation with large language models (LLMs) suffers from scarce data, limited reproducibility, and inconsistent metrics. We introduce NL2Scenic, an open dataset and framework with 146 NL/Scenic pairs, a difficulty-stratified 30-case test split, an Example Retriever, and 14 prompting variants (ZS, FS, CoT, SP, MoT). We evaluate 13 models: four proprietary (GPT-4o, GPT-5, Claude-Sonnet-4, Gemini-2.5-pro) and nine open-source code models (Qwen2.5Coder 0.5B-32B; CodeLlama 7B/13B/34B), using text metrics (BLEU, ChrF, EDIT-SIM, CrystalBLEU) and execution metrics (compilation and generation), and compare them with an expert study (n=11). EDIT-SIM correlates best with human judgments; we also propose EDIT-COMP (F1 of EDIT-SIM and compilation) as a robust dataset-level proxy that improves ranking fidelity. GPT-4o performs best overall, while Qwen2.5Coder-14B reaches about 88 percent of its expert score on local hardware. Retrieval-augmented prompting, Few-Shot with Example Retriever (FSER), consistently boosts smaller models, and scaling shows diminishing returns beyond mid-size, with Qwen2.5Coder outperforming CodeLlama at comparable scales. NL2Scenic and EDIT-COMP offer a standardized, reproducible basis for evaluating Scenic code generation and indicate that mid-size open-source models are practical, cost-effective options for autonomous-driving scenario programming.
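The paper defines EDIT-COMP only as the F1 of EDIT-SIM and compilation; the sketch below shows one plausible reading, with EDIT-SIM as normalized Levenshtein similarity and F1 as the harmonic mean of the two dataset-level scores. Function names are illustrative, not taken from the NL2Scenic codebase.

```python
# Hypothetical sketch of EDIT-SIM / EDIT-COMP; names are illustrative,
# not taken from the NL2Scenic codebase.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def edit_sim(pred: str, ref: str) -> float:
    """Normalized edit-distance similarity in [0, 1]."""
    if not pred and not ref:
        return 1.0
    return 1.0 - levenshtein(pred, ref) / max(len(pred), len(ref))

def edit_comp(mean_edit_sim: float, compile_rate: float) -> float:
    """F1-style harmonic mean of dataset-level EDIT-SIM and compilation rate."""
    if mean_edit_sim + compile_rate == 0:
        return 0.0
    return 2 * mean_edit_sim * compile_rate / (mean_edit_sim + compile_rate)
```

Combining the two scores this way penalizes models that produce superficially similar but non-compiling code, which is consistent with the paper's claim that EDIT-COMP improves ranking fidelity over text metrics alone.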
Related papers
- HuggingR$^{4}$: A Progressive Reasoning Framework for Discovering Optimal Model Companions [50.61510609116118]
HuggingR$^{4}$ is a novel framework that combines Reasoning, Retrieval, Refinement, and Reflection to efficiently select models. It attains a workability rate of 92.03% and a reasonability rate of 82.46%, surpassing existing methods by 26.51% and 33.25% respectively.
arXiv Detail & Related papers (2025-11-24T03:13:45Z)
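The abstract only names the four stages; below is a minimal sketch of how such a selection loop might compose, with every callable a hypothetical placeholder rather than HuggingR$^{4}$'s actual interface.

```python
from typing import Callable, List

# Schematic Reason/Retrieve/Refine/Reflect loop for picking a model;
# every component below is a hypothetical placeholder, not HuggingR^4 code.
def select_model(task: str,
                 reason: Callable[[str], str],
                 retrieve: Callable[[str], List[str]],
                 refine: Callable[[str, List[str]], str],
                 reflect: Callable[[str, str], bool],
                 max_rounds: int = 3) -> str:
    plan = reason(task)                    # 1. Reasoning: decompose the request
    choice = ""
    for _ in range(max_rounds):
        candidates = retrieve(plan)        # 2. Retrieval: fetch candidate models
        choice = refine(task, candidates)  # 3. Refinement: narrow to one model
        if reflect(task, choice):          # 4. Reflection: sanity-check the pick
            return choice
        plan = reason(task + f"\nPrevious pick failed: {choice}")
    return choice                          # best effort after max_rounds
```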
- Lifecycle-Aware code generation: Leveraging Software Engineering Phases in LLMs [12.70863561286374]
We introduce a lifecycle-aware framework that incorporates intermediate artifacts into both the training and inference stages. Experiments show that lifecycle-level fine-tuning improves code correctness by up to 75% over the same model before fine-tuning. Open-source LLMs, once fine-tuned under our framework, match or slightly outperform models pretrained on code.
arXiv Detail & Related papers (2025-10-28T02:54:02Z)
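One plausible reading of "incorporating intermediate artifacts" is to condition each lifecycle phase on the artifacts of earlier ones; the phase names and generate() callable below are illustrative assumptions, not the paper's framework.

```python
# Feed each software-engineering phase's artifact into the next phase's
# prompt; phase names and generate() are illustrative assumptions.
from typing import Callable, Dict

PHASES = ["requirements", "design", "implementation"]

def lifecycle_generate(task: str, generate: Callable[[str], str]) -> Dict[str, str]:
    artifacts: Dict[str, str] = {}
    context = f"Task: {task}"
    for phase in PHASES:
        prompt = f"{context}\n\nProduce the {phase} artifact."
        artifacts[phase] = generate(prompt)
        context += f"\n\n[{phase}]\n{artifacts[phase]}"  # feed artifact forward
    return artifacts  # 'implementation' holds the final code
```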
- Datarus-R1: An Adaptive Multi-Step Reasoning LLM for Automated Data Analysis [0.0]
We present Datarus-R1-14B, a language model fine-tuned from Qwen 2.5-14B-Instruct to act as a virtual data analyst and graduate-level problem solver. Datarus is trained not on isolated question-answer pairs but on full analytical trajectories, including reasoning steps, code execution, error traces, self-corrections, and final conclusions.
arXiv Detail & Related papers (2025-08-18T21:58:18Z)
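A guess at what a single training record for such a trajectory could hold, based only on the fields the abstract lists; the schema is hypothetical.

```python
# Hypothetical schema for a "full analytical trajectory" training record,
# derived only from the fields named in the abstract.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TrajectoryStep:
    reasoning: str            # natural-language reasoning for this step
    code: str                 # code the model chose to execute
    output: str               # execution result, including error traces
    correction: str = ""      # self-correction applied after an error, if any

@dataclass
class Trajectory:
    question: str
    steps: List[TrajectoryStep] = field(default_factory=list)
    conclusion: str = ""      # final answer the trajectory ends on
```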
- A Hierarchical and Evolvable Benchmark for Fine-Grained Code Instruction Following with Multi-Turn Feedback [30.446511584123492]
Large language models (LLMs) have advanced significantly in code generation, yet their ability to follow complex programming instructions with layered and diverse constraints remains underexplored. We introduce MultiCodeIF, a comprehensive benchmark designed to evaluate instruction following in code generation across multiple dimensions. We synthesize and evolve 2,021 code tasks sourced from 14 programming languages, supporting multi-turn evaluation through feedback-driven task variants.
arXiv Detail & Related papers (2025-07-01T11:51:40Z)
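Feedback-driven, multi-turn evaluation might look roughly like the loop below, which re-prompts with the violated constraints until the model complies; the checker interface is an illustrative assumption, not MultiCodeIF's.

```python
# Sketch of multi-turn, feedback-driven evaluation: re-prompt with failed
# constraints until the model complies or turns run out. The constraint
# checker and generate() callable are illustrative assumptions.
from typing import Callable, List

def evaluate_multi_turn(task: str,
                        constraints: List[Callable[[str], bool]],
                        generate: Callable[[str], str],
                        max_turns: int = 3) -> int:
    prompt = task
    for turn in range(1, max_turns + 1):
        code = generate(prompt)
        failed = [c for c in constraints if not c(code)]
        if not failed:
            return turn               # turns needed to satisfy all constraints
        names = ", ".join(c.__name__ for c in failed)
        prompt = f"{task}\n\nYour last answer violated: {names}. Fix it."
    return -1                         # never satisfied all constraints
```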
- Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications [0.7124971549479361]
This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification. We determine sample-size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability.
arXiv Detail & Related papers (2025-05-20T21:12:58Z)
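The abstract does not spell out the metrics; the sketch below shows common-sense versions of an invalid-response rate and intra-rater consistency over repeated runs, not the paper's exact formulas.

```python
# Illustrative versions of the kinds of metrics the framework describes:
# an invalid-response rate and intra-rater consistency over repeated runs.
# Definitions here are common-sense choices, not the paper's exact formulas.
from collections import Counter
from typing import List, Optional

def invalid_rate(responses: List[Optional[str]]) -> float:
    """Share of responses that failed to parse into a valid binary label."""
    return sum(r is None for r in responses) / len(responses)

def intra_rater_consistency(runs: List[List[str]]) -> float:
    """Mean per-item modal agreement across repeated runs of the same model."""
    n_runs = len(runs)
    scores = []
    for labels in zip(*runs):                 # labels for one item across runs
        modal_count = Counter(labels).most_common(1)[0][1]
        scores.append(modal_count / n_runs)
    return sum(scores) / len(scores)
```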
- Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks. However, improvement is plateauing due to the exhaustion of readily available high-quality data. We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z)
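A minimal sketch of one self-play round under the usual solver-verifier reading: the same model writes code and tests, and only mutually consistent pairs are kept for further training. All components are placeholders, not the Sol-Ver implementation.

```python
# Schematic solver-verifier self-play round: one model proposes both code
# and tests, and mutually consistent pairs become new training data.
from typing import Callable, List, Tuple

def self_play_round(problems: List[str],
                    solve: Callable[[str], str],        # model as code writer
                    make_tests: Callable[[str], str],   # model as test writer
                    passes: Callable[[str, str], bool]  # runs tests on code
                    ) -> List[Tuple[str, str, str]]:
    accepted = []
    for p in problems:
        code, tests = solve(p), make_tests(p)
        if passes(code, tests):                 # keep only agreeing pairs
            accepted.append((p, code, tests))
    return accepted                             # fine-tune on these, then repeat
```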
- S*: Test Time Scaling for Code Generation [55.11863577956177]
We propose S*, the first hybrid test-time scaling framework for code generation. S* substantially improves the coverage and selection accuracy of generated code.
arXiv Detail & Related papers (2025-02-20T09:18:53Z)
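In the spirit of the abstract, test-time scaling here could mean sampling many candidates for coverage and selecting by execution; a minimal best-of-n sketch under that assumption, with placeholder callables.

```python
# Best-of-n test-time scaling sketch: sample candidates for coverage, then
# select by an execution-based score. Callables are illustrative assumptions.
from typing import Callable

def scale_at_test_time(problem: str,
                       sample: Callable[[str], str],
                       score: Callable[[str], float],  # e.g. test pass rate
                       n: int = 16) -> str:
    candidates = [sample(problem) for _ in range(n)]   # parallel sampling
    return max(candidates, key=score)                  # execution-based pick
```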
- UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. We introduce UnitCoder, a systematic pipeline that leverages model-generated unit tests to guide and validate the code generation process, offering a scalable way to synthesize high-quality code data from pre-training corpora.
arXiv Detail & Related papers (2025-02-17T05:37:02Z)
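A rough sketch of such a unit-test-guided pipeline: generate tests for mined snippets, keep those whose tests pass, and retry failures with the error message. Helper names are illustrative, not UnitCoder's API.

```python
# Unit-test-guided filtering sketch; helper names are illustrative,
# not UnitCoder's API.
from typing import Callable, List, Tuple

def unit_test_filter(snippets: List[str],
                     gen_tests: Callable[[str], str],
                     run: Callable[[str, str], Tuple[bool, str]],
                     fix: Callable[[str, str], str],
                     max_fixes: int = 2) -> List[Tuple[str, str]]:
    kept = []
    for code in snippets:
        tests = gen_tests(code)
        ok, err = run(code, tests)
        for _ in range(max_fixes):
            if ok:
                break
            code = fix(code, err)          # debug with the failure message
            ok, err = run(code, tests)
        if ok:
            kept.append((code, tests))     # high-quality pair for training
    return kept
```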
- NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts [31.783388267874738]
We propose NaturalCodeBench (NCB), a challenging code benchmark designed to mirror the complexity and variety of scenarios in real coding tasks.
NCB comprises 402 high-quality problems in Python and Java, meticulously selected from natural user queries from online coding services.
Our systematic experiments on 39 LLMs find that performance gaps on NCB between models with close HumanEval scores could still be significant.
arXiv Detail & Related papers (2024-05-07T17:52:51Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
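The equivalent-compute idea reduces to a simple conversion: a fixed accelerator-hour budget plus a model's measured throughput yields its token budget, so faster models get to train on more tokens. A back-of-the-envelope sketch with made-up numbers.

```python
# Back-of-the-envelope version of the equivalent-compute protocol: convert
# an accelerator-hour budget into per-model token budgets via measured
# throughput. Numbers below are made up for illustration.
def token_budget(accelerator_hours: float, tokens_per_second: float) -> int:
    """Tokens a model may train on under a fixed accelerator-hour budget."""
    return int(accelerator_hours * 3600 * tokens_per_second)

# A model with double the throughput sees twice the tokens for equal compute:
assert token_budget(6.0, 20_000) == 432_000_000
assert token_budget(6.0, 40_000) == 864_000_000
```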
- CodeGen2: Lessons for Training LLMs on Programming and Natural Languages [116.74407069443895]
We unify encoder- and decoder-based models into a single prefix-LM.
For learning methods, we explore the claim of a "free lunch" hypothesis.
For data distributions, we explore how mixing programming and natural languages and training for multiple epochs affect model performance.
arXiv Detail & Related papers (2023-05-03T17:55:25Z)
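A prefix-LM attends bidirectionally within the prefix (an encoder-style view of the prompt) and causally over the continuation; a minimal mask construction illustrating the idea, independent of any CodeGen2 internals.

```python
# Minimal prefix-LM attention mask: bidirectional over the prefix,
# causal over the continuation. Independent of CodeGen2 internals.
import numpy as np

def prefix_lm_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """Boolean mask; True where position i may attend to position j."""
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # causal base
    mask[:, :prefix_len] = True   # every position sees the whole prefix
    return mask

# With seq_len=5 and prefix_len=2, position 0 can attend to position 1,
# which a purely causal mask would forbid:
assert prefix_lm_mask(5, 2)[0, 1]
```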
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.