Comprehensive Evaluation of Large Language Models on Software Engineering Tasks: A Multi-Task Benchmark
- URL: http://arxiv.org/abs/2602.07079v1
- Date: Fri, 06 Feb 2026 03:30:19 GMT
- Title: Comprehensive Evaluation of Large Language Models on Software Engineering Tasks: A Multi-Task Benchmark
- Authors: Go Frendi Gunawan, Mukhlis Amien
- Abstract summary: Large Language Models (LLMs) have demonstrated remarkable capabilities in software engineering. We present a multi-task evaluation of 11 state-of-the-art LLMs across five representative software engineering tasks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in software engineering, yet comprehensive benchmarks covering diverse SE activities remain limited. We present a multi-task evaluation of 11 state-of-the-art LLMs across five representative software engineering tasks: bug fixing, feature development, code refactoring, technical copywriting, and research synthesis. Our automated verification framework measures both output quality and completion efficiency. Key findings reveal that (1) models achieving identical perfect scores exhibit 22x variation in completion time, 49x variation in tool efficiency, and 53x variation in estimated cost; (2) tool usage frequency shows no correlation with success (r = 0.077, p = 0.575): one model used 917 tool calls while another solved the same task with 3 calls; (3) we identify two distinct inefficiency patterns: loop inefficiency and inference inefficiency; and (4) coding tasks achieve 100% success while research tasks present greater challenges (90.9%). We release all experimental data, verification scripts, and analysis code for full reproducibility.
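The headline correlation result is a plain Pearson test over per-run statistics. As a rough illustration only, the sketch below shows how such a check could be run against the released data; the file name and column names are assumptions, not the paper's actual schema.

```python
# Sketch of the tool-usage-vs-success correlation check reported in the
# abstract (r = 0.077, p = 0.575). The CSV layout and column names are
# assumptions, not the paper's actual schema.
import pandas as pd
from scipy.stats import pearsonr

runs = pd.read_csv("runs.csv")  # hypothetical: one row per (model, task) run

# Pearson correlation between number of tool calls and task success (0/1).
r, p = pearsonr(runs["tool_calls"], runs["success"])
print(f"r = {r:.3f}, p = {p:.3f}")

# The same data would also support the spread statistics quoted above,
# e.g. the 22x completion-time variation among runs with a perfect score.
perfect = runs[runs["score"] == runs["score"].max()]
print("time spread:", perfect["wall_time_s"].max() / perfect["wall_time_s"].min())
```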
Related papers
- Failure-Aware Enhancements for Large Language Model (LLM) Code Generation: An Empirical Study on Decision Framework
In an empirical study of 25 GitHub projects, we found that progressive prompting achieves 96.9% average task completion. Self-critique succeeds on code-reviewable logic errors but fails completely on external service integration. RAG achieves the highest completion across all failure types with superior efficiency.
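As a rough illustration of the progressive-prompting idea described above (each retry escalates the prompt with the observed failure), here is a minimal sketch; `generate` and `run_tests` are hypothetical stand-ins, not interfaces from the study.

```python
# Toy sketch of progressive prompting: on each failure, the next prompt is
# enriched with the observed error. `generate` and `run_tests` are
# placeholders, not APIs from the paper.
from typing import Callable

def progressive_fix(task: str,
                    generate: Callable[[str], str],
                    run_tests: Callable[[str], tuple[bool, str]],
                    max_rounds: int = 3) -> str | None:
    prompt = task
    for round_no in range(max_rounds):
        code = generate(prompt)
        ok, error_log = run_tests(code)
        if ok:
            return code
        # Escalate: carry the concrete failure forward into the next prompt.
        prompt = (f"{task}\n\nAttempt {round_no + 1} failed with:\n"
                  f"{error_log}\nFix the code accordingly.")
    return None
```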
arXiv Detail & Related papers (2026-02-02T23:08:03Z)
- Parameter-Efficient Multi-Task Fine-Tuning in Code-Related Tasks
We investigate Multi-task QLoRA fine-tuning across three representative tasks: code generation, translation, and summarization. Our findings show that Multi-task QLoRA effectively leverages transfer learning, achieving competitive or superior performance. Larger models demonstrate a more consistent balance between correctness and quality, whereas smaller models preserve functionality but exhibit a higher incidence of quality-related issues.
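For context, a QLoRA-style setup (4-bit quantized base model plus LoRA adapters) with the Hugging Face `peft` and `bitsandbytes` stack generally looks like the sketch below; the base model and hyperparameters are illustrative, not the paper's.

```python
# Illustrative QLoRA setup (4-bit quantized base model + LoRA adapters).
# Model name and hyperparameters are placeholders, not the paper's choices.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoderbase-1b",  # hypothetical base model
    quantization_config=bnb_config,
)
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# Multi-task fine-tuning would then mix generation, translation, and
# summarization examples into one training stream for this single adapter.
```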
arXiv Detail & Related papers (2026-01-21T15:33:16Z)
- AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent
Large Reasoning Models (LRMs) like o3 and DeepSeek-R1 have achieved remarkable progress in natural language reasoning with long chains of thought. However, they remain computationally inefficient and struggle with accuracy when solving problems requiring complex mathematical operations. We present AgentMath, an agent framework that seamlessly integrates language models' reasoning capabilities with code interpreters' computational precision.
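The general pattern behind such tool-augmented agents (the model drafts reasoning, delegates exact computation to an interpreter, and folds the result back into its context) can be sketched as follows; all names and the RUN: convention are invented for illustration, not taken from AgentMath.

```python
# Generic reason-act loop for a tool-augmented math agent: the model emits
# code, a sandboxed interpreter runs it, and the result re-enters the
# transcript. `llm` and `execute_python` are hypothetical stand-ins.
def solve(problem: str, llm, execute_python, max_steps: int = 8) -> str:
    transcript = [f"Problem: {problem}"]
    for _ in range(max_steps):
        step = llm("\n".join(transcript))  # reasoning, possibly with code
        if step.startswith("RUN:"):        # illustrative convention: the
            code = step[len("RUN:"):]      # model marks code to execute
            result = execute_python(code)  # offload exact arithmetic
            transcript += [step, f"Interpreter output: {result}"]
        else:
            return step                    # final natural-language answer
    return transcript[-1]
```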
arXiv Detail & Related papers (2025-12-23T19:57:49Z)
- LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering
We introduce LoCoBench-Agent, a comprehensive evaluation framework specifically designed to assess large language model (LLM) agents in realistic, long-context software engineering. Our framework extends LoCoBench's 8,000 scenarios into interactive agent environments, enabling systematic evaluation of multi-turn conversations. Our framework provides agents with 8 specialized tools (file operations, search, code analysis) and evaluates them across context lengths ranging from 10K to 1M tokens.
arXiv Detail & Related papers (2025-11-17T23:57:24Z)
- The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
Toolathlon is a benchmark for language agents offering diverse Apps and tools, a realistic environment setup, and reliable execution-based evaluation. The benchmark includes 108 manually sourced or crafted tasks, each requiring interaction with multiple Apps over roughly 20 turns on average to complete. We expect Toolathlon to drive the development of more capable language agents for real-world, long-horizon task execution.
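Execution-based evaluation of this kind verifies the end state of the environment rather than string-matching the agent's reply. A minimal sketch under that assumption (the agent interface and the checked artifact are invented for illustration):

```python
# Minimal execution-based check: run the agent inside a scratch workspace,
# then verify the environment's end state instead of comparing text output.
# The agent interface, task, and assertions are invented for illustration.
import json
import pathlib
import tempfile

def evaluate(agent, task_prompt: str) -> bool:
    with tempfile.TemporaryDirectory() as workdir:
        agent.run(task_prompt, cwd=workdir)  # hypothetical agent interface
        # Pass iff the agent actually produced the requested artifact.
        out = pathlib.Path(workdir) / "report.json"
        if not out.exists():
            return False
        data = json.loads(out.read_text())
        return data.get("status") == "complete"
```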
arXiv Detail & Related papers (2025-10-29T17:32:49Z)
- OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks
We present OmniEAR, a framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. We model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints.
arXiv Detail & Related papers (2025-08-07T17:54:15Z)
- Chain of Draft for Software Engineering: Challenges in Applying Concise Reasoning to Code Tasks
This research extends the Chain of Draft (CoD) method to software engineering. All CoD variants used significantly fewer tokens than Chain of Thought (CoT). CoD variants maintain over 90% of CoT's code quality across key metrics, including correctness, compatibility, and maintainability.
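The token-saving mechanism is purely prompt-level: CoD caps the length of each reasoning step where CoT encourages full explanations. A small sketch measuring the prompt-side difference with `tiktoken` (the prompt wording is illustrative; the study's savings are measured on model outputs):

```python
# Sketch contrasting Chain-of-Thought and Chain-of-Draft prompting styles
# and measuring token footprint. Prompt wording is illustrative only; the
# paper's token savings are measured on the generated responses.
import tiktoken

COT = ("Think step by step, explaining each step in full sentences, "
       "then write the function.")
COD = ("Think step by step, but keep each draft step to five words or "
       "fewer, then write the function.")

enc = tiktoken.get_encoding("cl100k_base")

def tokens(text: str) -> int:
    return len(enc.encode(text))

for name, style in [("CoT", COT), ("CoD", COD)]:
    print(name, tokens(style), "prompt tokens")
```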
arXiv Detail & Related papers (2025-03-12T07:44:18Z)
- Code Review Automation Via Multi-task Federated LLM -- An Empirical Study
The study explores five simple techniques for multi-task training, including two sequential methods, one parallel method, and two cumulative methods. The results indicate that sequentially training a federated LLM (FedLLM) for our code review multi-task use case is less efficient in terms of time, computation, and performance metrics, compared to training separate models for each task.
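For background, the parallel variant corresponds to standard federated averaging: each client fine-tunes locally on its own task and the server averages the resulting weights. A minimal sketch with NumPy arrays standing in for model parameters:

```python
# Minimal FedAvg-style aggregation, the core of a "parallel" multi-task
# federated setup: clients train locally on different tasks, the server
# averages. NumPy arrays stand in for real model parameters.
import numpy as np

def local_update(weights: np.ndarray, task_grad: np.ndarray,
                 lr: float = 0.01) -> np.ndarray:
    return weights - lr * task_grad  # one simplified local training step

def fedavg(client_weights: list[np.ndarray]) -> np.ndarray:
    return np.mean(client_weights, axis=0)

global_w = np.zeros(4)
grads = [np.array([1.0, 0.0, 0.0, 0.0]),   # e.g. a code-review client
         np.array([0.0, 1.0, 0.0, 0.0])]   # e.g. a refinement client
global_w = fedavg([local_update(global_w, g) for g in grads])
print(global_w)  # averaged update across task-specific clients
```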
arXiv Detail & Related papers (2024-12-20T08:46:46Z)
- BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
We introduce BigCodeBench, a benchmark that challenges Large Language Models (LLMs) to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained tasks. Our evaluation shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than human performance of 97%. We propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions containing only essential information.
arXiv Detail & Related papers (2024-06-22T15:52:04Z)
- Program Synthesis Benchmark for Visual Programming in XLogoOnline Environment
Large language and multimodal models have shown remarkable success on various benchmarks focused on specific skills. However, it is unclear how well these models perform on tasks that require a combination of these skills. In this paper, we curate a novel program synthesis benchmark based on real-world tasks in the XLogoOnline visual programming environment.
arXiv Detail & Related papers (2024-06-17T08:48:02Z)
- NEVIS'22: A Stream of 100 Tasks Sampled from 30 Years of Computer Vision Research
We introduce the Never-Ending VIsual-classification Stream (NEVIS'22), a benchmark consisting of a stream of over 100 visual classification tasks.
Despite being limited to classification, the resulting stream has a rich diversity of tasks, from OCR to texture analysis, scene recognition, and so forth.
Overall, NEVIS'22 poses an unprecedented challenge for current sequential learning approaches due to the scale and diversity of tasks.
arXiv Detail & Related papers (2022-11-15T18:57:46Z)