Cross-Task Benchmarking and Evaluation of General-Purpose and Code-Specific Large Language Models
- URL: http://arxiv.org/abs/2512.04673v1
- Date: Thu, 04 Dec 2025 11:06:33 GMT
- Title: Cross-Task Benchmarking and Evaluation of General-Purpose and Code-Specific Large Language Models
- Authors: Gunjan Das, Paheli Bhattacharya, Rishabh Gupta,
- Abstract summary: Large Language Models (LLMs) have revolutionized both general natural language processing and domain-specific applications such as code synthesis, legal reasoning, and finance. We present a comprehensive evaluation of five general-purpose and three code-specific state-of-the-art LLMs across six diverse benchmarks. Our findings reveal that models optimized for code (e.g., CodeLLaMA variants) exhibit strong reasoning and syntactic precision, and can show measurable performance gains even on non-coding tasks.
- Score: 3.603673783661375
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have revolutionized both general natural language processing and domain-specific applications such as code synthesis, legal reasoning, and finance. However, while prior studies have explored individual model capabilities, a systematic cross-domain comparison that unifies linguistic, reasoning, and code understanding abilities remains underexplored. In this work, we present a comprehensive evaluation of five general-purpose and three code-specific state-of-the-art LLMs across six diverse benchmarks encompassing linguistic competence, mathematical reasoning, and trustworthiness. Additionally, we analyze model behavior on the CoNaLa dataset for code explanation, comparing natural language and code-specialized LLMs. Our findings reveal that models optimized for code (e.g., CodeLLaMA variants) exhibit strong reasoning and syntactic precision, and can show measurable performance gains even on non-coding tasks, in contrast to general-purpose models like Mistral-7B and Llama-3-8B.
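As a rough illustration of the evaluation setup described in the abstract, the sketch below loads a couple of checkpoints, runs them over a few toy benchmark items, and reports per-task scores. The model IDs, benchmark prompts, and exact-match scorer are illustrative placeholders, not the paper's actual six benchmarks or scoring protocol.

```python
"""Hedged sketch of a cross-task LLM evaluation loop (not the authors' pipeline)."""
from transformers import pipeline

# Placeholder checkpoints: one general-purpose and one code-specialized model.
MODELS = [
    "mistralai/Mistral-7B-Instruct-v0.2",
    "codellama/CodeLlama-7b-Instruct-hf",
]

# Placeholder benchmarks: each task maps to (prompt, reference_answer) pairs.
BENCHMARKS = {
    "linguistic": [("Which word is a noun: 'quickly' or 'table'?", "table")],
    "math_reasoning": [("What is 17 * 6?", "102")],
    "code_explanation": [("Explain: x = [i*i for i in range(5)]", "squares")],
}

def exact_match(prediction: str, reference: str) -> float:
    """Crude scorer: 1.0 if the reference string appears in the prediction."""
    return float(reference.strip().lower() in prediction.strip().lower())

def evaluate(model_name: str) -> dict:
    """Run one model over every benchmark and return per-task mean scores."""
    generator = pipeline("text-generation", model=model_name)
    scores = {}
    for task, examples in BENCHMARKS.items():
        per_example = []
        for prompt, reference in examples:
            out = generator(prompt, max_new_tokens=64, do_sample=False)
            # The pipeline echoes the prompt, so strip it to keep the completion only.
            completion = out[0]["generated_text"][len(prompt):]
            per_example.append(exact_match(completion, reference))
        scores[task] = sum(per_example) / len(per_example)
    return scores

if __name__ == "__main__":
    for name in MODELS:
        print(name, evaluate(name))
```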
Related papers
- CodeSimpleQA: Scaling Factuality in Code Large Language Models [55.705748501461294]
We present CodeSimpleQA, a comprehensive benchmark designed to evaluate the factual accuracy of code LLMs in answering code-related questions. We also create CodeSimpleQA-Instruct, a large-scale instruction corpus with 66M samples, and develop a post-training framework combining supervised fine-tuning and reinforcement learning.
arXiv Detail & Related papers (2025-12-22T14:27:17Z)
- From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence [150.3696990310269]
Large language models (LLMs) have transformed automated software development by enabling direct translation of natural language descriptions into functional code. We provide a comprehensive synthesis and practical guide (a series of analytic and probing experiments) for code LLMs. We analyze the code capabilities of general-purpose LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder).
arXiv Detail & Related papers (2025-11-23T17:09:34Z)
- On Code-Induced Reasoning in LLMs [21.875805779552564]
We construct parallel instruction datasets in ten programming languages. We apply controlled perturbations that selectively disrupt structural or semantic properties of code. Across 3,331 experiments, our results show that LLMs are more vulnerable to structural perturbations than semantic ones.
arXiv Detail & Related papers (2025-09-25T19:57:36Z)
- Evaluating Large Language Models for Functional and Maintainable Code in Industrial Settings: A Case Study at ASML [3.5515013986822073]
We present a case study conducted in collaboration with the leveling department at ASML. We investigate the performance of LLMs in generating functional, maintainable code within a closed, highly specialized software environment. The findings reveal that prompting techniques and model size have a significant impact on output quality.
arXiv Detail & Related papers (2025-09-15T19:39:26Z)
- mSCoRe: a Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning [74.97363626515236]
We propose a Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning (mSCoRe). Our benchmark incorporates three key components that are designed to systematically evaluate LLMs' reasoning capabilities. Our results reveal the limitations of such reasoning-reinforced models when confronted with nuanced multilingual general and cultural commonsense.
arXiv Detail & Related papers (2025-08-13T18:59:02Z)
- IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs. The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z)
- ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts [8.181151553582488]
ScholarBench is a benchmark for evaluating the academic reasoning ability of large language models (LLMs). The benchmark comprises 5,031 examples in Korean and 5,309 in English, with even state-of-the-art models like o3-mini achieving an average evaluation score of only 0.543.
arXiv Detail & Related papers (2025-05-22T11:59:06Z)
- EquiBench: Benchmarking Large Language Models' Reasoning about Program Semantics via Equivalence Checking [58.15568681219339]
We introduce EquiBench, a new benchmark for evaluating large language models (LLMs) via equivalence checking. This task directly tests a model's ability to reason about program semantics. We evaluate 19 state-of-the-art LLMs and find that in the most challenging categories, the best accuracies are 63.8% and 76.2%, only modestly above the 50% random baseline.
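A minimal sketch of what an equivalence-checking probe of this kind could look like; the toy program pairs, prompt template, and ask_model stub below are illustrative assumptions, not EquiBench's actual data or harness.

```python
"""Illustrative equivalence-checking probe (not EquiBench's code).

ask_model is a stub; swap in a real LLM call. The program pairs are toy
examples chosen so the ground-truth labels are easy to verify by hand."""

# (program_a, program_b, equivalent?) -- toy labeled pairs.
PAIRS = [
    ("def f(x): return x + x", "def f(x): return 2 * x", True),
    ("def f(x): return x * x", "def f(x): return x + x", False),
]

PROMPT = (
    "Do these two Python functions compute the same output for every input?\n"
    "Program A:\n{a}\nProgram B:\n{b}\nAnswer YES or NO."
)

def ask_model(prompt: str) -> str:
    """Stub standing in for an LLM call; always answers YES here."""
    return "YES"

def accuracy() -> float:
    """Score the model's YES/NO judgments against the labeled pairs."""
    correct = 0
    for a, b, label in PAIRS:
        answer = ask_model(PROMPT.format(a=a, b=b))
        predicted = answer.strip().upper().startswith("YES")
        correct += int(predicted == label)
    return correct / len(PAIRS)

print(f"accuracy = {accuracy():.2f}  (random guessing would sit near 0.50)")
```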
arXiv Detail & Related papers (2025-02-18T02:54:25Z)
- Analysis on LLMs Performance for Code Summarization [0.0]
Large Language Models (LLMs) have significantly advanced the field of code summarization. This study aims to perform a comparative analysis of several open-source LLMs, namely LLaMA-3, Phi-3, Mistral, and Gemma.
arXiv Detail & Related papers (2024-12-22T17:09:34Z)
- MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration [98.18244218156492]
Large Language Models (LLMs) have significantly advanced natural language processing. As their applications expand into multi-agent environments, there arises a need for a comprehensive evaluation framework. This work introduces a novel competition-based benchmark framework to assess LLMs within multi-agent settings.
arXiv Detail & Related papers (2023-11-14T21:46:27Z)
- L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
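Confidence calibration of the kind L2CEval measures is commonly summarized with expected calibration error (ECE); the binned estimator sketched below is a generic version and may differ from the exact metric and binning used in that paper.

```python
"""Generic expected calibration error (ECE) sketch; L2CEval's exact metric
may differ. Inputs are per-example model confidences in [0, 1] and binary
correctness labels for the generated programs."""

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Each example falls in exactly one bin; the top bin is right-closed.
        in_bin = [i for i, c in enumerate(confidences)
                  if lo <= c < hi or (b == n_bins - 1 and c == hi)]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        avg_acc = sum(correct[i] for i in in_bin) / len(in_bin)
        ece += (len(in_bin) / n) * abs(avg_conf - avg_acc)
    return ece

# Toy usage: an overconfident model (high confidence, mixed correctness).
print(expected_calibration_error([0.9, 0.8, 0.95, 0.7], [1, 0, 1, 0]))
```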
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
- Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
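Execution-based code-generation benchmarks such as HumanEval and MBXP are typically scored with pass@k; the unbiased estimator below is the standard formulation, shown here as a generic sketch rather than the exact harness these papers use.

```python
"""Standard unbiased pass@k estimator for execution-based code benchmarks
such as HumanEval/MBXP. n = completions sampled per problem, c = completions
that pass the unit tests, k = sampling budget being scored."""
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k completions drawn without
    # replacement from the n samples passes the tests.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy usage: 3 of 20 sampled completions pass the tests.
print(f"pass@1  = {pass_at_k(20, 3, 1):.3f}")   # 0.150
print(f"pass@10 = {pass_at_k(20, 3, 10):.3f}")  # ~0.895
```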
arXiv Detail & Related papers (2022-10-26T17:17:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.