KOCO-BENCH: Can Large Language Models Leverage Domain Knowledge in Software Development?
- URL: http://arxiv.org/abs/2601.13240v1
- Date: Mon, 19 Jan 2026 17:20:16 GMT
- Title: KOCO-BENCH: Can Large Language Models Leverage Domain Knowledge in Software Development?
- Authors: Xue Jiang, Jiaru Qian, Xianjie Shi, Chenjie Li, Hao Zhu, Ziyu Wang, Jielun Zhang, Zheyu Zhao, Kechi Zhang, Jia Li, Wenpin Jiao, Zhi Jin, Ge Li, Yihong Dong
- Abstract summary: Large language models (LLMs) excel at general programming but struggle with domain-specific software development. Existing domain-specific code benchmarks cannot evaluate the effectiveness of domain specialization methods. We present KOCO-BENCH, a novel benchmark designed for evaluating domain specialization methods in real-world software development.
- Score: 58.85952408038657
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) excel at general programming but struggle with domain-specific software development, necessitating domain specialization methods for LLMs to learn and utilize domain knowledge and data. However, existing domain-specific code benchmarks cannot evaluate the effectiveness of domain specialization methods: they focus on assessing what knowledge LLMs possess rather than how they acquire and apply new knowledge, and they lack explicit knowledge corpora for developing domain specialization methods. To this end, we present KOCO-BENCH, a novel benchmark designed for evaluating domain specialization methods in real-world software development. KOCO-BENCH contains 6 emerging domains with 11 software frameworks and 25 projects, featuring curated knowledge corpora alongside multi-granularity evaluation tasks including domain code generation (from function-level to project-level with rigorous test suites) and domain knowledge understanding (via multiple-choice Q&A). Unlike previous benchmarks that only provide test sets for direct evaluation, KOCO-BENCH requires acquiring and applying diverse domain knowledge (APIs, rules, constraints, etc.) from knowledge corpora to solve evaluation tasks. Our evaluations reveal that KOCO-BENCH poses significant challenges to state-of-the-art LLMs. Even with domain specialization methods (e.g., SFT, RAG, kNN-LM) applied, improvements remain marginal. The best-performing coding agent, Claude Code, achieves only 34.2%, highlighting the urgent need for more effective domain specialization methods. We release KOCO-BENCH, evaluation code, and baselines to advance further research at https://github.com/jiangxxxue/KOCO-bench.
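The abstract describes a setup in which a method must first acquire knowledge from a corpus and then apply it to a task. As a rough illustration only, the sketch below shows one retrieval-style way this could look: rank corpus snippets by crude token overlap with the task and prepend the best ones to a prompt. This is not KOCO-BENCH's evaluation code or any of its baselines; the function names, the overlap scoring, the prompt layout, and the example corpus (including `frame_sync.align` and `session.close`) are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric/underscore tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def retrieve(query, corpus, k=2):
    """Return up to k corpus snippets ranked by token overlap with the query.

    Overlap counting is a deliberately crude stand-in for a real retriever;
    snippets with zero overlap are dropped.
    """
    query_counts = Counter(tokenize(query))
    scored = []
    for idx, doc in enumerate(corpus):
        doc_counts = Counter(tokenize(doc))
        overlap = sum(min(query_counts[t], doc_counts[t]) for t in query_counts)
        # -idx keeps earlier snippets first when scores tie.
        scored.append((overlap, -idx, doc))
    scored.sort(reverse=True)
    return [doc for overlap, _, doc in scored[:k] if overlap > 0]


def build_prompt(task, corpus, k=2):
    """Assemble a prompt that places retrieved knowledge before the task."""
    snippets = retrieve(task, corpus, k)
    context = "\n".join(f"- {s}" for s in snippets)
    return f"Domain knowledge:\n{context}\n\nTask:\n{task}\n"


# Hypothetical knowledge corpus and task, invented for this example.
corpus = [
    "frame_sync.align(a, b) requires both frames to share a sample rate.",
    "Rule: call session.close() before creating a new session.",
    "The plotting module is optional.",
]
task = "Write a function that uses frame_sync.align on two frames."

if __name__ == "__main__":
    print(build_prompt(task, corpus))
```

Because the scoring counts every shared token, common words such as "a" contribute to the overlap, so a realistic retriever would need stop-word handling or embedding-based similarity.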
Related papers
- Learning Domain Knowledge in Multimodal Large Language Models through Reinforcement Fine-Tuning [38.73465144699025]
We show that input-level domain knowledge injection yields little to no improvement on scientific multimodal tasks. We propose a reinforcement fine-tuning framework that incorporates domain knowledge directly into the learning objective.
arXiv Detail & Related papers (2026-01-23T03:10:08Z)
- MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks [56.34018316319873]
We propose MERA Code, a benchmark for evaluating code for the latest code generation LLMs in Russian. This benchmark includes 11 evaluation tasks that span 8 programming languages. We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages.
arXiv Detail & Related papers (2025-07-16T14:31:33Z)
- Top General Performance = Top Domain Performance? DomainCodeBench: A Multi-domain Code Generation Benchmark [38.14474956762422]
We introduce DomainCodeBench, a benchmark designed to evaluate large language models (LLMs) across 12 software application domains and 15 programming languages. We find that top general-domain models do not consistently excel in specific application domains. We show that augmenting prompts with domain-specific knowledge improves performance by around 38.17%.
arXiv Detail & Related papers (2024-12-24T17:56:08Z)
- Learning to Solve Domain-Specific Calculation Problems with Knowledge-Intensive Programs Generator [33.680619900836376]
We propose a pipeline to solve domain-specific calculation problems with a Knowledge-Intensive Programs Generator. It generates knowledge-intensive programs according to the domain-specific documents. We also find that the code generator is adaptable to other domains without training on the new knowledge.
arXiv Detail & Related papers (2024-12-12T13:42:58Z)
- EvoCodeBench: An Evolving Code Generation Benchmark with Domain-Specific Evaluations [87.34429475432998]
Existing benchmarks have two limitations - data leakage and lack of domain-specific evaluation.
EvoCodeBench will be dynamically updated every period (e.g., 6 months) to avoid data leakage.
This paper releases the first version - EvoCodeBench-2403, containing 275 samples from 25 repositories.
arXiv Detail & Related papers (2024-10-30T08:57:59Z)
- On the Effectiveness of Large Language Models in Domain-Specific Code Generation [20.61882220430463]
Large language models (LLMs) such as ChatGPT have shown remarkable capabilities in code generation. We investigate how to effectively incorporate API knowledge into the code generation process. We refer to these strategies as a new code generation approach called DomCoder.
arXiv Detail & Related papers (2023-12-04T05:41:02Z)
- Knowledge Plugins: Enhancing Large Language Models for Domain-Specific Recommendations [50.81844184210381]
We propose a general paradigm that augments large language models with DOmain-specific KnowledgE to enhance their performance on practical applications, namely DOKE.
This paradigm relies on a domain knowledge extractor, working in three steps: 1) preparing effective knowledge for the task; 2) selecting the knowledge for each specific sample; and 3) expressing the knowledge in an LLM-understandable way.
arXiv Detail & Related papers (2023-11-16T07:09:38Z)
- Domain Specialization as the Key to Make Large Language Models Disruptive: A Comprehensive Survey [100.24095818099522]
Large language models (LLMs) have significantly advanced the field of natural language processing (NLP).
They provide a highly useful, task-agnostic foundation for a wide range of applications.
However, directly applying LLMs to solve sophisticated problems in specific domains meets many hurdles.
arXiv Detail & Related papers (2023-05-30T03:00:30Z)
- Prior Knowledge Guided Unsupervised Domain Adaptation [82.9977759320565]
We propose a Knowledge-guided Unsupervised Domain Adaptation (KUDA) setting where prior knowledge about the target class distribution is available.
In particular, we consider two specific types of prior knowledge about the class distribution in the target domain: Unary Bound and Binary Relationship.
We propose a rectification module that uses such prior knowledge to refine model generated pseudo labels.
arXiv Detail & Related papers (2022-07-18T18:41:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.