Fugu-MT paper translation (abstract): UA-Code-Bench: A Competitive Programming Benchmark for Evaluating LLM Code Generation in Ukrainian

Paper overview: UA-Code-Bench: A Competitive Programming Benchmark for Evaluating LLM Code Generation in Ukrainian

arxiv url: http://arxiv.org/abs/2511.05040v1
Date: Fri, 07 Nov 2025 07:24:56 GMT
Status: translation complete
Last updated in system: 2025-11-10 21:00:44.70199
Title: UA-Code-Bench: A Competitive Programming Benchmark for Evaluating LLM Code Generation in Ukrainian
Title (reference translation): UA-Code-Bench: A Competitive Programming Benchmark for Evaluating LLM Code Generation in the Ukrainian Language
Authors: Mykyta Syromiatnikov, Victoria Ruvinskaya
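Abstract summary: This paper introduces UA-Code-Bench, a new open-source benchmark for thoroughly evaluating language models' code generation and competitive programming problem-solving abilities in Ukrainian. The benchmark comprises 500 problems from the Eolymp platform, evenly distributed across five complexity levels from very easy to very hard. The results show that even top-performing models, such as OpenAI o3 and GPT-5, solve only half of the problems.
Reference score (independently computed attention measure): 0.42970700836450487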
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluating the real capabilities of large language models in low-resource languages still represents a challenge, as many existing benchmarks focus on widespread tasks translated from English or evaluate only simple language understanding. This paper introduces UA-Code-Bench, a new open-source benchmark established for a thorough evaluation of language models' code generation and competitive programming problem-solving abilities in Ukrainian. The benchmark comprises 500 problems from the Eolymp platform, evenly distributed across five complexity levels from very easy to very hard. A diverse set of 13 leading proprietary and open-source models, generating Python solutions based on a one-shot prompt, was evaluated via the dedicated Eolymp environment against hidden tests, ensuring code correctness. The obtained results reveal that even top-performing models, such as OpenAI o3 and GPT-5, solve only half of the problems, highlighting the challenge of code generation in low-resource natural language. Furthermore, this research presents a comprehensive analysis of performance across various difficulty levels, as well as an assessment of solution uniqueness and computational efficiency, measured by both elapsed time and memory consumption of the generated solutions. In conclusion, this work demonstrates the value of competitive programming benchmarks in evaluating large language models, especially in underrepresented languages. It also paves the way for future research on multilingual code generation and reasoning-enhanced models. The benchmark, data parsing, preparation, code generation, and evaluation scripts are available at https://huggingface.co/datasets/NLPForUA/ua-code-bench.
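Abstract (reference translation): Evaluating the real capabilities of large language models in low-resource languages remains a challenge, because many existing benchmarks focus on widespread tasks translated from English or evaluate only simple language understanding. This paper introduces UA-Code-Bench, a new open-source benchmark for thoroughly evaluating language models' code generation and competitive programming problem-solving abilities in Ukrainian. The benchmark comprises 500 problems from the Eolymp platform, evenly distributed across five complexity levels from very easy to very hard. A diverse set of 13 leading proprietary and open-source models, each generating Python solutions from a one-shot prompt, was evaluated in the dedicated Eolymp environment against hidden tests, which ensures code correctness. The results show that even top-performing models, such as OpenAI o3 and GPT-5, solve only half of the problems, highlighting the difficulty of code generation in a low-resource natural language. The paper also presents a comprehensive analysis of performance across difficulty levels, together with an assessment of solution uniqueness and of computational efficiency, measured by both the elapsed time and the memory consumption of the generated solutions. In conclusion, the work demonstrates the value of competitive programming benchmarks for evaluating large language models, especially in underrepresented languages, and paves the way for future research on multilingual code generation and reasoning-enhanced models. The benchmark and the data parsing, preparation, code generation, and evaluation scripts are available at https://huggingface.co/datasets/NLPForUA/ua-code-bench.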
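
The per-difficulty analysis mentioned in the abstract can be pictured with a small sketch. The Python below is not the authors' evaluation script; it only illustrates, under stated assumptions, how a solve rate per complexity level and an overall solve rate might be aggregated from pass/fail verdicts. The function name solve_rate_by_level, the three middle level names ("easy", "medium", "hard"), and the toy verdicts are all hypothetical; the abstract names only "very easy" and "very hard" and reports no per-problem data.

```python
from collections import defaultdict

# Assumed level names: the abstract only states five levels
# ranging from "very easy" to "very hard".
LEVELS = ["very easy", "easy", "medium", "hard", "very hard"]


def solve_rate_by_level(results):
    """Return {level: fraction solved} for (level, solved) pairs.

    Levels with no attempted problems are left out of the result.
    """
    solved = defaultdict(int)
    total = defaultdict(int)
    for level, ok in results:
        total[level] += 1
        solved[level] += int(ok)
    return {lvl: solved[lvl] / total[lvl] for lvl in LEVELS if total[lvl]}


# Made-up verdicts for illustration only (not results from the paper).
toy_results = [
    ("very easy", True), ("very easy", True),
    ("medium", True), ("medium", False),
    ("very hard", False), ("very hard", False),
]

rates = solve_rate_by_level(toy_results)
overall = sum(ok for _, ok in toy_results) / len(toy_results)
print(rates)
print(overall)
```

With an even split of 500 problems over five levels, each level would contribute 100 problems, so a per-level rate computed this way would move in steps of 0.01.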
