Fugu-MT Paper Translation (Abstract): KoSimpleQA: A Korean Factuality Benchmark with an Analysis of Reasoning LLMs

Paper overview: KoSimpleQA: A Korean Factuality Benchmark with an Analysis of Reasoning LLMs

arxiv url: http://arxiv.org/abs/2510.18368v1
Date: Tue, 21 Oct 2025 07:37:51 GMT
Status: Translation complete
System update date: 2025-10-25 03:08:13.13495
Title: KoSimpleQA: A Korean Factuality Benchmark with an Analysis of Reasoning LLMs
Authors: Donghyeon Ko, Yeguk Jin, Kyubyung Chae, Byungwook Lee, Chansong Jo, Sookyo In, Jaehong Lee, Taesup Kim, Donghyun Kwak
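Abstract summary: We propose KoSimpleQA, a benchmark for evaluating factuality in large language models (LLMs). KoSimpleQA is designed to be challenging yet easy to grade, consisting of 1,000 short, fact-seeking questions with unambiguous answers.
Reference score (attention score computed independently by this site): 17.444084723682675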
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present Korean SimpleQA (KoSimpleQA), a benchmark for evaluating factuality in large language models (LLMs) with a focus on Korean cultural knowledge. KoSimpleQA is designed to be challenging yet easy to grade, consisting of 1,000 short, fact-seeking questions with unambiguous answers. We conduct a comprehensive evaluation across a diverse set of open-source LLMs of varying sizes that support Korean, and find that even the strongest model generates a correct answer only 33.7% of the time, underscoring the challenging nature of KoSimpleQA. Notably, performance rankings on KoSimpleQA differ substantially from those on the English SimpleQA, highlighting the unique value of our dataset. Furthermore, our analysis of reasoning LLMs shows that engaging reasoning capabilities in the factual QA task can both help models better elicit their latent knowledge and improve their ability to abstain when uncertain. KoSimpleQA can be found at https://anonymous.4open.science/r/KoSimpleQA-62EB.
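
The abstract reports a correct-answer rate and discusses abstention, so a small tally of graded answers may help make those quantities concrete. The sketch below is an illustration only: it assumes each model answer has already been graded with one of three labels (`correct`, `incorrect`, `not_attempted`), in the style of the English SimpleQA. The label names, the `summarize` helper, and the toy data are hypothetical and are not taken from the KoSimpleQA paper.

```python
from collections import Counter

# Hypothetical label names; the abstract does not specify the grading labels.
LABELS = ("correct", "incorrect", "not_attempted")


def summarize(grades):
    """Return overall accuracy, abstention rate, and accuracy on attempted questions."""
    if not grades:
        raise ValueError("grades must not be empty")
    counts = Counter(grades)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]
    return {
        # Share of all questions answered correctly.
        "correct": counts["correct"] / total,
        # Share of all questions the model declined to answer.
        "not_attempted": counts["not_attempted"] / total,
        # Accuracy restricted to questions the model actually attempted.
        "correct_given_attempted": counts["correct"] / attempted if attempted else 0.0,
    }


# Made-up toy grades for illustration only, not results from the paper.
toy = ["correct"] * 3 + ["incorrect"] * 5 + ["not_attempted"] * 2
print(summarize(toy))
```

Reporting accuracy on attempted questions alongside the abstention rate is one way to separate "knows the answer" from "declines when unsure"; whether the paper uses exactly these quantities is not stated in the abstract.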

Related papers

Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering [73.73820209993515]
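KoLasSimpleQA is the first benchmark for evaluating the multilingual factual ability of Large Language Models (LLMs). Inspired by existing work, we created a question set with features such as single-knowledge-point coverage, absolute objectivity, unique answers, and temporal stability. The results showed a significant performance gap between the two domains.
Paper reference translation (metadata) (2025-05-22T12:27:02Z)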
Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering [78.89231943329885]
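Multiple-Choice Question Answering (MCQA) is widely used to evaluate large language models (LLMs). The results suggest that multiple factors can significantly affect the reported performance of LLMs. We analyze whether existing answer extraction methods agree with human judgment.
Paper reference translation (metadata) (2025-03-19T08:45:03Z)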
VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering [28.045285777736876]
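We introduce VisualSimpleQA, a multimodal fact-seeking benchmark with two key features. It enables streamlined, decoupled evaluation of LVLMs in the visual and linguistic modalities. In experiments on 15 LVLMs, even state-of-the-art models such as GPT-4o achieve only a little over 60% accuracy.
Paper reference translation (metadata) (2025-03-09T07:25:32Z)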
Grounding-IQA: Grounding Multimodal Language Model for Image Quality Assessment [76.77693558769934]
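This paper introduces grounding-IQA, a new task paradigm for image quality assessment (IQA). The paradigm integrates multimodal referring and grounding with IQA to achieve finer-grained quality perception. We developed a well-designed benchmark called GIQA-Bench, which evaluates grounding-IQA performance from three perspectives: description quality, VQA accuracy, and grounding precision.
Paper reference translation (metadata) (2024-11-26T09:03:16Z)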
Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models [24.47838086336772]
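Chinese SimpleQA is the first comprehensive Chinese benchmark for evaluating the factuality of language models when answering short questions. We focus on the Chinese language across six major topics with 99 diverse subtopics.
Paper reference translation (metadata) (2024-11-11T17:10:56Z)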
PEDANTS: Cheap but Effective and Interpretable Answer Equivalence [10.367359022491181]
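We provide rubrics and a dataset, adopted from the trivia community, for evaluating machine QA. We also propose an efficient and interpretable QA evaluation that is more stable than exact match and neural methods (BERTScore).
Paper reference translation (metadata) (2024-02-17T01:56:19Z)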

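The related papers list is generated automatically from the titles and abstracts of papers on this site.

This page shows information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from the use of this site (including all information and translations).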