Fugu-MT Paper Translation (Abstract): NSMQ Riddles: A Benchmark of Scientific and Mathematical Riddles for Quizzing Large Language Models

Paper overview: NSMQ Riddles: A Benchmark of Scientific and Mathematical Riddles for Quizzing Large Language Models

arxiv url: http://arxiv.org/abs/2605.07051v1
Date: Fri, 08 May 2026 00:00:08 GMT
Status: Translation complete
Last updated in system: 2026-05-11 19:43:38.682155
Title: NSMQ Riddles: A Benchmark of Scientific and Mathematical Riddles for Quizzing Large Language Models
Title (reference translation): NSMQ Riddles: A Benchmark of Scientific and Mathematical Riddles for Quizzing Large Language Models
Authors: George Boateng, Naafi Ibrahim, Samuel John, Philemon Badu, Patrick Agyeman-Budu, Jonathan Mensah, Kevin Yeboah, William Edor, Andrew Mensa-Onumah, Nana Yeboah, Victor Wumbor-Apin Kumbol
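Abstract summary: Large language models (LLMs) have shown good performance on various science education benchmarks. LLMs tend to be evaluated on Western datasets. NSMQ Riddles is a new benchmark of science and mathematics riddles from Ghana's National Science and Maths Quiz (NSMQ) competition.
Reference score (independently computed attention score): 0.15274583259797847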
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large Language Models (LLMs) have shown good performance on various science educational benchmarks, demonstrating their potential for use in science and mathematics education. Yet, LLMs tend to be evaluated on science and mathematical educational datasets from the Western world, with an underrepresentation of datasets from the Global South. Furthermore, they tend to have multiple-choice answer options that are trivial to evaluate. In this work, we present NSMQ Riddles, a novel benchmark of Scientific and Mathematical Riddles from Ghana's National Science and Maths Quiz (NSMQ) competition to evaluate LLMs. The NSMQ is an annual live TV competition for senior secondary school students in Ghana that brings together the smartest high school students in Ghana who compete in teams of 2 by answering questions in biology, chemistry, physics, and math over five rounds and five stages until a winning team is crowned for that year. NSMQ Riddles consists of 11 years of riddle questions (n=1.8K) from the 5th round, with each riddle containing a minimum of 3 clues. Students compete to be the first to guess the answer on any of the clues, with earlier clues being vague and also fetching more points. The answers are usually a number, word, or short phrase, allowing for automatic evaluation. We evaluated state-of-the-art models: closed (GPT-5.4, Gemini 3.1 Pro, Claude Opus 4.6) and open models (Kimi-K2.5, DeepSeek-V3.1, GPT-OSS-120B) with high and low reasoning settings. Our evaluation shows that the dataset is challenging even for state-of-the-art LLMs, which performed worse than the best student contestants. This work contributes a novel and challenging benchmark for scientific and mathematical reasoning from the Global South towards enabling a true global benchmarking of LLMs' capabilities for science and mathematics education.
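Abstract (reference translation): Large language models (LLMs) have shown good performance on various science education benchmarks, indicating their potential for use in science and mathematics education. However, LLMs tend to be evaluated on science and mathematics education datasets from the Western world, while datasets from the Global South are underrepresented. Moreover, these datasets tend to offer multiple-choice answer options, which are trivial to evaluate. This paper introduces NSMQ Riddles, a new benchmark of science and mathematics riddles from Ghana's National Science and Maths Quiz (NSMQ) competition, for evaluating LLMs. The NSMQ is an annual live TV competition for senior secondary school students in Ghana; it brings together the country's smartest high school students, who compete in teams of two by answering questions in biology, chemistry, physics, and mathematics. NSMQ Riddles consists of 11 years of riddle questions (n=1.8K) from the fifth round, each containing at least three clues. Students compete to be the first to guess the answer on any clue; earlier clues are vaguer and earn more points. Answers are usually a number, a word, or a short phrase, which allows automatic evaluation. Closed models (GPT-5.4, Gemini 3.1 Pro, Claude Opus 4.6) and open models (Kimi-K2.5, DeepSeek-V3.1, GPT-OSS-120B) were evaluated with high and low reasoning settings. The evaluation shows that the dataset is challenging even for state-of-the-art LLMs, which performed worse than the best student contestants. This work contributes a new and challenging benchmark for scientific and mathematical reasoning from the Global South, toward enabling truly global benchmarking of LLMs' capabilities for science and mathematics education.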
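
The abstract notes that answers are short (a number, word, or phrase), that this allows automatic evaluation, and that earlier clues earn more points. A minimal Python sketch of what such clue-by-clue scoring could look like follows. It is an illustration only, not the authors' evaluation code: the function names, the normalization rules, and the point schedule (`max_points=5`, one point fewer per later clue) are all assumptions, since the abstract gives no point values or matching rules.

```python
import re


def normalize(answer: str) -> str:
    """Lowercase, drop most punctuation, and collapse whitespace.

    Dots and hyphens are kept so numeric answers such as "3.14" or "-2"
    survive; a trailing full stop is stripped. (Assumed rules.)
    """
    text = answer.lower().strip()
    text = re.sub(r"[^\w\s.\-]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip(" .")


def first_correct_clue(guesses, gold):
    """Return the 1-based index of the first clue whose guess matches gold.

    `guesses[i]` is the model's answer after seeing clues 1..i+1.
    Returns None if no guess matches.
    """
    target = normalize(gold)
    for index, guess in enumerate(guesses, start=1):
        if normalize(guess) == target:
            return index
    return None


def riddle_points(guesses, gold, max_points=5):
    """Hypothetical schedule: max_points on clue 1, one fewer per later clue,
    never below 1 for a correct answer, and 0 if the riddle is never solved."""
    index = first_correct_clue(guesses, gold)
    if index is None:
        return 0
    return max(max_points - (index - 1), 1)
```

For example, `riddle_points(["carbon", "Benzene!", "benzene"], "benzene")` scores the riddle as solved on the second clue under the assumed schedule. Exact match after normalization is the simplest choice; a real evaluation might also need to accept synonyms or equivalent numeric forms.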

