Fugu-MT Paper Translation (Abstract): Do You Get the Hint? Benchmarking LLMs on the Board Game Concept

Paper abstract: Do You Get the Hint? Benchmarking LLMs on the Board Game Concept

arxiv url: http://arxiv.org/abs/2510.13271v1
Date: Wed, 15 Oct 2025 08:17:25 GMT
Status: Translation complete
Last updated in system: 2025-10-16 20:13:28.56585
Title: Do You Get the Hint? Benchmarking LLMs on the Board Game Concept
Title (reference translation): Do You Get the Hint? Benchmarking LLMs on the Board Game Concept
Authors: Ine Gevers, Walter Daelemans
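Abstract summary: Large language models (LLMs) have achieved striking successes on many benchmarks, yet recent studies continue to expose fundamental weaknesses. This paper introduces Concept, a simple word-guessing board game, as a benchmark for probing abductive reasoning in a representation close to natural language data. The results show that the game is easily solved by humans (with a success rate of over 90%).
Reference score (independently computed attention metric): 1.671764884922859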
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have achieved striking successes on many benchmarks, yet recent studies continue to expose fundamental weaknesses. In particular, tasks that require abstract reasoning remain challenging, often because they use representations such as grids, symbols, or visual patterns that differ from the natural language data LLMs are trained on. In this paper, we introduce Concept, a simple word-guessing board game, as a benchmark for probing abductive reasoning in a representation that is much closer to LLM pre-training data: natural language. Our results show that this game, easily solved by humans (with a success rate of over 90%), is still very challenging for state-of-the-art LLMs (no model exceeds 40% success rate). Specifically, we observe that LLMs struggle with interpreting other players' strategic intents, and with correcting initial hypotheses given sequential information updates. In addition, we extend the evaluation across multiple languages, and find that the LLM performance drops further in lower-resource languages (Dutch, French, and Spanish) compared to English.
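Abstract (reference translation): Large language models (LLMs) have achieved great success on many benchmarks, but recent studies continue to reveal fundamental weaknesses. In particular, tasks that require abstract reasoning remain difficult, often because they use representations such as grids, symbols, or visual patterns that differ from natural language data. In this paper, we introduce Concept, a simple word-guessing board game, as a benchmark for probing abductive reasoning using natural language, a representation close to LLM pre-training data. The results show that this game is easily solved by humans (success rate above 90%) but remains very difficult for current LLMs (no model exceeds a 40% success rate). Specifically, we observe that LLMs struggle to interpret other players' strategic intents and to revise initial hypotheses as information is updated sequentially. Furthermore, extending the evaluation across multiple languages, we find that LLM performance drops further in lower-resource languages (Dutch, French, and Spanish) compared to English.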

Related papers