Fugu-MT Paper Translation (Abstract): LTD-Bench: Evaluating Large Language Models by Letting Them Draw

Paper overview: LTD-Bench: Evaluating Large Language Models by Letting Them Draw

arxiv url: http://arxiv.org/abs/2511.02347v1
Date: Tue, 04 Nov 2025 08:11:23 GMT
Status: Translation complete
Last updated in system: 2025-11-05 18:47:05.846097
Title: LTD-Bench: Evaluating Large Language Models by Letting Them Draw
Title (reference translation): LTD-Bench: Evaluating Large Language Models by Letting Them Draw
Authors: Liuhao Lin, Ke Li, Zihan Xu, Yuchen Shi, Yulei Qin, Yan Zhang, Xing Sun, Rongrong Ji
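Abstract summary: LTD-Bench is a breakthrough benchmark for large language models (LLMs). It transforms LLM evaluation from abstract scores to directly observable visual outputs. LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigating model similarity.
Reference score (attention level computed independently by the site): 57.237152905238084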
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Current evaluation paradigms for large language models (LLMs) represent a critical blind spot in AI research--relying on opaque numerical metrics that conceal fundamental limitations in spatial reasoning while providing no intuitive understanding of model capabilities. This deficiency creates a dangerous disconnect between reported performance and practical abilities, particularly for applications requiring physical world understanding. We introduce LTD-Bench, a breakthrough benchmark that transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. This approach makes spatial reasoning limitations immediately apparent even to non-experts, bridging the fundamental gap between statistical performance and intuitive assessment. LTD-Bench implements a comprehensive methodology with complementary generation tasks (testing spatial imagination) and recognition tasks (assessing spatial perception) across three progressively challenging difficulty levels, methodically evaluating both directions of the critical language-spatial mapping. Our extensive experiments with state-of-the-art models expose an alarming capability gap: even LLMs achieving impressive results on traditional benchmarks demonstrate profound deficiencies in establishing bidirectional mappings between language and spatial concept--a fundamental limitation that undermines their potential as genuine world models. Furthermore, LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigate model similarity.
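Abstract (reference translation): The current evaluation paradigms for large language models (LLMs) are a critical blind spot in AI research. This deficiency creates a dangerous disconnect between reported performance and practical abilities, particularly for applications that require understanding of the physical world. We introduce LTD-Bench, a breakthrough benchmark that transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to produce drawings through dot matrices or executable code. This approach makes spatial reasoning limitations immediately apparent even to non-experts, bridging the fundamental gap between statistical performance and intuitive assessment. LTD-Bench implements a comprehensive methodology with complementary generation tasks (testing spatial imagination) and recognition tasks (assessing spatial perception) across three difficulty levels, systematically evaluating both directions of the language-spatial mapping. Even LLMs that achieve impressive results on traditional benchmarks show serious deficiencies in establishing bidirectional mappings between language and spatial concepts. Furthermore, LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigating model similarity.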

Related papers