Fugu-MT paper translation (abstract): Learning to Draw ASCII Improves Spatial Reasoning in Language Models

Paper overview: Learning to Draw ASCII Improves Spatial Reasoning in Language Models

arxiv url: http://arxiv.org/abs/2604.14641v1
Date: Thu, 16 Apr 2026 05:42:09 GMT
Status: Translation complete
Last updated in system: 2026-04-17 21:29:31.742568
Title: Learning to Draw ASCII Improves Spatial Reasoning in Language Models
Title (reference translation): Learning to Draw ASCII Improves Spatial Reasoning in Language Models
Authors: Shiyuan Huang, Li Liu, Jincheng He, Leilani H. Gilpin
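Abstract summary: We train models on layout construction (Text$\rightarrow$ASCII) and find that it significantly improves spatial reasoning from text alone. These improvements transfer to three external spatial reasoning benchmarks.
Reference score (attention metric computed by this site): 12.312689921390104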
License: http://creativecommons.org/licenses/by/4.0/
Abstract: When faced with complex spatial problems, humans naturally sketch layouts to organize their thinking, and the act of drawing further sharpens their understanding. In this work, we ask whether a similar principle holds for Large Language Models (LLMs): can learning to construct explicit visual layouts from spatial descriptions instill genuine spatial understanding? We introduce Text2Space, a dataset that pairs natural language descriptions with ground-truth ASCII grid layouts and spatial QA pairs, enabling us to separate failures in constructing spatial representations from failures in reasoning over them. We adopt ASCII because it is human-readable, operates entirely within the token space of language models, and encodes spatial relations in a structurally verifiable form. Our evaluation reveals a pronounced "Read-Write Asymmetry": LLMs interpret ASCII representations effectively but struggle to produce them from text, and these construction errors propagate to incorrect answers downstream. To address this limitation, we train models on layout construction (Text$\rightarrow$ASCII) and find that it significantly improves spatial reasoning from text alone, even without producing any ASCII at inference time. Combining construction with comprehension training further amplifies these gains. Crucially, these improvements transfer to three external spatial reasoning benchmarks, demonstrating that, much as sketching sharpens human spatial thinking, learning to construct explicit layouts instills spatial understanding that generalizes beyond the training format.
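Abstract (reference translation): When faced with complex spatial problems, humans naturally sketch layouts to organize their thinking, and the act of drawing further deepens their understanding. In this work, we ask whether a similar principle holds for Large Language Models (LLMs). We introduce Text2Space, a dataset that pairs natural language descriptions with ASCII grid layouts and spatial QA pairs. ASCII is human-readable, operates entirely within the token space of language models, and encodes spatial relations in a structurally verifiable form. LLMs interpret ASCII representations effectively but struggle to produce them from text, and these construction errors propagate to incorrect answers downstream. To address this limitation, we train models on layout construction (Text$\rightarrow$ASCII) and find that it significantly improves spatial reasoning from text alone, without producing any ASCII at inference time. Combining construction with comprehension training further amplifies these gains. Crucially, these improvements transfer to three external spatial reasoning benchmarks, showing that, much as sketching sharpens human spatial thinking, learning to construct explicit layouts instills spatial understanding that generalizes beyond the training format.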
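To make the two directions named in the abstract concrete, the following is a minimal Python sketch of a "write" step (object positions to ASCII grid) and a "read" step (ASCII grid to a spatial answer). The grid format, the `.` empty-cell marker, the single-letter object symbols, and every function name here are assumptions made for illustration; the abstract does not specify the actual Text2Space format.

```python
from typing import Dict, Tuple

# Hypothetical layout format: one character per object, "." for an empty cell.
Placement = Dict[str, Tuple[int, int]]  # symbol -> (row, col)


def render_ascii(placements: Placement, rows: int, cols: int) -> str:
    """'Write' direction: turn object positions into an ASCII grid."""
    grid = [["." for _ in range(cols)] for _ in range(rows)]
    for symbol, (r, c) in placements.items():
        if len(symbol) != 1:
            raise ValueError("each object needs a single-character symbol")
        if not (0 <= r < rows and 0 <= c < cols):
            raise ValueError(f"{symbol} is outside the grid")
        if grid[r][c] != ".":
            raise ValueError(f"cell ({r}, {c}) is already occupied")
        grid[r][c] = symbol
    return "\n".join("".join(row) for row in grid)


def parse_ascii(layout: str) -> Placement:
    """'Read' direction: recover object positions from an ASCII grid."""
    found: Placement = {}
    for r, line in enumerate(layout.splitlines()):
        for c, ch in enumerate(line):
            if ch != ".":
                found[ch] = (r, c)
    return found


def is_left_of(layout: str, a: str, b: str) -> bool:
    """Answer a simple spatial question by comparing column indices."""
    pos = parse_ascii(layout)
    return pos[a][1] < pos[b][1]


def layout_matches(predicted: str, reference: str) -> bool:
    """Structural check: both grids place the same symbols in the same cells."""
    return parse_ascii(predicted) == parse_ascii(reference)


if __name__ == "__main__":
    # Hypothetical description: "The chair (C) is in the top-left corner;
    # the table (T) is one row down and two columns to the right."
    reference = render_ascii({"C": (0, 0), "T": (1, 2)}, rows=3, cols=4)
    print(reference)
    print(is_left_of(reference, "C", "T"))
```

Because the layout is plain text, a predicted grid can be compared cell by cell against a reference, which is one way to read the abstract's point that ASCII encodes spatial relations in a structurally verifiable form.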

Related papers

3D-Layout-R1: Structured Reasoning for Language-Instructed Spatial Editing [57.785328880266775]
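We propose a structured reasoning framework that performs text-conditioned spatial layout editing via scene-graph reasoning. By explicitly guiding the reasoning process through structured relational representations, it improves the interpretability and controllability of spatial relationships.
Paper reference translation (metadata) (2026-03-23T17:59:14Z)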
Stitch and Tell: A Structured Multimodal Data Augmentation Method for Spatial Understanding [23.444127854888578]
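Existing vision-language models often suffer from spatial hallucinations. Stitch and Tell (SiTe) injects structured spatial supervision into the data. SiTe constructs stitched image-text pairs by stitching images along a spatial axis.
Paper reference translation (metadata) (2025-12-07T10:07:59Z)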
Seeing through Imagination: Learning Scene Geometry via Implicit Spatial World Modeling [68.14113731953971]
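We introduce MILO, an Implicit spatIaL wOrld modeling paradigm that simulates human-like imagination. We show that the proposed method significantly improves spatial reasoning capabilities across multiple baselines and benchmarks.
Paper reference translation (metadata) (2025-12-01T16:01:41Z)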
SURPRISE3D: A Dataset for Spatial Understanding and Reasoning in Complex 3D Scenes [105.8644620467576]
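Surprise3D is a new dataset designed to evaluate language-guided spatial reasoning segmentation in complex 3D scenes. It comprises more than 200k vision-language pairs across more than 900 detailed indoor scenes from ScanNet++ v2. The dataset includes more than 89k human-annotated spatial queries that are deliberately written without object names.
Paper reference translation (metadata) (2025-07-10T14:01:24Z)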
A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding [78.99798110890157]
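Open-vocabulary 3D visual grounding aims to localize target objects based on free-form language queries. Existing language-field methods struggle to localize instances accurately using the spatial relations in language queries. We propose SpatialReasoner, a new neural-representation-based framework with large language model (LLM)-driven spatial reasoning.
Paper reference translation (metadata) (2025-07-09T10:20:38Z)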
Explicitly Representing Syntax Improves Sentence-to-layout Prediction of Unexpected Situations [21.636786771793364]
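We show that 2D spatial layouts can be predicted from language representations that encode sentence syntax implicitly or explicitly. We propose a structural loss function that better enforces the syntactic structure of the input sentence. This loss could be used in other generation tasks where a tree-like structure underlies the conditioning modality.
Paper reference translation (metadata) (2024-01-25T14:53:30Z)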
Low-Dimensional Structure in the Space of Language Representations is Reflected in Brain Responses [62.197912623223964]
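Language models and translation models exhibit a low-dimensional structure that smoothly interpolates between word embeddings, syntactic and semantic tasks, and future word embeddings. This representation embedding can predict how well each feature space maps to human brain responses to natural language stimuli recorded using fMRI. This suggests that the embedding captures part of the brain's representational structure for natural language.
Paper reference translation (metadata) (2021-06-09T22:59:12Z)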

The related papers list is generated automatically from the titles and abstracts of papers on this site.
