Fugu-MT Paper Translation (Abstract): ASCII Art Turns LLMs into VLA Controllers

Paper abstract: ASCII Art Turns LLMs into VLA Controllers

arxiv url: http://arxiv.org/abs/2606.21470v1
Date: Fri, 19 Jun 2026 14:19:59 GMT
Status: Translation complete
Last updated in system: 2026-06-25 13:12:58.118041
Title: ASCII Art Turns LLMs into VLA Controllers
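Title (reference translation): ASCII Art That Turns LLMs into VLA Controllers
Authors: Yitao Jiang, Roy Xing, Luyang Zhao, Brian Plancher, Muhao Chen, Devin Balkcom
Abstract summary: Vision-Language-Action (VLA) controllers extend vision-language models (VLMs) with action supervision. We demonstrate that a text-only large language model (LLM) can be adapted into a VLA-style controller when visual observations are rendered into a text input.
Reference score (independently computed attention metric): 24.00781789697405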
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language-Action (VLA) controllers are often built by extending vision-language models (VLMs) with action supervision, relying on multimodal backbones with large data and compute requirements. We demonstrate that a text-only large language model (LLM) can be adapted into a VLA-style controller when visual observations are rendered into a text input using an ASCII representation. This ASCII-as-vision interface enables existing training and deployment stacks for LLMs to efficiently condition on visual state, follow natural-language instructions, and produce constrained, executable actions. We fine-tune and compare multiple LLMs and VLMs across model families and scales, using both expert demonstrations from a planning-based teacher and DAgger for iterative improvement. In a 2D manipulation benchmark, both in simulation and on a physical manipulator, the resulting controllers can identify task-relevant entities and plan feasible action sequences. Our results suggest that ASCII rendering can serve as a lightweight, interpretable modality bridge from images to text, complementing conventional VLA pipelines and opening directions for VLA research with text-only backbones.
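Abstract (reference translation): Vision-Language-Action (VLA) controllers are often built by extending vision-language models (VLMs) with action supervision, relying on multimodal backbones with large data and compute requirements. We demonstrate that a text-only large language model (LLM) can be adapted into a VLA-style controller when visual observations are converted into text input using an ASCII representation. This ASCII-as-vision interface allows existing LLM training and deployment stacks to condition efficiently on visual state, follow natural-language instructions, and generate constrained, executable actions. We fine-tune and compare multiple LLMs and VLMs across model families and scales, using expert demonstrations from a planning-based teacher and DAgger for iterative improvement. In a 2D manipulation benchmark, both in simulation and on a physical manipulator, the resulting controllers can identify task-relevant entities and plan feasible action sequences. ASCII rendering can serve as a lightweight, interpretable modality bridge from images to text, complementing conventional VLA pipelines and opening directions for VLA research with text-only backbones.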
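The abstract describes rendering visual observations into an ASCII text input for a text-only LLM, but it does not give the rendering scheme. The following is only a minimal illustrative sketch of the general idea, not the authors' method: the character ramp, the grid size, the mean-brightness rule, and the names `ASCII_RAMP`, `render_ascii`, and `build_prompt` (including the prompt layout) are all assumptions made for this example.

```python
# Hypothetical sketch: the abstract does not specify the paper's rendering
# scheme, character set, or prompt layout. Everything below is assumed.

ASCII_RAMP = " .:-=+*#%@"  # dark -> bright; an arbitrary illustrative choice


def render_ascii(image, cols, rows):
    """Downsample a 2D grayscale image (values 0..255) into rows of characters.

    Each output character covers a rectangular cell of pixels and is chosen
    from ASCII_RAMP according to the cell's mean brightness.
    """
    height = len(image)
    width = len(image[0])
    lines = []
    for r in range(rows):
        y0 = r * height // rows
        y1 = max(y0 + 1, (r + 1) * height // rows)
        chars = []
        for c in range(cols):
            x0 = c * width // cols
            x1 = max(x0 + 1, (c + 1) * width // cols)
            cell = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            mean = sum(cell) / len(cell)
            # Map brightness 0..255 onto an index into the ramp.
            idx = min(len(ASCII_RAMP) - 1, int(mean * len(ASCII_RAMP) / 256))
            chars.append(ASCII_RAMP[idx])
        lines.append("".join(chars))
    return "\n".join(lines)


def build_prompt(image, instruction, cols=16, rows=8):
    """Combine the ASCII observation with a natural-language instruction."""
    scene = render_ascii(image, cols, rows)
    return f"Observation:\n{scene}\n\nInstruction: {instruction}\nAction:"
```

In a sketch like this, the string returned by `build_prompt` would be passed to a fine-tuned text-only model, whose completion would then be parsed into a constrained action. That parsing step is not shown because the abstract does not describe the action format.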
