Fugu-MT paper translation (abstract): Instruction-as-State: Environment-Guided and State-Conditioned Semantic Understanding for Embodied Navigation

Paper overview: Instruction-as-State: Environment-Guided and State-Conditioned Semantic Understanding for Embodied Navigation

arxiv url: http://arxiv.org/abs/2604.18223v1
Date: Mon, 20 Apr 2026 13:09:13 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.884587
Title: Instruction-as-State: Environment-Guided and State-Conditioned Semantic Understanding for Embodied Navigation
Authors: Zhen Liu, Yuhan Liu, Jinjun Wang, Jianyi Liu, Wei Song, Jingwen Fu
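Abstract summary: This paper introduces State-Entangled Environment-Guided Instruction Understanding (S-EGIU). S-EGIU maintains a decision-relevant, token-level instruction state that evolves step by step, conditioned on the agent's perceptual state. It delivers strong performance on several key metrics, including a +2.68% SPL gain on REVERIE Test Unseen.
Reference score (attention score computed independently by this site): 38.77209165510599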
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-and-Language Navigation requires agents to follow natural-language instructions in visually changing environments. A central challenge is the dynamic entanglement between language and observations: the meaning of an instruction shifts as the agent's field of view and spatial context evolve. However, many existing models encode the instruction as a static global representation, limiting their ability to adapt instruction meaning to the current visual context. We therefore model instruction understanding as an Instruction-as-State variable: a decision-relevant, token-level instruction state that evolves step by step conditioned on the agent's perceptual state, where the perceptual state denotes the observation-grounded navigation context at each step. To realize this principle, we introduce State-Entangled Environment-Guided Instruction Understanding (S-EGIU), a coarse-to-fine framework for state-conditioned segment activation and token-level semantic refinement. At the coarse level, S-EGIU activates the instruction segment whose semantics align with the current observation. At the fine level, it refines the activated segment through observation-guided token grounding and contextual modeling, sharpening its internal semantics under the current observation. Together, these stages maintain an instruction state that is continuously updated according to the agent's perceptual state during navigation. S-EGIU delivers strong performance on several key metrics, including a +2.68% SPL gain on REVERIE Test Unseen, and demonstrates consistent efficiency gains across multiple VLN benchmarks, underscoring the value of dynamic instruction-perception entanglement.
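The coarse-to-fine update described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the abstract does not say how S-EGIU computes either stage, so the function name `update_instruction_state`, the mean-pooled segment embeddings, the cosine-similarity scoring, and the softmax token weighting are all assumptions chosen purely for illustration.

```python
import numpy as np


def cosine(rows, vec, eps=1e-8):
    """Cosine similarity between each row of `rows` (n, d) and `vec` (d,)."""
    rows_n = rows / (np.linalg.norm(rows, axis=-1, keepdims=True) + eps)
    vec_n = vec / (np.linalg.norm(vec) + eps)
    return rows_n @ vec_n


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def update_instruction_state(token_emb, segment_ids, obs_emb, temperature=1.0):
    """Hypothetical one-step instruction-state update (illustrative only).

    token_emb:   (T, d) instruction token embeddings
    segment_ids: (T,)   integer segment label per token
    obs_emb:     (d,)   embedding of the current observation
    Returns (active_segment, token_weights, refined_tokens).
    """
    segments = np.unique(segment_ids)

    # Coarse level (assumed): mean-pool each segment and activate the one
    # whose pooled embedding best aligns with the current observation.
    seg_emb = np.stack(
        [token_emb[segment_ids == s].mean(axis=0) for s in segments]
    )
    active = int(segments[int(np.argmax(cosine(seg_emb, obs_emb)))])

    # Fine level (assumed): weight the active segment's tokens by their
    # similarity to the observation, then rescale the token embeddings.
    tokens = token_emb[segment_ids == active]
    weights = softmax(cosine(tokens, obs_emb) / temperature)
    refined = tokens * weights[:, None]
    return active, weights, refined


# Toy usage: two segments of two tokens each, in a 3-D embedding space.
tokens = np.array([[1.0, 0.0, 0.0],
                   [0.9, 0.1, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.5, 0.5, 0.0]])
segment_ids = np.array([0, 0, 1, 1])
observation = np.array([0.0, 1.0, 0.0])
active, weights, refined = update_instruction_state(tokens, segment_ids, observation)
```

Calling the function once per navigation step with a fresh observation embedding yields an instruction state that changes as the observation changes, which is the behavior the abstract contrasts with a static global instruction encoding.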

Related papers