Fugu-MT Paper Translation (Abstract): GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning

Paper overview: GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning

arxiv url: http://arxiv.org/abs/2604.17241v1
Date: Sun, 19 Apr 2026 04:04:02 GMT
Status: Translation complete
System update date: 2026-04-21 21:52:52.411561
Title: GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning
Authors: Kun Wang, Yiming Li, Mingcheng Qu, Aqiang Zhang, Guang Yang, Tonghua Su
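Abstract summary: Implicit spatial relations and deep semantic structures encoded in object attributes are crucial for procedural planning in embodied AI systems. The authors propose GaLa, a vision-language framework for multimodal procedural planning. GaLa significantly outperforms existing methods in execution success rate, LCS, and planning correctness.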
Reference score (independently computed attention score): 14.265218749993956
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Implicit spatial relations and deep semantic structures encoded in object attributes are crucial for procedural planning in embodied AI systems. However, existing approaches often over-rely on the reasoning capabilities of vision-language models (VLMs) themselves, while overlooking the rich structured semantic information that can be mined from multimodal inputs. As a result, models struggle to understand functional spatial relationships in complex scenes. To fully exploit implicit spatial relations and deep semantic structures in multimodal data, we propose GaLa, a vision-language framework for multimodal procedural planning. GaLa introduces a hypergraph-based representation in which object instances in the image are modeled as nodes, and region-level hyperedges are constructed by aggregating objects according to their attributes and functional semantics. This design explicitly captures implicit semantic relations among objects as well as the hierarchical organization of functional regions. Furthermore, we design a TriView HyperGraph Encoder that enforces semantic consistency across the node view, the area view, and the node-area association view via contrastive learning, enabling hypergraph semantics to be injected more effectively into downstream VLM reasoning. Extensive experiments on the ActPlan1K and ALFRED benchmarks demonstrate that GaLa significantly outperforms existing methods in terms of execution success rate, LCS, and planning correctness.
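The hypergraph representation described in the abstract (object instances as nodes, region-level hyperedges formed by grouping objects by attribute and functional semantics) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration: the object names, attribute labels, region labels, and the helper functions `build_hyperedges` and `incidence_matrix` are hypothetical and do not come from the paper, which does not specify its construction procedure in the abstract.

```python
from collections import defaultdict

# Hypothetical detected objects: (instance id, functional attribute, region).
# These values are illustrative assumptions, not data from the paper.
OBJECTS = [
    ("mug_1", "container", "countertop"),
    ("kettle_1", "heat_source", "countertop"),
    ("sink_1", "water_source", "sink_area"),
    ("sponge_1", "cleaning_tool", "sink_area"),
    ("mug_2", "container", "sink_area"),
]


def build_hyperedges(objects):
    """Group object instances into hyperedges, one per region and one per attribute."""
    hyperedges = defaultdict(set)
    for name, attribute, region in objects:
        hyperedges[f"region:{region}"].add(name)
        hyperedges[f"attr:{attribute}"].add(name)
    return dict(hyperedges)


def incidence_matrix(objects, hyperedges):
    """Return (nodes, edge_names, H) where H[i][j] is 1 if node i belongs to hyperedge j."""
    nodes = [name for name, _, _ in objects]
    edge_names = sorted(hyperedges)
    matrix = [
        [1 if node in hyperedges[edge] else 0 for edge in edge_names]
        for node in nodes
    ]
    return nodes, edge_names, matrix


if __name__ == "__main__":
    edges = build_hyperedges(OBJECTS)
    nodes, edge_names, H = incidence_matrix(OBJECTS, edges)
    print(edge_names)
    for node, row in zip(nodes, H):
        print(node, row)
```

Unlike an ordinary graph edge, a single hyperedge such as `region:sink_area` connects any number of nodes at once, which is what lets one structure express both pairwise relations and membership in a functional region. A learned encoder such as the TriView HyperGraph Encoder would operate on a representation of this kind; the contrastive training itself is not sketched here.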
