Fugu-MT paper translation (abstract): AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents

Paper overview: AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents

arxiv url: http://arxiv.org/abs/2605.17933v1
Date: Mon, 18 May 2026 06:41:14 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:48.928355
Title: AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents
Title (reference translation): AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents
Authors: Pan Wang, Yihao Hu, Xiujin Liu, Jingchu Yang, Hang Wang, Zhihao Wen,
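Abstract summary: Vision-language model (VLM) agents rely on memory-augmented reinforcement learning to reuse experience across long-horizon tasks. Most existing frameworks store memory as text and depend on proprietary teacher models to summarize or refine it. We propose AtlasVA, a teacher-free visual skill memory framework.
Reference score (independently computed attention measure): 22.846371945424988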
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-language model (VLM) agents increasingly rely on memory-augmented reinforcement learning to reuse experience across long-horizon tasks, yet most existing frameworks store memory as text and depend on proprietary teacher models to summarize or refine it. This design is poorly matched to spatial decision making: geometric priors are compressed into lossy language, and sparse interaction is often supervised through delayed textual feedback rather than dense visually grounded signals. We argue that reusable experience for VLM agents should remain visually grounded. Based on this insight, we propose AtlasVA, a teacher-free visual skill memory framework that organizes memory into three complementary layers: spatial heatmaps, visual exemplars, and symbolic text skills. AtlasVA further evolves danger and affinity atlases directly from trajectory statistics and lightweight grid heuristics, and reuses these self-evolving atlases as potential-based shaping rewards for reinforcement learning. This unifies perception, memory, and optimization without external LLM supervision. Experiments on Sokoban, FrozenLake, 3D embodied navigation, and 3D robotic manipulation benchmarks show that AtlasVA consistently outperforms text-centric memory baselines and competitive VLM agents, with especially strong gains on spatially intensive tasks. Homepage: https://wangpan-ustc.github.io/AtlasvaWeb
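Abstract (reference translation): Vision-language model (VLM) agents increasingly depend on memory-augmented reinforcement learning to reuse experience across long-horizon tasks, but most existing frameworks store memory as text and depend on proprietary teacher models to summarize or refine it. This design fits spatial decision making poorly: geometric priors are compressed into lossy language, and sparse interaction is often supervised by delayed textual feedback instead of dense, visually grounded signals. The authors argue that reusable experience for VLM agents should stay visually grounded. Building on this insight, they propose AtlasVA, a teacher-free visual skill memory framework that organizes memory into three complementary layers: spatial heatmaps, visual exemplars, and symbolic text skills. AtlasVA also evolves danger and affinity atlases directly from trajectory statistics and lightweight grid heuristics, and reuses these self-evolving atlases as potential-based shaping rewards for reinforcement learning. This unifies perception, memory, and optimization without external LLM supervision. Experiments on Sokoban, FrozenLake, 3D embodied navigation, and 3D robotic manipulation benchmarks show that AtlasVA consistently outperforms text-centric memory baselines and competitive VLM agents, with especially strong gains on spatially intensive tasks. Homepage: https://wangpan-ustc.github.io/AtlasvaWeb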
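The abstract mentions reusing danger and affinity atlases as potential-based shaping rewards. The following is a minimal sketch of generic potential-based reward shaping on a grid, not the authors' implementation. The potential definition (affinity minus danger), the weights, the zero potential at terminal states, and the function names `potential`, `shaped_reward`, and `update_danger` are all assumptions made for illustration.

```python
from typing import List, Tuple

Cell = Tuple[int, int]      # (row, col) on a grid world
Atlas = List[List[float]]   # per-cell heatmap values, assumed to lie in [0, 1]


def potential(cell: Cell, affinity: Atlas, danger: Atlas,
              w_aff: float = 1.0, w_danger: float = 1.0) -> float:
    """Hypothetical potential: high near useful cells, low near risky ones."""
    r, c = cell
    return w_aff * affinity[r][c] - w_danger * danger[r][c]


def shaped_reward(env_reward: float, cell: Cell, next_cell: Cell,
                  affinity: Atlas, danger: Atlas,
                  gamma: float = 0.99, done: bool = False) -> float:
    """Environment reward plus the shaping term gamma * phi(s') - phi(s).

    The potential of a terminal state is taken as 0 (an assumption commonly
    made in episodic settings so the shaping terms telescope cleanly).
    """
    phi_s = potential(cell, affinity, danger)
    phi_next = 0.0 if done else potential(next_cell, affinity, danger)
    return env_reward + gamma * phi_next - phi_s


def update_danger(danger: Atlas, failure_cell: Cell, lr: float = 0.1) -> None:
    """Assumed update rule: nudge the danger value of a failure cell toward 1."""
    r, c = failure_cell
    danger[r][c] += lr * (1.0 - danger[r][c])


if __name__ == "__main__":
    affinity = [[0.0, 0.5], [0.0, 1.0]]
    danger = [[0.0, 0.0], [0.8, 0.0]]
    # Stepping toward a high-affinity cell earns a positive shaping bonus.
    print(shaped_reward(0.0, (0, 0), (0, 1), affinity, danger, gamma=1.0))
    # Stepping into a high-danger cell is penalized.
    print(shaped_reward(0.0, (0, 0), (1, 0), affinity, danger, gamma=1.0))
```

Because the shaping term is a difference of potentials, it adds dense per-step feedback while leaving the ranking of policies under the original reward unchanged, which is the usual motivation for this form of shaping.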

Related papers