Fugu-MT Paper Translation (Abstract): Lightweight Visual Reasoning for Socially-Aware Robots

Paper overview: Lightweight Visual Reasoning for Socially-Aware Robots

arxiv url: http://arxiv.org/abs/2603.03942v1
Date: Wed, 04 Mar 2026 11:08:44 GMT
Status: translation complete
Last updated in system: 2026-03-23 08:17:41.886953
Title: Lightweight Visual Reasoning for Socially-Aware Robots
Title (reference translation): Lightweight Visual Reasoning for Socially Considerate Robots
Authors: Alessio Galatolo, Ronald Cumbal, Alexandros Rouchitsas, Katie Winkle, Didem Gürdür Broo, Ginevra Castellano
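Abstract summary: We propose a lightweight language-to-vision feedback module that closes the loop between the LLM and the vision encoder in Vision-Language Models (VLMs). We evaluate the approach on three robotics-centred tasks: navigation in a simulated environment, sequential scene description, and human-intention recognition. Results show that the method improves Qwen 2.5 (7B) by $3.3\%$ (less distance), $+0.057$ description score, and $+2.93\%$ accuracy, with less than $3\%$ extra parameters.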
Reference score (independently computed attention score): 41.776442767736604
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Robots operating in shared human environments must not only navigate, interact, and detect their surroundings but also interpret and respond to dynamic, and often unpredictable, human behaviours. Although recent advances have shown promise in enhancing robotic perception and instruction-following using Vision-Language Models (VLMs), they remain limited in addressing the complexities of multimodal human-robot interactions (HRI). Motivated by this challenge, we introduce a lightweight language-to-vision feedback module that closes the loop between an LLM and the vision encoder in VLMs. The module projects image-token hidden states through a gated Multi-Layer Perceptron (MLP) back into the encoder input, prompting a second pass that reinterprets the scene under text context. We evaluate this approach on three robotics-centred tasks: navigation in a simulated environment (Habitat), sequential scene description (Mementos-Robotics), and human-intention recognition (our HRI dataset). Results show that our method improves Qwen 2.5 (7B) by $3.3\%$ (less distance), $+0.057$ description score, and $+2.93\%$ accuracy, with less than $3\%$ extra parameters; Gemma 3 (4B) and LLaVA OV 1.5 (4B) show mixed navigation results but gain $+0.111,+0.055$ and $+10.81\%,+4.79\%$ on the latter two tasks. Code is available at https://github.com/alessioGalatolo/VLM-Reasoning-for-Robotics
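Abstract (reference translation): Robots operating in shared human environments must not only navigate, interact with, and detect their surroundings but also interpret and respond to dynamic and unpredictable human behaviour. Recent advances show promise in enhancing robot perception and instruction following with Vision-Language Models (VLMs), but they remain limited in addressing the complexity of multimodal human-robot interaction (HRI). Motivated by this challenge, we introduce a lightweight language-to-vision feedback module that closes the loop between the LLM and the vision encoder in VLMs. The module projects image-token hidden states through a gated Multi-Layer Perceptron (MLP) back into the encoder input, triggering a second pass that reinterprets the scene under the text context. We evaluate the approach on three robotics-centred tasks: navigation in a simulated environment (Habitat), sequential scene description (Mementos-Robotics), and human-intention recognition (our HRI dataset). Results show that the method improves Qwen 2.5 (7B) by $3.3\%$ (less distance), $+0.057$ description score, and $+2.93\%$ accuracy, with less than $3\%$ extra parameters; Gemma 3 (4B) and LLaVA OV 1.5 (4B) show mixed navigation results but gain $+0.111,+0.055$ and $+10.81\%,+4.79\%$ on the latter two tasks. Code is available at https://github.com/alessioGalatolo/VLM-Reasoning-for-Robotics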
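
The feedback loop described in the abstract (image-token hidden states pass through a gated MLP back into the encoder input, followed by a second encoder pass) can be illustrated with a small NumPy sketch. This is not the authors' implementation; their code is in the repository linked above. Everything specific here is an assumption made for illustration: the toy dimensions, the one-layer stand-ins `vision_encoder` and `llm_image_hidden_states`, the scalar `tanh` gate initialised to zero, and the choice to add the feedback to the encoder input residually.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes; the real models are far larger.
D_VIS, D_LLM, D_HID, N_TOK = 8, 16, 32, 4


def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


class GatedFeedbackMLP:
    """Maps LLM hidden states at image-token positions to an encoder-input delta."""

    def __init__(self):
        self.w1 = rng.normal(0.0, 0.02, (D_LLM, D_HID))
        self.w2 = rng.normal(0.0, 0.02, (D_HID, D_VIS))
        # Assumption: a learnable scalar gate, zero-initialised so that the
        # second pass initially reproduces the first one.
        self.gate = 0.0

    def __call__(self, image_token_states):
        delta = gelu(image_token_states @ self.w1) @ self.w2
        return np.tanh(self.gate) * delta


# One-layer stand-ins for the real vision encoder and LLM (illustrative only).
W_ENC = rng.normal(0.0, 0.1, (D_VIS, D_VIS))
W_PROJ = rng.normal(0.0, 0.1, (D_VIS, D_LLM))


def vision_encoder(patch_inputs):
    return np.tanh(patch_inputs @ W_ENC)


def llm_image_hidden_states(visual_tokens, text_context):
    # Mixes a text-context vector into every image-token position.
    return visual_tokens @ W_PROJ + text_context


def two_pass_encode(patch_inputs, text_context, feedback):
    first = vision_encoder(patch_inputs)                    # pass 1: vision only
    hidden = llm_image_hidden_states(first, text_context)   # text-conditioned states
    second = vision_encoder(patch_inputs + feedback(hidden))  # pass 2: with feedback
    return first, second


patches = rng.normal(size=(N_TOK, D_VIS))
text = rng.normal(size=(D_LLM,))
fb = GatedFeedbackMLP()
first, second = two_pass_encode(patches, text, fb)
```

With the gate at zero the second pass equals the first, so under these assumptions the module starts as a no-op on the base model; setting `fb.gate` to a non-zero value lets the text context alter the re-encoded visual tokens.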

Related papers