Fugu-MT paper translation (abstract): Enhancing Lightweight Vision Language Models through Group Competitive Learning for Socially Compliant Navigation

Paper overview: Enhancing Lightweight Vision Language Models through Group Competitive Learning for Socially Compliant Navigation

arxiv url: http://arxiv.org/abs/2603.11447v1
Date: Thu, 12 Mar 2026 02:16:04 GMT
Status: translation complete
Last updated in system: 2026-03-21 18:33:56.706797
Title: Enhancing Lightweight Vision Language Models through Group Competitive Learning for Socially Compliant Navigation
Title (reference translation): Enhancing Lightweight Vision Language Models through Group Competitive Learning for Socially Compliant Navigation
Authors: Xinyu Zhang, Atsushi Konno, Toshihiko Yamasaki, Ling Xiao
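Abstract summary: Social robot navigation requires a sophisticated integration of scene semantics and human social norms. Lightweight Vision Language Models (VLMs) enable efficient inference but often exhibit weaker reasoning and decision-making performance. This paper proposes Group Competitive Learning (GCL), a strategy aimed at improving the capabilities of lightweight VLMs.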
Reference score (independently computed attention score): 29.741263131312547
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Social robot navigation requires a sophisticated integration of scene semantics and human social norms. Scaling up Vision Language Models (VLMs) generally improves reasoning and decision-making capabilities for socially compliant navigation. However, increased model size incurs substantial computational overhead, limiting suitability for real-time robotic deployment. Conversely, lightweight VLMs enable efficient inference but often exhibit weaker reasoning and decision-making performance in socially complex environments. Achieving both strong reasoning ability and efficiency remains an open challenge. To bridge this gap, we propose Group Competitive Learning (GCL), a strategy designed to amplify the capabilities of lightweight VLMs. Our strategy introduces the Group Competitive Objective (GCO) to harmonize global semantics with distributional regularization, alongside Asymmetric Group Optimization (AGO) to explore the upper limits of model performance. Empirical evaluations on social navigation benchmarks demonstrate that GCL significantly elevates VLM performance. Specifically, GCL enables the Qwen2.5-VL-3B learner model and the Qwen3-VL-4B guide model to achieve F1 scores of 0.968 and 0.914, respectively, representing 40% and 12% improvements over vanilla supervised fine-tuning (SFT). Notably, under vanilla SFT, the 3B model initially trails the 8B model (F1: 0.692 vs. 0.755). With GCL, however, the 3B model outperforms the 8B baseline model by 28%. These results suggest that GCL provides an effective solution for achieving both high accuracy and computational efficiency in real-world deployment.
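Abstract (reference translation): Social robot navigation requires a sophisticated integration of scene semantics and human social norms. Scaling up vision language models (VLMs) generally improves reasoning and decision-making ability for socially compliant navigation. However, larger models incur greater computational overhead, which limits their suitability for real-time robot deployment. Conversely, lightweight VLMs allow efficient inference but often show weaker reasoning and decision-making performance in socially complex environments. Achieving both strong reasoning ability and efficiency remains an open challenge. To close this gap, the authors propose Group Competitive Learning (GCL), a strategy that amplifies the capabilities of lightweight VLMs. The strategy introduces the Group Competitive Objective (GCO), which harmonizes global semantics with distributional regularization, together with Asymmetric Group Optimization (AGO), which explores the upper limits of model performance. Empirical evaluation on social navigation benchmarks shows that GCL substantially improves VLM performance. Specifically, the Qwen2.5-VL-3B learner model and the Qwen3-VL-4B guide model reach F1 scores of 0.968 and 0.914, improvements of 40% and 12% over vanilla supervised fine-tuning (SFT). Notably, under vanilla SFT the 3B model initially trails the 8B model (F1: 0.692 vs. 0.755). With GCL, however, the 3B model outperforms the 8B baseline model by 28%. These results suggest that GCL offers an effective solution for achieving both high accuracy and computational efficiency in real-world deployment.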
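
The percentages quoted in the abstract can be checked against the F1 scores it reports. The sketch below is only a sanity check, and it assumes the percentages are relative improvements (gain divided by the baseline score); the abstract does not state this explicitly. The function and variable names are illustrative and do not come from the paper.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# F1 scores as reported in the abstract.
f1_3b_sft = 0.692  # 3B model, vanilla SFT
f1_8b_sft = 0.755  # 8B model, vanilla SFT
f1_3b_gcl = 0.968  # Qwen2.5-VL-3B learner trained with GCL

# Assumption: the reported percentages are relative, not absolute, gains.
gain_over_own_sft = relative_improvement(f1_3b_gcl, f1_3b_sft)
gain_over_8b = relative_improvement(f1_3b_gcl, f1_8b_sft)

print(f"3B GCL vs. 3B SFT: {gain_over_own_sft:.1f}%")
print(f"3B GCL vs. 8B SFT: {gain_over_8b:.1f}%")
```

Under that assumption both values round to the reported 40% and 28%. The absolute gains (about 0.28 and 0.21 F1) would not match those figures, which suggests the relative reading is the intended one.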

Related papers