Fugu-MT Paper Translation (Abstract): From the Laboratory to Real-World Application: Evaluating Zero-Shot Scene Interpretation on Edge Devices for Mobile Robotics

Paper overview: From the Laboratory to Real-World Application: Evaluating Zero-Shot Scene Interpretation on Edge Devices for Mobile Robotics

arxiv url: http://arxiv.org/abs/2511.02427v1
Date: Tue, 04 Nov 2025 09:58:29 GMT
Status: Translation complete
Last updated in system: 2025-11-05 18:47:05.886517
Title: From the Laboratory to Real-World Application: Evaluating Zero-Shot Scene Interpretation on Edge Devices for Mobile Robotics
Authors: Nicolas Schuler, Lea Dewald, Nick Baldig, Jürgen Graf
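Abstract summary: This paper investigates the capabilities of state-of-the-art Visual Language Models (VLMs) for the tasks of Scene Interpretation and Action Recognition. The proposed pipeline is evaluated on a diverse dataset consisting of various real-world cityscape, on-campus and indoor scenarios. The experimental evaluation discusses the potential of these small models on edge devices, with particular emphasis on challenges, weaknesses, inherent model biases and the application of the gained information.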
Reference score (independently computed attention score): 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video Understanding, Scene Interpretation and Commonsense Reasoning are highly challenging tasks enabling the interpretation of visual information, allowing agents to perceive, interact with and make rational decisions in their environment. Large Language Models (LLMs) and Visual Language Models (VLMs) have shown remarkable advancements in these areas in recent years, enabling domain-specific applications as well as zero-shot open vocabulary tasks, combining multiple domains. However, the required computational complexity poses challenges for their application on edge devices and in the context of Mobile Robotics, especially considering the trade-off between accuracy and inference time. In this paper, we investigate the capabilities of state-of-the-art VLMs for the task of Scene Interpretation and Action Recognition, with special regard to small VLMs capable of being deployed to edge devices in the context of Mobile Robotics. The proposed pipeline is evaluated on a diverse dataset consisting of various real-world cityscape, on-campus and indoor scenarios. The experimental evaluation discusses the potential of these small models on edge devices, with particular emphasis on challenges, weaknesses, inherent model biases and the application of the gained information. Supplementary material is provided via the following repository: https://datahub.rz.rptu.de/hstr-csrl-public/publications/scene-interpretation-on-edge-devices/
