Fugu-MT paper translation (abstract): A Framework for Low-Latency, LLM-driven Multimodal Interaction on the Pepper Robot

Paper overview: A Framework for Low-Latency, LLM-driven Multimodal Interaction on the Pepper Robot

arxiv url: http://arxiv.org/abs/2603.21013v1
Date: Fri, 09 Jan 2026 17:33:38 GMT
Status: translation complete
Last updated in system: 2026-04-06 02:36:12.961611
Title: A Framework for Low-Latency, LLM-driven Multimodal Interaction on the Pepper Robot
Authors: Erich Studerus, Vivienne Jia Zhong, Stephan Vonschallen
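Abstract summary: We present an open-source Android framework for the Pepper robot. It integrates end-to-end Speech-to-Speech (S2S) models to achieve low-latency interaction. We implement Function Calling capabilities that elevate the large language model to an agentic planner.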
Reference score (attention measure computed by this site): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite recent advances in integrating Large Language Models (LLMs) into social robotics, two weaknesses persist. First, existing implementations on platforms like Pepper often rely on cascaded Speech-to-Text (STT)->LLM->Text-to-Speech (TTS) pipelines, resulting in high latency and the loss of paralinguistic information. Second, most implementations fail to fully leverage the LLM's capabilities for multimodal perception and agentic control. We present an open-source Android framework for the Pepper robot that addresses these limitations through two key innovations. First, we integrate end-to-end Speech-to-Speech (S2S) models to achieve low-latency interaction while preserving paralinguistic cues and enabling adaptive intonation. Second, we implement extensive Function Calling capabilities that elevate the LLM to an agentic planner, orchestrating robot actions (navigation, gaze control, tablet interaction) and integrating diverse multimodal feedback (vision, touch, system state). The framework runs on the robot's tablet but can also be built to run on regular Android smartphones or tablets, decoupling development from robot hardware. This work provides the HRI community with a practical, extensible platform for exploring advanced LLM-driven embodied interaction.
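The function-calling idea in the abstract, where the LLM acts as a planner that issues robot actions and receives feedback, can be illustrated with a minimal sketch. This is not the framework's actual code (which targets Android); the tool names `look_at` and `navigate_to`, the JSON call shape, and the result dictionaries are all assumptions made for illustration.

```python
import json
from typing import Callable, Dict

# Hypothetical tool registry: names and signatures are illustrative only,
# not taken from the framework described in the abstract.
TOOLS: Dict[str, Callable[..., dict]] = {}


def tool(name: str):
    """Register a function under a tool name the model can call."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register


@tool("look_at")
def look_at(target: str) -> dict:
    # Stand-in for a gaze-control command on the robot.
    return {"status": "ok", "action": "look_at", "target": target}


@tool("navigate_to")
def navigate_to(x: float, y: float) -> dict:
    # Stand-in for a navigation command.
    return {"status": "ok", "action": "navigate_to", "x": x, "y": y}


def dispatch(call_json: str) -> dict:
    """Run one model-issued function call and return a result to feed back."""
    call = json.loads(call_json)
    fn = TOOLS.get(call.get("name"))
    if fn is None:
        return {"status": "error", "reason": f"unknown tool: {call.get('name')}"}
    try:
        return fn(**call.get("arguments", {}))
    except TypeError as exc:
        # Wrong or missing arguments are reported back rather than raised.
        return {"status": "error", "reason": str(exc)}
```

In a loop of this kind, each result returned by `dispatch` would be sent back to the model as feedback, so that errors such as an unknown tool or missing arguments can inform its next step instead of crashing the interaction.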
