Fugu-MT Paper Translation (Summary): X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction

Paper summary: X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction

arxiv url: http://arxiv.org/abs/2605.05765v1
Date: Thu, 07 May 2026 06:58:34 GMT
Status: Translation complete
Last updated in system: 2026-05-08 22:27:11.580478
Title: X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction
Title (reference translation): X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction
Authors: Xiaoming Ren, Ru Zhen, Chao Li, Yang Song, Qiuxia Hou, Yanhao Zhang, Peng Liu, Qi Qi, Quanlong Zheng, Qi Wu, Zhenyi Liao, Binqiang Pan, Haobo Ji, Haonan Lu
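Abstract summary: X-OmniClaw is a unified mobile agent designed for multimodal understanding and interaction in the Android ecosystem. Its unified architecture of perception, memory, and action enables the agent to handle complex mobile tasks with high contextual awareness.
Reference score (independently computed attention measure): 23.113056221576667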
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Inspired by the development of OpenClaw, there is a growing demand for mobile-based personal agents capable of handling complex and intuitive interactions. In this technical report, we introduce X-OmniClaw, a unified mobile agent designed for multimodal understanding and interaction in the Android ecosystem. This unified architecture of perception, memory, and action enables the agent to handle complex mobile tasks with high contextual awareness. Specifically, Omni Perception provides a unified multimodal ingress pipeline that integrates UI states, real-world visual contexts, and speech inputs, leveraging a temporal alignment module to decompose raw data into structured multimodal intent representations. Omni Memory leverages multimodal memory optimization to enhance personalized intelligence by integrating runtime working memory for task continuity with long-term personal memory distilled from local data, enabling highly context-aware and personalized interactions. Finally, Omni Action employs a hybrid grounding strategy that combines structural XML metadata with visual perception for robust interaction. Through Behavior Cloning and Trajectory Replay, the system captures user navigation as reusable skills, enabling precise direct-access execution. Demonstrations across diverse scenarios show that X-OmniClaw effectively enhances interaction efficiency and task reliability, providing a practical architectural blueprint for the next generation of mobile-native personal assistants.
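Abstract (reference translation): Inspired by the development of OpenClaw, demand is growing for mobile-based personal agents that can handle complex and intuitive interactions. This technical report introduces X-OmniClaw, a unified mobile agent aimed at multimodal understanding and interaction in the Android ecosystem. Its unified architecture of perception, memory, and action lets the agent handle complex mobile tasks with high contextual awareness. Specifically, Omni Perception provides a unified multimodal input pipeline that integrates UI states, real-world visual contexts, and speech inputs, and uses a temporal alignment module to decompose raw data into structured multimodal intent representations. Omni Memory uses multimodal memory optimization to improve personalized intelligence by integrating runtime working memory for task continuity with long-term personal memory distilled from local data. Finally, Omni Action adopts a hybrid grounding strategy that combines structural XML metadata with visual perception for robust interaction. Through Behavior Cloning and Trajectory Replay, the system captures user navigation as reusable skills, enabling precise direct-access execution. Demonstrations across diverse scenarios show that X-OmniClaw effectively improves interaction efficiency and task reliability, providing a practical architectural blueprint for the next generation of mobile-native personal assistants.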
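
The hybrid grounding strategy described in the abstract (structural XML metadata combined with visual perception) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no code, so the function names `ground_from_xml` and `hybrid_ground`, the text-matching rule, and the `visual_locator` callback are all hypothetical. The sketch assumes the UI state arrives as a uiautomator-style XML dump whose `node` elements carry `text` and `bounds="[x1,y1][x2,y2]"` attributes, and it simply tries the XML first and falls back to a visual locator when no node matches.

```python
import re
import xml.etree.ElementTree as ET
from typing import Callable, Optional, Tuple

Point = Tuple[int, int]

# Assumption: bounds follow the uiautomator dump format "[x1,y1][x2,y2]".
_BOUNDS = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")


def ground_from_xml(ui_xml: str, label: str) -> Optional[Point]:
    """Return the center of the first node whose text equals `label`
    (case-insensitive), or None if no such node has parseable bounds."""
    wanted = label.strip().lower()
    root = ET.fromstring(ui_xml)
    for node in root.iter("node"):
        if node.get("text", "").strip().lower() != wanted:
            continue
        match = _BOUNDS.fullmatch(node.get("bounds", ""))
        if match:
            x1, y1, x2, y2 = map(int, match.groups())
            return ((x1 + x2) // 2, (y1 + y2) // 2)
    return None


def hybrid_ground(
    ui_xml: str,
    label: str,
    visual_locator: Callable[[str], Optional[Point]],
) -> Tuple[Optional[Point], str]:
    """Try structural grounding first, then the (hypothetical) visual locator.

    Returns the tap point and which source produced it:
    "xml", "visual", or "none".
    """
    point = ground_from_xml(ui_xml, label)
    if point is not None:
        return point, "xml"
    point = visual_locator(label)
    if point is not None:
        return point, "visual"
    return None, "none"
```

In this sketch the XML path is preferred because its coordinates are exact when the element is exposed in the hierarchy, while the visual path covers elements that the hierarchy does not describe. Whether X-OmniClaw orders or fuses the two sources this way is not stated in the abstract.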

