Fugu-MT Paper Translation (Summary): RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs

Paper summary: RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs

arxiv url: http://arxiv.org/abs/2604.07765v1
Date: Thu, 09 Apr 2026 03:40:46 GMT
Status: Translation complete
Last updated in system: 2026-04-10 18:34:05.677674
Title: RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs
Authors: Liang Yao, Shengxiang Xu, Fan Liu, Chuanyi Zhang, Bishun Yao, Rui Min, Yongjun Li, Chaoqian Ouyang, Shimin Di, Min-Ling Zhang
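Abstract summary: Earth Observation (EO) systems are designed to support domain experts who express their requirements through vague natural language. A practical EO AI system must bridge the gap between ambiguous human queries and the appropriate multi-granularity visual analysis tasks. The authors propose RemoteAgent, an agentic framework that strategically respects the intrinsic capability boundaries of MLLMs.
Reference score (attention score computed independently by this site): 55.44696796126619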
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Earth Observation (EO) systems are essentially designed to support domain experts who often express their requirements through vague natural language rather than precise, machine-friendly instructions. Depending on the specific application scenario, these vague queries can demand vastly different levels of visual precision. Consequently, a practical EO AI system must bridge the gap between ambiguous human queries and the appropriate multi-granularity visual analysis tasks, ranging from holistic image interpretation to fine-grained pixel-wise predictions. While Multi-modal Large Language Models (MLLMs) demonstrate strong semantic understanding, their text-based output format is inherently ill-suited for dense, precision-critical spatial predictions. Existing agentic frameworks address this limitation by delegating tasks to external tools, but indiscriminate tool invocation is computationally inefficient and underutilizes the MLLM's native capabilities. To this end, we propose RemoteAgent, an agentic framework that strategically respects the intrinsic capability boundaries of MLLMs. To empower this framework to understand real user intents, we construct VagueEO, a human-centric instruction dataset pairing EO tasks with simulated vague natural-language queries. By leveraging VagueEO for reinforcement fine-tuning, we align an MLLM into a robust cognitive core that directly resolves image- and sparse region-level tasks. Consequently, RemoteAgent processes suitable tasks internally while intelligently orchestrating specialized tools via the Model Context Protocol exclusively for dense predictions. Extensive experiments demonstrate that RemoteAgent achieves robust intent recognition capabilities while delivering highly competitive performance across diverse EO tasks.
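The routing idea described in the abstract (image-level and sparse region-level tasks are answered by the MLLM itself, and only dense predictions are delegated to a specialized tool) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Granularity`, `Decision`, `route`, and the tool name `segmentation_tool` are hypothetical, the granularity is assumed to have already been inferred from the vague query by the fine-tuned model, and no actual Model Context Protocol call is made.

```python
from dataclasses import dataclass
from enum import Enum


class Granularity(Enum):
    """Hypothetical task granularities, following the abstract's wording."""
    IMAGE = "image"                  # holistic image interpretation
    SPARSE_REGION = "sparse_region"  # sparse region-level output
    DENSE = "dense"                  # fine-grained pixel-wise prediction


@dataclass(frozen=True)
class Decision:
    granularity: Granularity
    handler: str  # "internal" or the name of an external tool


def route(granularity: Granularity,
          dense_tool: str = "segmentation_tool") -> Decision:
    """Handle the task internally unless it requires dense prediction.

    `dense_tool` is a placeholder name (an assumption); in the paper the
    tool would be reached via the Model Context Protocol.
    """
    if granularity is Granularity.DENSE:
        return Decision(granularity, dense_tool)
    return Decision(granularity, "internal")


if __name__ == "__main__":
    for g in Granularity:
        print(g.value, "->", route(g).handler)
```

The sketch only captures the decision boundary; how the model recognizes the intended granularity from a vague query is the part the paper addresses with reinforcement fine-tuning on VagueEO.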


Related papers