Fugu-MT Paper Translation (Abstract): RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models

Paper summary: RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models

arxiv url: http://arxiv.org/abs/2604.14951v1
Date: Thu, 16 Apr 2026 12:47:09 GMT
Status: Translation complete
Last updated in system: 2026-04-17 21:29:31.896608
Title: RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models
Title (reference translation): RaTA-Tool: Retrieval-Based Tool Selection Using Multimodal Large Language Models
Authors: Gabriele Mattioli, Evelyn Turri, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
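Abstract summary: We introduce RaTA-Tool, a new framework for open-world multimodal tool selection. The proposed approach lets an MLLM convert a multimodal query into a structured task description and then retrieve the most appropriate tool. To further improve alignment between task descriptions and tool selection, a preference-based optimization stage is incorporated.
Reference score (independently computed attention metric): 57.15854852525046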
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Tool learning with foundation models aims to endow AI systems with the ability to invoke external resources -- such as APIs, computational utilities, and specialized models -- to solve complex tasks beyond the reach of standalone language generation. While recent advances in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have expanded their reasoning and perception capabilities, existing tool-use methods are predominantly limited to text-only inputs and closed-world settings. Consequently, they struggle to interpret multimodal user instructions and cannot generalize to tools unseen during training. In this work, we introduce RaTA-Tool, a novel framework for open-world multimodal tool selection. Rather than learning direct mappings from user queries to fixed tool identifiers, our approach enables an MLLM to convert a multimodal query into a structured task description and subsequently retrieve the most appropriate tool by matching this representation against semantically rich, machine-readable tool descriptions. This retrieval-based formulation naturally supports extensibility to new tools without retraining. To further improve alignment between task descriptions and tool selection, we incorporate a preference-based optimization stage using Direct Preference Optimization (DPO). To support research in this setting, we also introduce the first dataset for open-world multimodal tool use, featuring standardized tool descriptions derived from Hugging Face model cards. Extensive experiments demonstrate that our approach significantly improves tool-selection performance, particularly in open-world, multimodal scenarios.
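The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes the MLLM has already produced a textual task description, and it swaps in a simple bag-of-words cosine similarity for whatever matching model the paper actually uses. The function names, the tool catalogue, and the description strings are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_tool(task_description: str, tool_descriptions: dict[str, str]) -> str:
    """Return the name of the tool whose description best matches the task."""
    query = Counter(tokenize(task_description))
    return max(
        tool_descriptions,
        key=lambda name: cosine(query, Counter(tokenize(tool_descriptions[name]))),
    )


# Hypothetical tool catalogue. The paper's dataset derives descriptions from
# Hugging Face model cards; their exact schema is not given in the abstract.
TOOLS = {
    "image-captioner": "generate a text caption describing the content of an image",
    "speech-recognizer": "transcribe spoken audio into text",
    "object-detector": "detect and localize objects in an image with bounding boxes",
}

# Stand-in for the structured task description an MLLM would produce.
task = "transcribe the spoken audio clip into text"
print(select_tool(task, TOOLS))
```

Because selection is a lookup over descriptions rather than a prediction over fixed tool identifiers, adding an entry to `TOOLS` makes a new tool selectable without retraining anything, which is the extensibility property the abstract highlights.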
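Abstract (reference translation): Tool learning with foundation models aims to give AI systems the ability to call external resources, such as APIs, computational utilities, and specialized models, in order to solve complex tasks beyond the reach of standalone language generation. Recent advances in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have improved reasoning and perception capabilities, but existing tool-use methods are mostly limited to text-only inputs and closed-world settings. As a result, they struggle to interpret multimodal user instructions and cannot generalize to tools unseen during training. This paper introduces RaTA-Tool, a new framework for open-world multimodal tool selection. Instead of learning a direct mapping from user queries to fixed tool identifiers, the MLLM converts a multimodal query into a structured task description and retrieves the most appropriate tool by matching this representation against semantically rich, machine-readable tool descriptions. This retrieval-based formulation naturally supports extension to new tools without retraining. To further improve alignment between task descriptions and tool selection, a preference-based optimization stage using Direct Preference Optimization (DPO) is incorporated. To support research in this setting, the paper also introduces the first dataset for open-world multimodal tool use, featuring standardized tool descriptions derived from Hugging Face model cards. Extensive experiments show that tool-selection performance improves significantly, particularly in open-world, multimodal scenarios.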
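For readers unfamiliar with the preference-based stage, the standard DPO objective for one preference pair can be sketched as follows. This is the general DPO loss, not a reconstruction of the paper's training code: the abstract does not say how preference pairs are built, and the function name, argument names, and the default `beta` of 0.1 are assumptions made for illustration.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; the abstract does not report one
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # Implicit reward margin: how much more the policy prefers the chosen
    # output over the rejected one, relative to the frozen reference model.
    margin = (policy_logp_chosen - ref_logp_chosen) - (
        policy_logp_rejected - ref_logp_rejected
    )
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z), computed in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy and the reference model agree, the margin is zero and the loss equals log 2. The loss falls as the policy assigns relatively more probability to the preferred output, for example a task description that leads to the correct tool.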

Related papers