Fugu-MT paper translation (abstract): ToolMem: Enhancing Multimodal Agents with Learnable Tool Capability Memory

Paper overview: ToolMem: Enhancing Multimodal Agents with Learnable Tool Capability Memory

arxiv url: http://arxiv.org/abs/2510.06664v1
Date: Wed, 08 Oct 2025 05:32:31 GMT
Status: Translation complete
Last updated in system: 2025-10-09 16:41:20.318026
Title: ToolMem: Enhancing Multimodal Agents with Learnable Tool Capability Memory
Authors: Yunzhong Xiao, Yangmin Li, Hewei Wang, Yunlong Tang, Zora Zhiruo Wang,
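Abstract summary: The authors propose ToolMem, which enables agents to develop memories of tool capabilities from previous interactions. ToolMem is evaluated on learning varied text generation and text-to-image generation neural tools. ToolMem-augmented agents predict tool performance 14.8% and 28.7% more accurately in text and multimodal generation scenarios, respectively.
Reference score (attention score computed independently by this site): 9.63559753690456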
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Agents utilizing tools powered by large language models (LLMs) or vision-language models (VLMs) have demonstrated remarkable progress in diverse tasks across text and visual modalities. Unlike traditional tools such as calculators, which give deterministic outputs, neural tools perform uncertainly across task scenarios. While different tools for a task may excel in varied scenarios, existing agents typically rely on fixed tools, thus limiting the flexibility in selecting the most suitable tool for specific tasks. In contrast, humans snowball their understanding of the capabilities of different tools by interacting with them, and apply this knowledge to select the optimal tool when solving a future task. To build agents that similarly benefit from this process, we propose ToolMem that enables agents to develop memories of tool capabilities from previous interactions, by summarizing their strengths and weaknesses and storing them in memory; at inference, the agent can retrieve relevant entries from ToolMem, and select the best tool to solve individual tasks more accurately. We evaluate ToolMem on learning varied text generation and text-to-image generation neural tools. Compared to no-memory, generic agents, we find ToolMem-augmented agents predict tool performance 14.8% and 28.7% more accurately across text and multimodal generation scenarios. Moreover, ToolMem facilitates optimal tool selection among multiple choices by 21% and 24% absolute increases in respective scenarios.
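The store, retrieve, and select loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how entries are represented or retrieved, so the `ToolMemory` class, the numeric `score` field, the token-overlap (`jaccard`) retriever, and the tool names `tool_a` and `tool_b` are all assumptions made for illustration. In the paper the agent is LLM-based and works with summarized strengths and weaknesses; here a similarity-weighted average of past scores stands in for that reasoning.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    tool: str     # hypothetical tool name
    task: str     # description of a past task
    score: float  # observed output quality in [0, 1] (assumed scale)
    note: str     # short summary of a strength or weakness


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a simple stand-in for a real retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


@dataclass
class ToolMemory:
    entries: list[MemoryEntry] = field(default_factory=list)

    def add(self, tool: str, task: str, score: float, note: str) -> None:
        """Record what was learned from one interaction with a tool."""
        self.entries.append(MemoryEntry(tool, task, score, note))

    def retrieve(self, tool: str, task: str, k: int = 2) -> list[MemoryEntry]:
        """Return the k past entries for `tool` most similar to `task`."""
        own = [e for e in self.entries if e.tool == tool]
        return sorted(own, key=lambda e: jaccard(e.task, task), reverse=True)[:k]

    def predict(self, tool: str, task: str, k: int = 2, default: float = 0.5) -> float:
        """Predict performance as a similarity-weighted mean of retrieved scores."""
        hits = self.retrieve(tool, task, k)
        weights = [jaccard(e.task, task) for e in hits]
        total = sum(weights)
        if total == 0:
            return default  # no relevant memory: fall back to a neutral guess
        return sum(w * e.score for w, e in zip(weights, hits)) / total

    def select(self, tools: list[str], task: str) -> str:
        """Pick the tool with the highest predicted performance on `task`."""
        return max(tools, key=lambda t: self.predict(t, task))


# Illustrative usage with made-up entries (not data from the paper).
memory = ToolMemory()
memory.add("tool_a", "render text inside an image", 0.2, "garbles lettering")
memory.add("tool_a", "draw a landscape painting", 0.9, "strong on scenery")
memory.add("tool_b", "render text inside an image", 0.8, "legible lettering")
memory.add("tool_b", "draw a landscape painting", 0.5, "flat colours")

new_task = "render a sign with text inside an image"
best = memory.select(["tool_a", "tool_b"], new_task)
print(best)  # tool_b
```

The new task overlaps mostly with the text-rendering entries, so the stronger past result of `tool_b` on that kind of task dominates its prediction. A no-memory agent, by contrast, would have no basis for preferring one tool over the other.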
