Fugu-MT Paper Translation (Abstract): COLT: Enhancing Video Large Language Models with Continual Tool Usage

Paper abstract: COLT: Enhancing Video Large Language Models with Continual Tool Usage

arxiv url: http://arxiv.org/abs/2509.18754v2
Date: Wed, 24 Sep 2025 07:53:56 GMT
Status: Translation complete
Last updated in system: 2025-09-25 14:09:11.252838
Title: COLT: Enhancing Video Large Language Models with Continual Tool Usage
Title (reference translation): COLT: Enhancing Video Large Language Models through Continual Tool Use
Authors: Yuyang Liu, Xinyuan Shi, Xiaondan Liang
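Abstract summary: We propose COntinuaL Tool usage (COLT), which automatically acquires tool-use ability in a successive tool stream. COLT incorporates a learnable tool codebook as a tool-specific memory system. To unleash the tool-use potential of video LLMs, we collect VideoToolBench, a video-centric tool-use instruction tuning dataset.
Reference score (independently computed attention score): 9.709506072510512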
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The success of Large Language Models (LLMs) has significantly propelled the research of video understanding. To harvest the benefits of well-trained expert models (i.e., tools), video LLMs prioritize the exploration of tool usage capabilities. Existing methods either prompt closed-source LLMs or employ the instruction tuning paradigm for tool-use fine-tuning. These methods, however, assume an established repository of fixed tools and struggle to generalize to real-world environments where tool data is perpetually evolving and streaming in. To this end, we propose to enhance open-source video LLMs with COntinuaL Tool usage (termed COLT), which automatically acquires tool-use ability in a successive tool stream without suffering 'catastrophic forgetting' of the past learned tools. Specifically, our COLT incorporates a learnable tool codebook as a tool-specific memory system. Then relevant tools are dynamically selected based on the similarity between user instruction and tool features within the codebook. To unleash the tool usage potential of video LLMs, we collect a video-centric tool-use instruction tuning dataset VideoToolBench. Extensive experiments on both previous video LLM benchmarks and the tool-use-specific VideoToolBench dataset demonstrate the state-of-the-art performance of our proposed COLT.
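Abstract (reference translation): The success of Large Language Models (LLMs) has significantly advanced research on video understanding. To reap the benefits of well-trained expert models (i.e., tools), video LLMs prioritize exploring tool-use capabilities. Existing methods either prompt closed-source LLMs or use the instruction tuning paradigm for tool-use fine-tuning. However, these methods assume an established repository of fixed tools and struggle to generalize to real-world environments where tool data is perpetually evolving and streaming in. We therefore propose enhancing open-source video LLMs with COntinuaL Tool usage (COLT), which automatically acquires tool-use ability in a successive tool stream without catastrophic forgetting of previously learned tools. Specifically, COLT incorporates a learnable tool codebook as a tool-specific memory system. Relevant tools are then dynamically selected based on the similarity between the user instruction and the tool features in the codebook. To unleash the tool-use potential of video LLMs, we collect VideoToolBench, a video-centric tool-use instruction tuning dataset. Extensive experiments on both previous video LLM benchmarks and the tool-use-specific VideoToolBench dataset demonstrate the state-of-the-art performance of the proposed COLT.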
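
Illustration (not from the paper): the abstract says that relevant tools are selected by the similarity between the user instruction and the tool features stored in a codebook. The sketch below shows one way such a lookup could work. Everything in it is an assumption made for illustration: the tool names, the hand-written three-dimensional feature vectors, the use of cosine similarity, and the `select_tools` helper. In COLT the codebook is learned; here it is a fixed dictionary, and no training or forgetting mitigation is modeled.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_tools(instruction_feature, codebook, top_k=1):
    """Return the names of the top_k codebook entries most similar to the instruction."""
    scored = [
        (cosine_similarity(instruction_feature, feature), name)
        for name, feature in codebook.items()
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:top_k]]


# Hypothetical codebook: one feature vector per tool (values invented for this sketch).
codebook = {
    "object_detector": [1.0, 0.0, 0.0],
    "speech_recognizer": [0.0, 1.0, 0.0],
    "ocr": [0.0, 0.0, 1.0],
}

# A tool arriving later in the stream gets its own entry; existing entries are untouched.
codebook["action_recognizer"] = [0.7, 0.7, 0.0]

# Hypothetical instruction embedding; a real system would obtain it from an encoder.
instruction_feature = [0.1, 0.9, 0.0]
selected = select_tools(instruction_feature, codebook, top_k=2)
print(selected)
```

Keeping one entry per tool means a newly arrived tool extends the dictionary without overwriting earlier entries, which is the intuition behind a tool-specific memory. How COLT actually learns and updates its codebook is not described in the abstract.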

Related papers