Fugu-MT Paper Translation (Summary): Rethinking Model Efficiency: Multi-Agent Inference with Large Models

Paper summary: Rethinking Model Efficiency: Multi-Agent Inference with Large Models

arxiv url: http://arxiv.org/abs/2604.04929v1
Date: Mon, 06 Apr 2026 17:59:35 GMT
Status: Translation complete
Last updated in system: 2026-04-07 15:49:19.336493
Title: Rethinking Model Efficiency: Multi-Agent Inference with Large Models
Authors: Sixun Dong, Juhua Hu, Steven Li, Wei Wen, Qi Qian
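Abstract summary: We conduct a comprehensive analysis of latency across the components of vision-language models (VLMs) on simulated data. Experiments show that a large model with fewer output tokens can be more efficient than a small model with a long output sequence. We propose a multi-agent inference framework that keeps the large model's responses short but transfers key reasoning tokens from a small model when necessary.
Reference score (independently computed attention measure): 23.878724608444145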
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Most vision-language models (VLMs) apply a large language model (LLM) as the decoder, where the response tokens are generated sequentially through autoregression. Therefore, the number of output tokens can be the bottleneck of the end-to-end latency. However, different models may require vastly different numbers of output tokens to achieve comparable performance. In this work, we conduct a comprehensive analysis of the latency across different components of VLMs on simulated data. The experiment shows that a large model with fewer output tokens can be more efficient than a small model with a long output sequence. The empirical study on diverse real-world benchmarks confirms the observation that a large model can achieve performance better than or comparable to that of a small model with significantly fewer output tokens. To leverage the efficiency of large models, we propose a multi-agent inference framework that keeps the large model's responses short but transfers the key reasoning tokens from the small model when necessary. The comparison on benchmark tasks demonstrates that reusing the reasoning tokens from small models helps the large model approach the performance it achieves with its own reasoning, which confirms the effectiveness of our proposal.
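The latency argument in the abstract can be illustrated with a toy calculation. The sketch below assumes a simple linear model (a one-off prefill cost plus a fixed cost per generated token); the class, the function names, and every number are hypothetical and are not taken from the paper, which reports its own measurements.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    """Hypothetical latency profile; the values used below are illustrative only."""
    name: str
    prefill_ms: float    # assumed one-off cost to encode the image and prompt
    per_token_ms: float  # assumed cost of one autoregressive decoding step


def end_to_end_latency_ms(profile: ModelProfile, output_tokens: int) -> float:
    """Toy linear model: pay prefill once, then one decoding step per output token."""
    return profile.prefill_ms + profile.per_token_ms * output_tokens


def break_even_tokens(large: ModelProfile, small: ModelProfile, small_tokens: int) -> float:
    """Output length at which the large model's latency equals the small model's."""
    budget_ms = end_to_end_latency_ms(small, small_tokens) - large.prefill_ms
    return budget_ms / large.per_token_ms


# Made-up profiles: the large model is 4x slower per token and at prefill.
small = ModelProfile("small", prefill_ms=50.0, per_token_ms=10.0)
large = ModelProfile("large", prefill_ms=200.0, per_token_ms=40.0)

if __name__ == "__main__":
    print(end_to_end_latency_ms(small, 500))      # long reasoning trace on the small model
    print(end_to_end_latency_ms(large, 50))       # short answer from the large model
    print(break_even_tokens(large, small, 500))   # large model's token budget to stay no slower
```

Under these assumed numbers, the large model answering in 50 tokens finishes well before the small model finishes 500 tokens, even though each of its decoding steps costs four times as much. Whether real models behave this way depends on measured prefill and per-token costs, which this sketch does not attempt to estimate.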

Related papers