Fugu-MT Paper Translation (Abstract): SimpleTool: Parallel Decoding for Real-Time LLM Function Calling

Paper overview: SimpleTool: Parallel Decoding for Real-Time LLM Function Calling

arxiv url: http://arxiv.org/abs/2603.00030v1
Date: Wed, 04 Feb 2026 08:58:27 GMT
Status: Translation complete
Last updated in system: 2026-03-09 01:20:07.977565
Title: SimpleTool: Parallel Decoding for Real-Time LLM Function Calling
Authors: Xiaoxin Shi, Jiaxin Wan, Linkang Dong, Wei Jiang, Yue Liu, Zengfeng Huang
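Abstract summary: SimpleTool achieves a 3-6x end-to-end speedup (up to 9.6x) with only +8.2% parallelization overhead. On Mobile Actions, ST-Qwen-0.5B outperforms Google's FunctionGemma in both accuracy and latency consistency.
Reference score (independently computed attention score): 21.7429929239065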
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: LLM-based function calling enables intelligent agents to interact with external tools and environments, yet autoregressive decoding imposes a fundamental latency bottleneck that limits real-time applications such as embodied intelligence, game AI, and interactive avatars (e.g., 10 Hz control frequency). We observe that function calling differs fundamentally from free-form text generation: structured outputs exhibit substantial token redundancy (delimiters, parameter names), and arguments exhibit weak causal dependencies. Crucially, these two properties must be exploited jointly to achieve real-time performance. We present SimpleTool, which introduces special tokens that serve a dual role: compressing low-entropy tokens (4-6x reduction) while acting as mode selectors that enable independent parallel generation of function name and arguments. This synergistic design achieves 3-6x end-to-end speedup (up to 9.6x) with only +8.2% parallelization overhead. Experiments on five benchmarks across Qwen-series models (0.5B-14B) demonstrate substantial speedup while maintaining competitive or improved accuracy. On Mobile Actions, ST-Qwen-0.5B outperforms Google's FunctionGemma in both accuracy and latency consistency. With quantization on consumer-grade GPU, SimpleTool achieves 61.2ms P50 latency, enabling 16 Hz real-time control at 4B model scale, bridging the gap between LLM function calling and latency-critical real-world deployment.
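The latency figures in the abstract can be sanity-checked with simple arithmetic. The sketch below is not the paper's method or code. It is a minimal illustration, assuming one function call per control tick, of how a 61.2 ms latency maps to a control rate, plus a toy step-count model of why decoding weakly dependent fields in parallel shortens generation. The function names and the per-field token counts are hypothetical and chosen only for illustration.

```python
from typing import Mapping


def max_control_rate_hz(latency_ms: float) -> float:
    """Highest control-loop rate if each tick waits for one function call."""
    return 1000.0 / latency_ms


def decode_steps(token_counts: Mapping[str, int], parallel: bool) -> int:
    """Toy model of decoding steps for one function call.

    Sequential decoding emits every token in turn, so steps add up.
    If the fields are generated independently and in parallel, the
    step count is bounded by the longest field instead.
    """
    counts = list(token_counts.values())
    return max(counts) if parallel else sum(counts)


# 61.2 ms is the P50 latency reported in the abstract.
rate = max_control_rate_hz(61.2)
print(f"{rate:.1f} Hz")  # prints "16.3 Hz"

# Hypothetical token counts per output field (not taken from the paper).
fields = {"function_name": 3, "arg_x": 4, "arg_y": 4}
sequential = decode_steps(fields, parallel=False)  # 3 + 4 + 4 = 11 steps
parallel = decode_steps(fields, parallel=True)     # max(3, 4, 4) = 4 steps
print(sequential, parallel)
```

This toy model ignores the token compression and the +8.2% parallelization overhead described in the abstract, so it should not be read as reproducing the reported 3-6x speedup.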
