Fugu-MT Paper Translation (Abstract): LongFuncEval: Measuring the effectiveness of long context models for function calling

Paper abstract: LongFuncEval: Measuring the effectiveness of long context models for function calling

arxiv url: http://arxiv.org/abs/2505.10570v1
Date: Wed, 30 Apr 2025 15:21:51 GMT
Status: Translation complete
System update date: 2025-05-25 10:52:49.025544
Title: LongFuncEval: Measuring the effectiveness of long context models for function calling
Title (reference translation): LongFuncEval: Measuring the effectiveness of long-context models for function calling
Authors: Kiran Kate, Tejaswini Pedapati, Kinjal Basu, Yara Rizk, Vijil Chenthamarakshan, Subhajit Chaudhury, Mayank Agarwal, Ibrahim Abdelaziz
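Abstract summary: We make the first attempt to comprehensively study the long context understanding capabilities of large language models in the tool calling setup. We observe a performance drop of 7% to 85% as the number of tools increases, a 7% to 91% degradation in answer retrieval as tool responses get longer, and 13% and 40% degradation as multi-turn conversations get longer.
Reference score (independently calculated attention score): 22.799185431614656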
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multiple recent studies have documented large language models' (LLMs) performance on calling external tools/functions. Others focused on LLMs' abilities to handle longer context lengths. At the intersection of these areas lies another interesting problem: LLMs' abilities to accurately perform function calls in long context settings. Particularly, when calling tools, LLMs are encumbered by three predominant challenges: (1) a large catalog of tools, (2) long responses from the tool APIs, and (3) long multi-turn conversations. These challenges are particularly relevant to enterprise applications of LLMs which engage in multi-turn conversations with users to complete complex tasks that require a large catalog of complex tools. The literature contains multiple investigations of long context challenges such as lost in the middle or needle in the haystack for natural language tasks. In this paper, we make the first attempt to comprehensively study the long context understanding capabilities of these models in the tool calling setup. We modify existing benchmarks for challenges 1 and 3, and create a new evaluation set for challenge 2 to enable this analysis. We gradually increase the input context length and also vary the position of the answer in the input. When evaluated with several long context models, we observe a performance drop of 7% to 85% as the number of tools increases, a 7% to 91% degradation in answer retrieval as the tool response length increases, and 13% and 40% degradation as multi-turn conversations get longer. Our study shows that LLMs still struggle with long context in tool calling settings, motivating future research to drive further LLM improvements.
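Abstract (reference translation): Several recent studies have documented the performance of large language models (LLMs) in calling external tools and functions. Others have focused on the ability of LLMs to handle longer context lengths. At the intersection of these areas lies another problem: the ability of LLMs to accurately perform function calls in long context settings. In particular, when calling tools, LLMs face three main challenges: (1) a large catalog of tools, (2) long responses from the tool APIs, and (3) long multi-turn conversations. These challenges are especially relevant to enterprise applications of LLMs, which hold multi-turn conversations with users to complete complex tasks that require a large catalog of complex tools. The literature contains multiple investigations of long context challenges, such as lost in the middle and needle in the haystack, for natural language tasks. In this paper, we make the first attempt to comprehensively study the long context understanding capabilities of these models in the tool calling setup. We modify existing benchmarks for challenges 1 and 3 and create a new evaluation set for challenge 2 to enable this analysis. We gradually increase the input context length and also vary the position of the answer in the input. When evaluated with several long context models, we observe a performance drop of 7% to 85% as the number of tools increases, a 7% to 91% degradation in answer retrieval as the tool response length increases, and 13% and 40% degradation as multi-turn conversations get longer. Our study shows that LLMs still struggle with long context in tool calling settings, motivating future research to further improve LLMs.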
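
Illustrative sketch (not from the paper): the abstract says the input context length is increased gradually and the position of the answer in the input is varied. For the tool-catalog challenge, one way to picture this is the small Python sketch below, which grows a catalog of tool specifications and places the target tool at a chosen relative position. The function `build_catalog`, the tool names, and the dictionary fields are all hypothetical and chosen only for illustration; the paper's actual benchmark construction may differ.

```python
import json


def build_catalog(target_tool, distractors, catalog_size, relative_position):
    """Return `catalog_size` tool specs with `target_tool` placed at
    `relative_position` (0.0 = first entry, 1.0 = last entry)."""
    if not 0.0 <= relative_position <= 1.0:
        raise ValueError("relative_position must be in [0, 1]")
    if catalog_size < 1 or catalog_size - 1 > len(distractors):
        raise ValueError("not enough distractors for the requested size")
    # Fill the catalog with distractors, then insert the target tool.
    catalog = list(distractors[: catalog_size - 1])
    index = round(relative_position * (catalog_size - 1))
    catalog.insert(index, target_tool)
    return catalog


# Hypothetical tool specs, used only to make the sketch runnable.
target = {"name": "get_weather", "description": "Look up the weather for a city."}
distractors = [
    {"name": f"tool_{i}", "description": f"Placeholder tool number {i}."}
    for i in range(200)
]

# Sweep catalog size (context length) and target position.
for size in (10, 50, 200):
    for position in (0.0, 0.5, 1.0):
        catalog = build_catalog(target, distractors, size, position)
        prompt_chars = len(json.dumps(catalog))
        print(size, position, catalog.index(target), prompt_chars)
```

Each generated catalog would then be serialized into the model prompt, so that accuracy can be compared across catalog sizes and target positions.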

Related papers