Fugu-MT paper translation (abstract): Evaluating Robustness of Large Language Models in Enterprise Applications: Benchmarks for Perturbation Consistency Across Formats and Languages

Paper overview: Evaluating Robustness of Large Language Models in Enterprise Applications: Benchmarks for Perturbation Consistency Across Formats and Languages

arxiv url: http://arxiv.org/abs/2601.06341v1
Date: Fri, 09 Jan 2026 22:26:31 GMT
Status: Translation complete
System update date: 2026-01-13 19:08:00.760263
Title: Evaluating Robustness of Large Language Models in Enterprise Applications: Benchmarks for Perturbation Consistency Across Formats and Languages
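Title (reference translation): Evaluating the Robustness of Large Language Models in Enterprise Applications: A Benchmark of Perturbation Consistency Across Formats and Languages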
Authors: Tara Bogavelli, Oluwanifemi Bamgbose, Gabrielle Gauthier Melançon, Fanny Riols, Roshnee Sharma,
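Abstract summary: Even small prompt changes can lead to substantial differences in output. We present a benchmark suite that evaluates robustness across multiple perturbation types. We find that minor perturbations reduce performance by up to 40 percentage points on key enterprise metrics.
Reference score (independently computed attention score): 0.8895014147059547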
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Enterprise LLM applications require consistently high quality and reliable performance across diverse scenarios, demanding robustness to minor variations. Existing research shows that even small prompt changes can lead to substantial differences in output, but has mainly focused on a narrow set of perturbations with small academic datasets, limiting their relevance to real-world applications. To address this, we present a comprehensive benchmark suite that evaluates robustness across multiple perturbation types, including general text edits (e.g., punctuation, whitespace), formatting changes (e.g., JSON, YAML), multilingual and cross-lingual inputs, and positional variations in instructions. Evaluating 11 models ranging from 4B to 120B+ parameters, we find that minor perturbations reduce performance by up to 40 percentage points on key enterprise metrics. Critically, we demonstrate that the relationship between model size and robustness is more nuanced than conventional assumptions suggest: an 8B parameter model (Ministral 3 8B) outperforms most larger models, while another 8B model (Llama 3.1 8B) performs worst overall.
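Abstract (reference translation): Enterprise LLM applications require consistently high-quality, reliable performance across diverse scenarios and must be robust to minor variations. Existing research shows that even small prompt changes can lead to substantial differences in output, but it has mainly focused on a narrow set of perturbations with small academic datasets, which limits its relevance to real-world applications. To address this, we present a comprehensive benchmark suite that evaluates robustness across multiple perturbation types, including general text edits (e.g., punctuation, whitespace), formatting changes (e.g., JSON, YAML), multilingual and cross-lingual inputs, and positional variations in instructions. Evaluating 11 models ranging from 4B to 120B+ parameters, we find that minor perturbations reduce performance by up to 40 percentage points on key enterprise metrics. An 8B parameter model (Ministral 3 8B) outperforms most larger models, while another 8B model (Llama 3.1 8B) performs worst overall.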
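
The perturbation types named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' benchmark code: every function name below is hypothetical, the specific perturbations (doubling spaces, stripping trailing punctuation, wrapping in JSON, moving the instruction last) are assumed examples of each category, and exact-match accuracy stands in for whatever enterprise metrics the paper actually uses.

```python
import json


def perturb_whitespace(prompt: str) -> str:
    """General text edit: double every space (assumed example)."""
    return prompt.replace(" ", "  ")


def perturb_punctuation(prompt: str) -> str:
    """General text edit: strip trailing sentence punctuation (assumed example)."""
    return prompt.rstrip(".?!")


def as_json_prompt(instruction: str, context: str) -> str:
    """Formatting change: carry the same content inside a JSON object."""
    return json.dumps({"instruction": instruction, "context": context})


def instruction_last(instruction: str, context: str) -> str:
    """Positional variation: place the instruction after the context."""
    return f"{context}\n\n{instruction}"


def accuracy(outputs: list[str], references: list[str]) -> float:
    """Fraction of outputs that exactly match their reference (stand-in metric)."""
    if not references or len(outputs) != len(references):
        raise ValueError("outputs and references must be non-empty and equal in length")
    hits = sum(o.strip() == r.strip() for o, r in zip(outputs, references))
    return hits / len(references)


def drop_in_points(baseline: float, perturbed: float) -> float:
    """Performance drop in percentage points, not relative percent."""
    return (baseline - perturbed) * 100.0


if __name__ == "__main__":
    # Toy data only: these outputs are invented to exercise the functions.
    references = ["approve", "reject", "approve", "escalate"]
    baseline_outputs = ["approve", "reject", "approve", "escalate"]
    perturbed_outputs = ["approve", "reject", "reject", "approve"]
    base = accuracy(baseline_outputs, references)
    pert = accuracy(perturbed_outputs, references)
    print(f"drop: {drop_in_points(base, pert):.1f} percentage points")
```

The distinction between percentage points and relative percent matters when reading the headline figure: a fall from 0.90 to 0.50 accuracy is 40 percentage points, but roughly a 44 percent relative decrease.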
