Evaluating Robustness of Large Language Models in Enterprise Applications: Benchmarks for Perturbation Consistency Across Formats and Languages
- URL: http://arxiv.org/abs/2601.06341v1
- Date: Fri, 09 Jan 2026 22:26:31 GMT
- Title: Evaluating Robustness of Large Language Models in Enterprise Applications: Benchmarks for Perturbation Consistency Across Formats and Languages
- Authors: Tara Bogavelli, Oluwanifemi Bamgbose, Gabrielle Gauthier Melançon, Fanny Riols, Roshnee Sharma
- Abstract summary: Even small prompt changes can lead to substantial differences in output. We present a benchmark suite that evaluates robustness across multiple perturbation types. We find that minor perturbations reduce performance by up to 40 percentage points on key enterprise metrics.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Enterprise LLM applications require consistently high quality and reliable performance across diverse scenarios, demanding robustness to minor variations. Existing research shows that even small prompt changes can lead to substantial differences in output, but has mainly focused on a narrow set of perturbations with small academic datasets, limiting their relevance to real-world applications. To address this, we present a comprehensive benchmark suite that evaluates robustness across multiple perturbation types, including general text edits (e.g., punctuation, whitespace), formatting changes (e.g., JSON, YAML), multilingual and cross-lingual inputs, and positional variations in instructions. Evaluating 11 models ranging from 4B to 120B+ parameters, we find that minor perturbations reduce performance by up to 40 percentage points on key enterprise metrics. Critically, we demonstrate that the relationship between model size and robustness is more nuanced than conventional assumptions suggest: an 8B parameter model (Ministral 3 8B) outperforms most larger models, while another 8B model (Llama 3.1 8B) performs worst overall.
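To make the setup concrete, here is a minimal sketch of the kind of perturbation-consistency harness the abstract describes. The perturbation functions, the toy model, and the exact-match consistency metric are illustrative assumptions, not the paper's benchmark code.

```python
# Minimal perturbation-consistency sketch: apply small text and formatting
# perturbations to a prompt and measure how often the output changes.
import json
from typing import Callable

def perturb_whitespace(prompt: str) -> str:
    return prompt.replace(" ", "  ")             # a "general text edit"

def perturb_punctuation(prompt: str) -> str:
    return prompt.rstrip(".!?")                  # drop trailing punctuation

def perturb_format_json(prompt: str) -> str:
    return json.dumps({"instruction": prompt})   # a formatting change

def consistency(model: Callable[[str], str], prompt: str,
                perturbations: list) -> float:
    # Fraction of perturbed prompts whose output matches the unperturbed one.
    baseline = model(prompt)
    hits = sum(model(p(prompt)) == baseline for p in perturbations)
    return hits / len(perturbations)

# Toy "model": echoes the first token, so only some perturbations matter.
toy_model = lambda s: s.split()[0] if s.split() else ""
print(consistency(toy_model, "Summarize the report.",
                  [perturb_whitespace, perturb_punctuation,
                   perturb_format_json]))
```

In a real harness, `toy_model` would be an LLM call and exact match would be replaced by the relevant enterprise metric.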
Related papers
- Llama-Embed-Nemotron-8B: A Universal Text Embedding Model for Multilingual and Cross-Lingual Tasks
llama-embed-nemotron-8b is an open-weights text embedding model. It achieves state-of-the-art performance on the Massive Multilingual Text Embedding Benchmark (MMTEB).
arXiv Detail & Related papers (2025-11-10T12:13:16Z)
- Retrofitting Small Multilingual Models for Retrieval: Matching 7B Performance with 300M Parameters
Training effective multilingual embedding models presents unique challenges due to the diversity of languages and task objectives. We investigate key factors that influence the effectiveness of multilingual embeddings, focusing on training data scale, negative sampling strategies (see the sketch after this entry), and data diversity. We develop a compact (approximately 300M-parameter) multilingual model that achieves retrieval performance comparable to, or even surpassing, current strong 7B models.
arXiv Detail & Related papers (2025-10-16T03:48:59Z)
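One of the factors named above, negative sampling, can be illustrated with a minimal in-batch-negatives sketch. The InfoNCE-style loss below is a common formulation; the batch size, embedding dimension, and temperature are arbitrary assumptions rather than the paper's training setup.

```python
# In-batch negatives: each query's positive document sits on the diagonal
# of the similarity matrix; every other document in the batch is a negative.
import numpy as np

def info_nce_loss(q: np.ndarray, d: np.ndarray, tau: float = 0.05) -> float:
    logits = q @ d.T / tau                       # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs.diagonal().mean())   # NLL of the true pairs

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 32))
d = q + 0.1 * rng.normal(size=(8, 32))           # noisy "positive" documents
q /= np.linalg.norm(q, axis=1, keepdims=True)
d /= np.linalg.norm(d, axis=1, keepdims=True)
print(f"in-batch InfoNCE loss: {info_nce_loss(q, d):.3f}")
```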
- Evaluating NL2SQL via SQL2NL
A new framework generates semantically equivalent, lexically diverse queries. State-of-the-art models are far more brittle than standard benchmarks suggest.
arXiv Detail & Related papers (2025-09-04T21:03:59Z)
- Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning
We introduce a predictor-corrector learning framework to minimize truncation errors.
We also propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor (a minimal sketch of the EMA update follows this entry).
Our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU while using only a third of its parameters.
arXiv Detail & Related papers (2024-11-05T12:26:25Z)
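The exponential moving average update at the heart of the coefficient learning above can be shown in a few lines. The decay value and the stand-in "learned" coefficients are assumptions for illustration only, not the paper's formulation.

```python
# EMA blending of predictor coefficients: ema <- decay*ema + (1-decay)*new.
def ema_update(ema, new, decay=0.99):
    return [decay * e + (1.0 - decay) * n for e, n in zip(ema, new)]

coeffs = [1.0, 0.0, 0.0]           # running average of predictor coefficients
for _ in range(200):
    learned = [0.5, 0.3, 0.2]      # stand-in for per-step learned values
    coeffs = ema_update(coeffs, learned)
print([round(c, 3) for c in coeffs])  # converges toward the learned values
```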
- "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization
Quantization is a powerful tool for accelerating large language model (LLM) inference, but the accuracy-performance trade-offs across different formats remain unclear. We conduct the most comprehensive empirical study to date, evaluating FP8, INT8, and INT4 quantization across academic benchmarks and real-world tasks (a minimal INT8 sketch follows this entry).
arXiv Detail & Related papers (2024-11-04T18:21:59Z)
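For concreteness, here is a minimal numpy sketch of symmetric round-to-nearest INT8 weight quantization, the simplest of the formats compared. Production evaluations typically use calibrated and often per-channel scales, so this per-tensor version is a deliberate simplification.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # Symmetric per-tensor scale: map the max magnitude onto the int8 range.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, s = quantize_int8(w)
print(f"max abs error: {np.abs(w - dequantize(q, s)).max():.4f}")
```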
- On the Worst Prompt Performance of Large Language Models
The performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts.
We introduce RobustAlpacaEval, a new benchmark that consists of semantically equivalent case-level queries.
Experiments on RobustAlpacaEval with ChatGPT and six open-source LLMs from the Llama, Mistral, and Gemma families uncover substantial variability in model performance (a minimal worst-case scoring sketch follows this entry).
arXiv Detail & Related papers (2024-06-08T13:40:38Z)
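The worst-case framing is simple to sketch: score the same model on a set of semantically equivalent paraphrases and report the worst, best, and average result. The paraphrases and scoring function below are toy stand-ins, not the benchmark's data or metric.

```python
def performance_spread(score_fn, paraphrases):
    scores = [score_fn(p) for p in paraphrases]
    return {"worst": min(scores), "best": max(scores),
            "avg": sum(scores) / len(scores)}

# Toy scorer: pretend longer phrasings do slightly worse.
toy_score = lambda p: max(0.0, 1.0 - 0.02 * len(p.split()))
print(performance_spread(toy_score, [
    "List three uses of graphite.",
    "Name three things graphite is used for.",
    "What are three common applications of graphite?",
]))
```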
- Comparing Specialised Small and General Large Language Models on Text Classification: 100 Labelled Samples to Achieve Break-Even Performance
We show that the specialised models often need only a few samples (on average 100) to be on par with or better than the general ones. When performance variance is taken into consideration, the number of required labels increases on average by 100-200%. Larger models do not consistently lead to better performance and lower variance, with 4-bit quantisation having negligible impact.
arXiv Detail & Related papers (2024-02-20T08:38:24Z)
- MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models
We introduce MT-Eval, a comprehensive benchmark designed to evaluate multi-turn conversational abilities.
By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up.
Our evaluation of 11 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks.
arXiv Detail & Related papers (2024-01-30T04:50:28Z)
- Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting
Large language models (LLMs) are adopted as a fundamental component of language technologies.
We find that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt format in few-shot settings.
We propose an algorithm that rapidly evaluates a sampled set of plausible prompt formats for a given task and reports the interval of expected performance without accessing model weights (see the sketch after this entry).
arXiv Detail & Related papers (2023-10-17T15:03:30Z)
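A minimal sketch of that procedure: draw prompt formats from a small grammar of separators and casings, evaluate each, and report the observed performance interval. The grammar and toy evaluator are assumptions for illustration; the paper's algorithm is more involved.

```python
import itertools
import random

SEPARATORS = [": ", " - ", "\n"]
CASINGS = [str.lower, str.upper, str.title]

def sample_formats(k, seed=0):
    pool = list(itertools.product(SEPARATORS, CASINGS))
    return random.Random(seed).sample(pool, k)

def performance_interval(evaluate, k=6):
    scores = [evaluate(sep, case) for sep, case in sample_formats(k)]
    return min(scores), max(scores)

# Toy evaluator: accuracy depends (weakly) on the separator choice only.
toy_eval = lambda sep, case: 0.70 + 0.05 * SEPARATORS.index(sep)
lo, hi = performance_interval(toy_eval)
print(f"expected performance interval: [{lo:.2f}, {hi:.2f}]")
```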
- PLATON: Pruning Large Transformer Models with Upper Confidence Bound of Weight Importance
We propose PLATON, which captures the uncertainty of importance scores via an upper confidence bound (UCB) on the importance estimates (a minimal scoring sketch follows this entry).
We conduct extensive experiments with several Transformer-based models on natural language understanding, question answering, and image classification.
arXiv Detail & Related papers (2022-06-25T05:38:39Z)
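In the spirit of the entry above, a minimal numpy sketch of UCB-style importance scoring for pruning: smooth a noisy per-step importance signal with an EMA, track its local variation as an uncertainty term, and rank weights by importance plus uncertainty. The smoothing constants and the additive combination are illustrative assumptions and may differ from PLATON's exact formulation.

```python
import numpy as np

def ucb_scores(imp_ema, unc_ema, imp_now, beta1=0.85, beta2=0.95):
    imp_ema = beta1 * imp_ema + (1 - beta1) * imp_now   # smoothed importance
    unc_ema = beta2 * unc_ema + (1 - beta2) * np.abs(imp_now - imp_ema)
    return imp_ema, unc_ema, imp_ema + unc_ema          # UCB-style score

rng = np.random.default_rng(0)
imp_ema, unc_ema = np.zeros(10), np.zeros(10)
for _ in range(50):  # simulate noisy per-step importance (e.g., |w * grad|)
    imp_now = np.abs(rng.normal(loc=np.linspace(0.0, 1.0, 10), scale=0.2))
    imp_ema, unc_ema, ucb = ucb_scores(imp_ema, unc_ema, imp_now)
keep = ucb >= np.quantile(ucb, 0.5)  # keep the top half by UCB score
print(keep.astype(int))
```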