Beyond Words: How Large Language Models Perform in Quantitative Management Problem-Solving
- URL: http://arxiv.org/abs/2502.16556v1
- Date: Sun, 23 Feb 2025 12:39:39 GMT
- Title: Beyond Words: How Large Language Models Perform in Quantitative Management Problem-Solving
- Authors: Jonathan Kuzmanko
- Abstract summary: This study examines how Large Language Models (LLMs) perform when tackling quantitative management decision problems in a zero-shot setting. We analyze 900 responses generated by five leading models across 20 diverse managerial scenarios.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study examines how Large Language Models (LLMs) perform when tackling quantitative management decision problems in a zero-shot setting. Drawing on 900 responses generated by five leading models across 20 diverse managerial scenarios, our analysis explores whether these base models can deliver accurate numerical decisions under varying presentation formats, scenario complexities, and repeated attempts. Contrary to prior findings, we observed no significant effects of text presentation format (direct, narrative, or tabular) or text length on accuracy. However, scenario complexity, particularly the number of constraints and irrelevant parameters, strongly influenced performance, often degrading accuracy. Surprisingly, the models handled tasks requiring multiple solution steps more effectively than expected. Notably, only 28.8% of responses were exactly correct, highlighting limitations in precision. We further found no significant "learning effect" across iterations: performance remained stable across repeated queries. Nonetheless, significant variations emerged among the five tested LLMs, with some showing superior binary accuracy. Overall, these findings underscore both the promise and the pitfalls of harnessing LLMs for complex quantitative decision-making, informing managers and researchers about optimal deployment strategies.
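To make the setup concrete, the following is a minimal sketch of such a zero-shot evaluation loop, not the authors' code: the `ask` interface, the model names, the numeric-extraction rule, and the exact-match tolerance are all illustrative assumptions, and the nine repeats simply follow from 900 / (5 models × 20 scenarios).

```python
import re

MODELS = ["model-a", "model-b", "model-c", "model-d", "model-e"]  # hypothetical IDs
REPEATS = 9  # 900 responses / (5 models * 20 scenarios)

def extract_number(text):
    """Pull the last numeric token out of a free-text answer."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None

def exact_accuracy(ask, scenarios):
    """ask(model, prompt) -> response text; scenarios is a list of (prompt, truth)."""
    hits = total = 0
    for model in MODELS:
        for prompt, truth in scenarios:
            for _ in range(REPEATS):  # repeated attempts; the paper found no learning effect
                value = extract_number(ask(model, prompt))
                hits += int(value is not None and abs(value - truth) < 1e-6)
                total += 1
    return hits / total  # the paper reports 28.8% of responses exactly correct
```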
Related papers
- Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT? [59.418994222096885]
We conduct a detailed analysis of model performance on the AIME24 dataset.
We categorize questions into four tiers (Easy, Medium, Hard, and Extremely Hard).
We find that progression from the Easy to the Medium tier requires adopting an R1 reasoning style, achievable with minimal SFT (on the order of 1K instances).
Extremely Hard (Exh-level) questions present a fundamentally different challenge: they require unconventional problem-solving skills.
arXiv Detail & Related papers (2025-04-16T03:39:38Z) - Out of Style: RAG's Fragility to Linguistic Variation [29.59506089890902]
User queries exhibit greater linguistic variations and can trigger cascading errors across interdependent RAG components.
We analyze how varying four linguistic dimensions (formality, readability, politeness, and grammatical correctness) impact RAG performance.
arXiv Detail & Related papers (2025-04-11T03:30:26Z) - Learning LLM Preference over Intra-Dialogue Pairs: A Framework for Utterance-level Understandings [9.763273544617176]
Large language models (LLMs) have demonstrated remarkable capabilities in handling complex dialogue tasks without requiring use-case-specific fine-tuning.
In this paper, we introduce a simple yet effective framework to address this challenge.
Our approach is specifically designed for per-utterance classification problems, which encompass tasks such as intent detection, dialogue state tracking, and more.
arXiv Detail & Related papers (2025-03-07T17:46:13Z) - Error Classification of Large Language Models on Math Word Problems: A Dynamically Adaptive Framework [64.83955753606443]
Math Word Problems serve as a crucial benchmark for evaluating Large Language Models' reasoning abilities. Current error classification methods rely on static and predefined categories. We introduce MWPES-300K, a comprehensive dataset containing 304,865 error samples.
arXiv Detail & Related papers (2025-01-26T16:17:57Z) - Verbosity $\neq$ Veracity: Demystify Verbosity Compensation Behavior of Large Language Models [8.846200844870767]
We discover an understudied type of undesirable behavior in Large Language Models (LLMs), which we term Verbosity Compensation (VC); it resembles the hesitation behavior of humans under uncertainty. We propose a simple yet effective cascade algorithm that replaces verbose responses with responses generated by other models, as sketched below.
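As a rough illustration of such a cascade (the word-count verbosity test, the model ordering, and the `generate` interface are assumptions, not the paper's specification):

```python
def cascade_answer(prompt, models, generate, max_words=60):
    """Try models in order, falling back whenever a response looks verbose.

    generate(model, prompt) -> str is an assumed interface; the word-count
    check stands in for whatever verbosity detector the paper actually uses.
    """
    response = ""
    for model in models:  # e.g. cheapest model first, stronger models as fallback
        response = generate(model, prompt)
        if len(response.split()) <= max_words:  # concise enough: accept it
            return response
    return response  # every model was verbose: return the last response
```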
arXiv Detail & Related papers (2024-11-12T15:15:20Z) - Optimal Query Allocation in Extractive QA with LLMs: A Learning-to-Defer Framework with Theoretical Guarantees [3.4289478404209826]
Large Language Models excel in generative tasks but exhibit inefficiencies in structured text selection. We propose a Learning-to-Defer framework that allocates queries to specialized experts, ensuring high-confidence predictions.
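A bare-bones version of such a deferral rule (the threshold and both predictor interfaces are illustrative assumptions; the paper's contribution is a learned allocation policy with theoretical guarantees, not this heuristic):

```python
def answer_or_defer(query, llm, expert, threshold=0.9):
    """Route a query to the LLM, deferring to a specialized expert when the
    LLM's confidence is low. Both callables are assumed to return a
    (prediction, confidence) pair with confidence in [0, 1]."""
    prediction, confidence = llm(query)
    if confidence >= threshold:
        return prediction      # keep the cheap LLM answer
    return expert(query)[0]    # defer: trade extra cost for reliability
```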
arXiv Detail & Related papers (2024-10-21T08:21:00Z) - MM-R$^3$: On (In-)Consistency of Multi-modal Large Language Models (MLLMs) [26.475993408532304]
We study the ability of an MLLM to produce semantically similar or identical responses to semantically similar queries.
We propose the MM-R$^3$ benchmark, which analyses the performance of SoTA MLLMs in terms of consistency and accuracy.
Our analysis reveals that consistency does not always align with accuracy, indicating that models with higher accuracy are not necessarily more consistent, and vice versa.
arXiv Detail & Related papers (2024-10-07T06:36:55Z) - ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection [60.297079601066784]
We introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in error detection.
ErrorRadar evaluates two sub-tasks: error step identification and error categorization.
It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions.
Results indicate that significant challenges remain: even the best-performing model, GPT-4o, still lags human evaluation by around 10%.
arXiv Detail & Related papers (2024-10-06T14:59:09Z) - On the Worst Prompt Performance of Large Language Models [93.13542053835542]
The performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts.
We introduce RobustAlpacaEval, a new benchmark that consists of semantically equivalent case-level queries.
Experiments on RobustAlpacaEval with ChatGPT and six open-source LLMs from the Llama, Mistral, and Gemma families uncover substantial variability in model performance.
arXiv Detail & Related papers (2024-06-08T13:40:38Z) - Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios.
We implement UAL in a simple fashion: the label smoothing value used during training is set adaptively according to the uncertainty of individual samples (see the sketch after this entry).
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
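One way to read that mechanism, as a sketch in PyTorch (the uncertainty source and the linear smoothing schedule are illustrative assumptions, not the authors' exact recipe):

```python
import torch
import torch.nn.functional as F

def ual_loss(logits, targets, uncertainty, max_smoothing=0.2):
    """Cross-entropy with per-sample label smoothing scaled by uncertainty.

    logits: (batch, classes); targets: (batch,) class indices;
    uncertainty: (batch,) values in [0, 1], e.g. from sampling disagreement (assumed).
    """
    num_classes = logits.size(-1)
    eps = max_smoothing * uncertainty  # more uncertain sample -> softer target
    one_hot = F.one_hot(targets, num_classes).float()
    soft = (1 - eps).unsqueeze(1) * one_hot + (eps / num_classes).unsqueeze(1)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft * log_probs).sum(dim=-1).mean()
```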
arXiv Detail & Related papers (2024-06-07T11:37:45Z) - The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance.
Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes.
We analyze the representation space, generated responses, and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z) - On Context Utilization in Summarization with Large Language Models [83.84459732796302]
Large language models (LLMs) excel in abstractive summarization tasks, delivering fluent and pertinent summaries.
Recent advancements have extended their capabilities to handle long-input contexts, exceeding 100k tokens.
We conduct the first comprehensive study on context utilization and position bias in summarization.
arXiv Detail & Related papers (2023-10-16T16:45:12Z) - Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning [14.663216851932646]
We show that language models tend to perform fairly well at single step inference tasks, but struggle to chain together multiple reasoning steps to solve more complex problems.
We propose a Selection-Inference (SI) framework that exploits pre-trained LLMs as general processing modules, alternating between selection and inference steps to produce interpretable reasoning traces (sketched below).
We show that a 7B parameter LLM used within the SI framework in a 5-shot generalisation setting, with no fine-tuning, yields a performance improvement of over 100%.
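A schematic of that selection-inference loop (the prompt wording and the `llm` interface are assumptions; the actual framework uses few-shot prompted modules rather than these bare instructions):

```python
def selection_inference(question, facts, llm, steps=3):
    """Alternate selection (pick relevant facts) and inference (derive a new
    fact) before answering. llm(prompt) -> str is an assumed interface."""
    for _ in range(steps):
        selected = llm(
            f"Facts: {facts}\nQuestion: {question}\n"
            "Select the facts needed for the next reasoning step:"
        )
        new_fact = llm(f"Given: {selected}\nInfer one new fact:")
        facts = facts + [new_fact]  # grow the context with each derived fact
    return llm(f"Facts: {facts}\nQuestion: {question}\nAnswer:")
```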
arXiv Detail & Related papers (2022-05-19T17:25:28Z) - Logic-Guided Data Augmentation and Regularization for Consistent Question Answering [55.05667583529711]
This paper addresses the problem of improving the accuracy and consistency of responses to comparison questions.
Our method leverages logical and linguistic knowledge to augment labeled training data and then uses a consistency-based regularizer to train the model.
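For comparison questions the logical knowledge is essentially symmetry: if "A is larger than B" is true, then "B is larger than A" must be false. A toy version of the augmentation and the consistency regularizer (the explicit entity pair and the squared-error penalty are illustrative choices, not the paper's exact formulation):

```python
def augment_with_symmetry(question, a, b, answer):
    """Flip a comparison question and invert its label, e.g.
    ("Is A taller than B?", True) -> ("Is B taller than A?", False).
    The entity pair (a, b) is supplied explicitly here; the paper derives
    such pairs from linguistic knowledge."""
    flipped = question.replace(a, "__TMP__").replace(b, a).replace("__TMP__", b)
    return flipped, (not answer)

def consistency_penalty(p_yes_original, p_yes_flipped):
    """For a symmetric comparison, the two 'yes' probabilities should sum
    to roughly 1; penalize the squared deviation during training."""
    return (p_yes_original + p_yes_flipped - 1.0) ** 2
```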
arXiv Detail & Related papers (2020-04-21T17:03:08Z)