Related papers: Evaluating LLM Reasoning in the Operations Research Domain with ORQA

URL: http://arxiv.org/abs/2412.17874v2
Date: Sun, 09 Feb 2025 16:39:50 GMT
Title: Evaluating LLM Reasoning in the Operations Research Domain with ORQA
Authors: Mahdi Mostajabdaveh, Timothy T. Yu, Samarendra Chandan Bindu Dash, Rindranirina Ramamonjison, Jabo Serge Byusa, Giuseppe Carenini, Zirui Zhou, Yong Zhang
Abstract summary: We introduce and apply Operations Research Question Answering (ORQA), a new benchmark designed to assess the generalization capabilities of Large Language Models (LLMs). The dataset features real-world optimization problems that demand multistep reasoning to construct their mathematical models. Our evaluations of various open source LLMs, such as LLaMA 3.1, DeepSeek, and Mixtral, reveal their modest performance, highlighting a gap in their ability to generalize to specialized technical domains.
Score: 19.72699080797411
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper, we introduce and apply Operations Research Question Answering (ORQA), a new benchmark designed to assess the generalization capabilities of Large Language Models (LLMs) in the specialized technical domain of Operations Research (OR). This benchmark evaluates whether LLMs can emulate the knowledge and reasoning skills of OR experts when confronted with diverse and complex optimization problems. The dataset, developed by OR experts, features real-world optimization problems that demand multistep reasoning to construct their mathematical models. Our evaluations of various open source LLMs, such as LLaMA 3.1, DeepSeek, and Mixtral, reveal their modest performance, highlighting a gap in their ability to generalize to specialized technical domains. This work contributes to the ongoing discourse on LLM generalization capabilities, offering valuable insights for future research in this area. The dataset and evaluation code are publicly available.

Related papers

Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective [53.594353527056775]
We propose Chinese Commonsense Multi-hop Reasoning (CCMOR) to evaluate Large Language Models (LLMs). CCMOR is designed to evaluate LLMs' ability to integrate Chinese-specific factual knowledge with multi-step logical reasoning. We implement a human-in-the-loop verification system, where domain experts systematically validate and refine the generated questions.
arXiv Detail & Related papers (2025-10-09T20:29:00Z)
Performance of LLMs on Stochastic Modeling Operations Research Problems: From Theory to Practice [18.040849771712093]
Large language models (LLMs) have exhibited expert-level capabilities across various domains. However, their abilities to solve problems in Operations Research (OR) remain underexplored.
arXiv Detail & Related papers (2025-06-30T14:54:15Z)
Domain Specific Benchmarks for Evaluating Multimodal Large Language Models [3.1546387965618337]
Large language models (LLMs) are increasingly being deployed across disciplines due to their advanced reasoning and problem solving capabilities. This paper introduces a taxonomy of seven key disciplines, encompassing various domains and application areas where LLMs are extensively utilized. We compile and categorize these benchmarks by domain to create an accessible resource for researchers.
arXiv Detail & Related papers (2025-06-15T20:42:45Z)
General-Reasoner: Advancing LLM Reasoning Across All Domains [64.70599911897595]
Reinforcement learning (RL) has recently demonstrated strong potential in enhancing the reasoning capabilities of large language models (LLMs). We propose General-Reasoner, a novel training paradigm designed to enhance LLM reasoning capabilities across diverse domains. We train a series of models and evaluate them on a wide range of datasets covering domains such as physics, chemistry, finance, and electronics.
arXiv Detail & Related papers (2025-05-20T17:41:33Z)
Injecting Domain-Specific Knowledge into Large Language Models: A Comprehensive Survey [39.82566660592583]
Large Language Models (LLMs) have demonstrated remarkable success in various tasks such as natural language understanding, text summarization, and machine translation. Their general-purpose nature often limits their effectiveness in domain-specific applications that require specialized knowledge, such as healthcare, chemistry, or legal analysis. To address this, researchers have explored diverse methods to enhance LLMs by integrating domain-specific knowledge.
arXiv Detail & Related papers (2025-02-15T07:43:43Z)
EVOLvE: Evaluating and Optimizing LLMs For Exploration [76.66831821738927]
Large language models (LLMs) remain under-studied in scenarios requiring optimal decision-making under uncertainty. We measure LLMs' (in)ability to make optimal decisions in bandits, a state-less reinforcement learning setting relevant to many applications. Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs.
arXiv Detail & Related papers (2024-10-08T17:54:03Z)
Are Expert-Level Language Models Expert-Level Annotators? [17.06186816803593]
This work investigates the extent to which LLMs as data annotators perform in domains requiring expert knowledge. To the best of our knowledge, we present the first systematic evaluation of LLMs as expert-level data annotators.
arXiv Detail & Related papers (2024-10-04T09:17:09Z)
Exploring Language Model Generalization in Low-Resource Extractive QA [57.14068405860034]
We investigate Extractive Question Answering (EQA) with Large Language Models (LLMs) under domain drift. We devise a series of experiments to explain the performance gap empirically.
arXiv Detail & Related papers (2024-09-27T05:06:43Z)
Exploring the True Potential: Evaluating the Black-box Optimization Capability of Large Language Models [32.859634302766146]
Large language models (LLMs) have demonstrated exceptional performance in natural language processing tasks. This paper endeavors to offer deep insights into the potential of LLMs in optimization. Our findings reveal both the limitations and advantages of LLMs in optimization.
arXiv Detail & Related papers (2024-04-09T13:17:28Z)
LLM Inference Unveiled: Survey and Roofline Model Insights [62.92811060490876]
Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges. Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on roofline model. This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems.
arXiv Detail & Related papers (2024-02-26T07:33:05Z)
Evolutionary Computation in the Era of Large Language Model: Survey and Roadmap [26.959633651475016]
Large language models (LLMs) and evolutionary algorithms (EAs) share a common pursuit of applicability in complex problems. The abundant domain knowledge inherent in LLMs could enable EAs to conduct more intelligent searches. This paper provides a thorough review and a forward-looking roadmap, categorizing the reciprocal inspiration into two main avenues.
arXiv Detail & Related papers (2024-01-18T14:58:17Z)
Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions [47.83142414018448]
We focus on two popular reasoning tasks: arithmetic reasoning and code generation. We introduce (i) a general ontology of perturbations for math and coding questions, (ii) a semi-automatic method to apply these perturbations, and (iii) two datasets. We show a significant performance drop across all the models against perturbed questions.
arXiv Detail & Related papers (2024-01-17T18:13:07Z)
Tapping the Potential of Large Language Models as Recommender Systems: A Comprehensive Framework and Empirical Analysis [91.5632751731927]
Large Language Models such as ChatGPT have showcased remarkable abilities in solving general tasks. We propose a general framework for utilizing LLMs in recommendation tasks, focusing on the capabilities of LLMs as recommenders. We analyze the impact of public availability, tuning strategies, model architecture, parameter scale, and context length on recommendation results.
arXiv Detail & Related papers (2024-01-10T08:28:56Z)
Knowledge Plugins: Enhancing Large Language Models for Domain-Specific Recommendations [50.81844184210381]
We propose a general paradigm that augments large language models with DOmain-specific KnowledgE to enhance their performance on practical applications, namely DOKE. This paradigm relies on a domain knowledge extractor, working in three steps: 1) preparing effective knowledge for the task; 2) selecting the knowledge for each specific sample; and 3) expressing the knowledge in an LLM-understandable way.
arXiv Detail & Related papers (2023-11-16T07:09:38Z)
Exploring the Potential of Large Language Models in Computational Argumentation [54.85665903448207]
Large language models (LLMs) have demonstrated impressive capabilities in understanding context and generating natural language. This work aims to embark on an assessment of LLMs, such as ChatGPT, Flan models, and LLaMA2 models, in both zero-shot and few-shot settings.
arXiv Detail & Related papers (2023-11-15T15:12:15Z)
Through the Lens of Core Competency: Survey on Evaluation of Large Language Models [27.271533306818732]
Large language models (LLMs) have excellent performance and wide practical uses. Existing evaluation tasks struggle to keep up with the wide range of applications in real-world scenarios. We summarize 4 core competencies of LLMs, including reasoning, knowledge, reliability, and safety. Under this competency architecture, similar tasks are combined to reflect the corresponding ability, while new tasks can also be easily added to the system.
arXiv Detail & Related papers (2023-08-15T17:40:34Z)
Domain Specialization as the Key to Make Large Language Models Disruptive: A Comprehensive Survey [100.24095818099522]
Large language models (LLMs) have significantly advanced the field of natural language processing (NLP). They provide a highly useful, task-agnostic foundation for a wide range of applications. However, directly applying LLMs to solve sophisticated problems in specific domains meets many hurdles.
arXiv Detail & Related papers (2023-05-30T03:00:30Z)
Empower Large Language Model to Perform Better on Industrial Domain-Specific Question Answering [36.31193273252256]
Large Language Models (LLMs) have gained popularity and achieved remarkable results in open-domain tasks, but their performance in real industrial domain-specific scenarios is average due to a lack of specific domain knowledge. We provide a benchmark Question Answering (QA) dataset named MSQA, centered around Microsoft products and IT technical problems encountered by customers.
arXiv Detail & Related papers (2023-05-19T09:23:25Z)

This list is automatically generated from the titles and abstracts of the papers on this site.