Reasoning in a Combinatorial and Constrained World: Benchmarking LLMs on Natural-Language Combinatorial Optimization
- URL: http://arxiv.org/abs/2602.02188v1
- Date: Mon, 02 Feb 2026 14:55:48 GMT
- Title: Reasoning in a Combinatorial and Constrained World: Benchmarking LLMs on Natural-Language Combinatorial Optimization
- Authors: Xia Jiang, Jing Chen, Cong Zhang, Jie Gao, Chengpeng Hu, Chenhao Zhang, Yaoxin Wu, Yingqian Zhang,
- Abstract summary: Large language models (LLMs) have shown strong performance in math and logic reasoning.<n>But their ability to handle systematic optimization (CO) remains underexplored.<n>We introduce NLCO, a benchmark that evaluates LLMs on end-to-end CO reasoning.
- Score: 28.52469449694436
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: While large language models (LLMs) have shown strong performance in math and logic reasoning, their ability to handle combinatorial optimization (CO) -- searching high-dimensional solution spaces under hard constraints -- remains underexplored. To bridge the gap, we introduce NLCO, a \textbf{N}atural \textbf{L}anguage \textbf{C}ombinatorial \textbf{O}ptimization benchmark that evaluates LLMs on end-to-end CO reasoning: given a language-described decision-making scenario, the model must output a discrete solution without writing code or calling external solvers. NLCO covers 43 CO problems and is organized using a four-layer taxonomy of variable types, constraint families, global patterns, and objective classes, enabling fine-grained evaluation. We provide solver-annotated solutions and comprehensively evaluate LLMs by feasibility, solution optimality, and reasoning efficiency. Experiments across a wide range of modern LLMs show that high-performing models achieve strong feasibility and solution quality on small instances, but both degrade as instance size grows, even if more tokens are used for reasoning. We also observe systematic effects across the taxonomy: set-based tasks are relatively easy, whereas graph-structured problems and bottleneck objectives lead to more frequent failures.
Related papers
- Multi-Objective Hierarchical Optimization with Large Language Models [41.41567058185742]
Large Language Models (LLMs) are not the off-the-shelf choice to drive multi-objective optimization yet.<n>In this paper, we close this gap by leveraging LLMs as surrogate models and candidate samplers inside a structured hierarchical search strategy.
arXiv Detail & Related papers (2026-01-20T12:10:13Z) - DRAGON: LLM-Driven Decomposition and Reconstruction Agents for Large-Scale Combinatorial Optimization [40.88623618289683]
Large Language Models (LLMs) have recently shown promise in addressing optimization problems (COPs) through prompt-based strategies.<n>We propose DRAGON, which combines the strengths of metaheuristic design and LLM reasoning.<n>By continuously interacting with the optimization environment and leveraging an adaptive experience memory, the agents iteratively learn from feedback.
arXiv Detail & Related papers (2026-01-10T09:31:40Z) - Large Language Models as End-to-end Combinatorial Optimization Solvers [45.32050615257007]
Combinatorial optimization (CO) problems, central to decision-making scenarios like logistics and manufacturing, are traditionally solved using problem-specific algorithms.<n>Existing approaches rely on intermediate steps such as code generation or solver invocation, limiting their generality and accessibility.<n>This paper introduces a novel framework that empowers large language models (LLMs) to serve as end-to-end CO solvers by directly mapping natural language problem descriptions to solutions.
arXiv Detail & Related papers (2025-09-21T01:30:30Z) - Reasoning with Preference Constraints: A Benchmark for Language Models in Many-to-One Matching Markets [13.111181135818184]
Large language models (LLMs) have shown strong performance on complex mathematical tasks, including optimization.<n>Applying LLMs to matching problems, which require reasoning under preferential and structural constraints, remains underexplored.<n>We employ a novel benchmark of 369 instances of the College Admission Problem to evaluate LLMs across key dimensions: feasibility, stability, and optimality.
arXiv Detail & Related papers (2025-09-16T14:48:46Z) - From Natural Language to Solver-Ready Power System Optimization: An LLM-Assisted, Validation-in-the-Loop Framework [1.7136832159667206]
This paper introduces a novel Large Language Models (LLMs)-assisted agent that automatically converts natural-language descriptions of power system optimization scenarios into compact, solver-ready formulations.<n>The proposed method focuses on discovering a mathematically compatible formulation that can be efficiently solved by off-the-shelf optimization solvers.
arXiv Detail & Related papers (2025-08-11T16:22:57Z) - CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation [53.452699232071495]
We introduce CrossWordBench, a benchmark designed to evaluate the reasoning capabilities of Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) through the medium of crossword puzzles.<n>Our evaluation reveals that reasoning LLMs substantially outperform non-reasoning models by effectively leveraging crossing-letter constraints.<n>Our findings highlight limitations of the reasoning capabilities of current LLMs and LVLMs, and provide an effective approach for creating multimodal constrained tasks for future evaluations.
arXiv Detail & Related papers (2025-03-30T20:03:36Z) - ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning [92.76959707441954]
We introduce ZebraLogic, a comprehensive evaluation framework for assessing LLM reasoning performance.<n>ZebraLogic enables the generation of puzzles with controllable and quantifiable complexity.<n>Our results reveal a significant decline in accuracy as problem complexity grows.
arXiv Detail & Related papers (2025-02-03T06:44:49Z) - In-context Demonstration Matters: On Prompt Optimization for Pseudo-Supervision Refinement [71.60563181678323]
Large language models (LLMs) have achieved great success across diverse tasks, and fine-tuning is sometimes needed to further enhance generation quality.<n>To handle these challenges, a direct solution is to generate high-confidence'' data from unsupervised downstream tasks.<n>We propose a novel approach, pseudo-supervised demonstrations aligned prompt optimization (PAPO) algorithm, which jointly refines both the prompt and the overall pseudo-supervision.
arXiv Detail & Related papers (2024-10-04T03:39:28Z) - NPHardEval4V: Dynamic Evaluation of Large Vision-Language Models with Effects of Vision [64.83085920775316]
We introduce NPHardEval4V, a multimodal benchmark suite grounded in four classical NP-hard problems.<n>Each task is presented through a combination of structured visual layouts and textual prompts, designed to assess the ability of LVLMs to perform reasoning under visual-linguistic constraints.<n>Our results show that while these models perform reasonably well on perception-based inputs, they struggle with global optimization, abstraction, and constraint satisfaction.
arXiv Detail & Related papers (2024-03-04T07:10:31Z) - Let the Flows Tell: Solving Graph Combinatorial Optimization Problems
with GFlowNets [86.43523688236077]
Combinatorial optimization (CO) problems are often NP-hard and out of reach for exact algorithms.
GFlowNets have emerged as a powerful machinery to efficiently sample from composite unnormalized densities sequentially.
In this paper, we design Markov decision processes (MDPs) for different problems and propose to train conditional GFlowNets to sample from the solution space.
arXiv Detail & Related papers (2023-05-26T15:13:09Z) - A Bi-Level Framework for Learning to Solve Combinatorial Optimization on
Graphs [91.07247251502564]
We propose a hybrid approach to combine the best of the two worlds, in which a bi-level framework is developed with an upper-level learning method to optimize the graph.
Such a bi-level approach simplifies the learning on the original hard CO and can effectively mitigate the demand for model capacity.
arXiv Detail & Related papers (2021-06-09T09:18:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.