Related papers: SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers

URL: http://arxiv.org/abs/2502.20545v1
Date: Thu, 27 Feb 2025 21:41:43 GMT
Title: SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers
Authors: Kechen Li, Wenqi Zhu, Coralia Cartis, Tianbo Ji, Shiwei Liu,
Abstract summary: Large Language Models (LLMs) have achieved human-level proficiency across diverse tasks, but their ability to perform rigorous mathematical problem solving remains an open challenge. In this work, we investigate a fundamental yet intractable problem: determining whether a given multivariate polynomial is nonnegative. Our findings highlight the potential of LLMs to push the boundaries of mathematical reasoning and tackle NP-hard problems.
Score: 17.326575243638437
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have achieved human-level proficiency across diverse tasks, but their ability to perform rigorous mathematical problem solving remains an open challenge. In this work, we investigate a fundamental yet computationally intractable problem: determining whether a given multivariate polynomial is nonnegative. This problem, closely related to Hilbert's Seventeenth Problem, plays a crucial role in global polynomial optimization and has applications in various fields. First, we introduce SoS-1K, a meticulously curated dataset of approximately 1,000 polynomials, along with expert-designed reasoning instructions based on five progressively challenging criteria. Evaluating multiple state-of-the-art LLMs, we find that without structured guidance, all models perform only slightly above the random-guess baseline of 50%. However, high-quality reasoning instructions significantly improve accuracy, boosting performance up to 81%. Furthermore, our 7B model, SoS-7B, fine-tuned on SoS-1K for just 4 hours, outperforms the 671B DeepSeek-V3 and GPT-4o-mini in accuracy while requiring only 1.8% and 5% of their computation time, respectively. Our findings highlight the potential of LLMs to push the boundaries of mathematical reasoning and tackle NP-hard problems.
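
For readers unfamiliar with the sum-of-squares (SoS) idea named in the title, the following is a minimal sketch and not the paper's method (the paper evaluates LLM reasoning). It illustrates the standard fact that a polynomial written as p(x) = z^T Q z, with z a vector of monomials and Q a symmetric positive semidefinite "Gram" matrix, is a sum of squares and therefore nonnegative. The function names gram_certifies_sos and poly_from_gram are hypothetical helpers introduced only for this illustration, and the univariate basis z = [1, x, x^2] is an assumption chosen to keep the example small.

```python
import numpy as np


def gram_certifies_sos(Q, tol=1e-9):
    """Return True if Q is symmetric and positive semidefinite (up to tol).

    If so, p(x) = z^T Q z is a sum of squares, hence nonnegative everywhere.
    """
    Q = np.asarray(Q, dtype=float)
    if not np.allclose(Q, Q.T):
        return False
    # eigvalsh is intended for symmetric matrices and returns real eigenvalues.
    return bool(np.linalg.eigvalsh(Q).min() >= -tol)


def poly_from_gram(Q, x):
    """Evaluate p(x) = z^T Q z for the assumed monomial basis z = [1, x, x^2]."""
    z = np.array([1.0, x, x * x])
    return float(z @ np.asarray(Q, dtype=float) @ z)


# p(x) = x^4 - 2x^2 + 1 = (x^2 - 1)^2, written with a PSD Gram matrix.
Q_good = [[1, 0, -1],
          [0, 0, 0],
          [-1, 0, 1]]

# p(x) = x^2 - 1 is negative at x = 0; this Gram matrix is not PSD.
Q_bad = [[-1, 0, 0],
         [0, 1, 0],
         [0, 0, 0]]

if __name__ == "__main__":
    print(gram_certifies_sos(Q_good))
    print(gram_certifies_sos(Q_bad))
```

Note that a Gram matrix for a given polynomial is generally not unique, so one non-PSD candidate does not by itself show that a polynomial is not a sum of squares; searching over all valid Gram matrices is a semidefinite program. A sum-of-squares certificate is also only a sufficient condition: some nonnegative polynomials are not sums of squares.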

Related papers

Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem [53.3188041952701]
We show that Critique Fine-Tuning (CFT) on only one problem can effectively unleash the reasoning potential of LLMs. With just 5 GPU hours of training, Qwen-Math-7B-CFT shows an average improvement of 15% on six math benchmarks and 16% on three logic reasoning benchmarks. Results are comparable to or even surpass the results from RL with 20x less compute.
arXiv Detail & Related papers (2025-06-03T18:35:52Z)
Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT? [59.418994222096885]
We conduct a detailed analysis of model performance on the AIME24 dataset. We categorize questions into four tiers (Easy, Medium, Hard, and Extremely Hard). We find that progression from Easy to Medium tier requires adopting an R1 reasoning style with minimal SFT-1K instances. Extremely Hard (Exh-level) questions present a fundamentally different challenge; they require unconventional problem-solving skills.
arXiv Detail & Related papers (2025-04-16T03:39:38Z)
Large Language Models in Numberland: A Quick Test of Their Numerical Reasoning Abilities [0.0]
"Numberland" is a 100-problem test to evaluate the numerical reasoning abilities of LLM-based agents. We evaluated five LLM-based agents: OpenAI's o1 and o1-mini, Google Gemini, Microsoft Copilot, and Anthropic Claude. We tested the top 24 solver (o1 with 73% accuracy) on 25 harder problems, and its score fell to 27%, confirming search as a bottleneck.
arXiv Detail & Related papers (2025-03-31T21:06:39Z)
Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling [69.57918638435491]
Test-Time Scaling is an important method for improving the performance of Large Language Models. This paper focuses on two core questions: What is the optimal approach to scale test-time computation across different policy models, PRMs, and problem difficulty levels? We show that with our compute-optimal TTS strategy, extremely small policy models can outperform larger models.
arXiv Detail & Related papers (2025-02-10T17:30:23Z)
HARP: A challenging human-annotated math reasoning benchmark [7.691786865279827]
We introduce HARP, Human Annotated Reasoning Problems (for Math), consisting of 5,409 problems from the US national math competitions (A(J)HSME, AMC, AIME, USA(J)MO). Of these, 4,780 have answers that are automatically check-able (with libraries such as SymPy). These problems span six difficulty levels, with frontier models performing relatively poorly on the hardest bracket of 197 problems (average accuracy 41.1% for o1-mini, and 9.6% for Gemini 1.5 Pro). Our dataset also features multiple choices (for 4,110 problems) and an average of two human-written
arXiv Detail & Related papers (2024-12-11T23:31:06Z)
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions [72.56339136017759]
We introduce BigCodeBench, a benchmark that challenges Large Language Models (LLMs) to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained tasks. Our evaluation shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%. We propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions only with essential information.
arXiv Detail & Related papers (2024-06-22T15:52:04Z)
Navigating the Labyrinth: Evaluating and Enhancing LLMs' Ability to Reason About Search Problems [59.72548591120689]
We introduce a new benchmark, SearchBench, containing 11 unique search problem types. We show that even the most advanced LLMs fail to solve these problems end-to-end in text. Instructing LLMs to generate code that solves the problem helps, but only slightly, e.g., GPT4's performance rises to 11.7%.
arXiv Detail & Related papers (2024-06-18T00:44:58Z)
Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems [50.76385564061713]
Chain-of-Thought (CoT) prompting has enhanced the performance of Large Language Models (LLMs) across various reasoning tasks. CoT usually suffers from three pitfalls: semantic misunderstanding errors, calculation errors, and step-missing errors. We propose Deeply Understanding the Problems (DUP) to improve the LLMs' math problem-solving ability by addressing semantic misunderstanding errors.
arXiv Detail & Related papers (2024-04-23T12:16:05Z)
Large Language Models Struggle with Unreasonability in Math Problems [41.970853209666224]
Large Language Models (LLMs) have shown remarkable success on a wide range of math and reasoning benchmarks. We observe that they often struggle when faced with unreasonable math problems. We propose the Unreasonable Math Problems (UMP) benchmark, designed to evaluate LLMs' ability to detect and respond to unreasonable math problem statements.
arXiv Detail & Related papers (2024-03-28T12:04:28Z)
PAL: Program-aided Language Models [112.94785609781503]
We present Program-Aided Language models (PaL) to understand natural language problems. PaL offloads the solution step to a programmatic runtime such as a Python interpreter. We set new state-of-the-art results in all 12 benchmarks.
arXiv Detail & Related papers (2022-11-18T18:56:13Z)
Large Language Models are Zero-Shot Reasoners [28.6899375595088]
Chain of thought (CoT) prompting is a technique for eliciting complex multi-step reasoning through step-by-step answer examples. We show that LLMs are decent zero-shot reasoners by simply adding "Let's think step by step" before each answer. Experimental results demonstrate that our Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performances.
arXiv Detail & Related papers (2022-05-24T09:22:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.