OpenAI-o1 AB Testing: Does the o1 model really do good reasoning in math problem solving?
- URL: http://arxiv.org/abs/2411.06198v1
- Date: Sat, 09 Nov 2024 14:47:52 GMT
- Title: OpenAI-o1 AB Testing: Does the o1 model really do good reasoning in math problem solving?
- Authors: Leo Li, Ye Luo, Tingyou Pan
- Abstract summary: The Orion-1 model by OpenAI is claimed to have more robust logical reasoning capabilities than previous large language models.
We conduct a comparison experiment using two datasets: one consisting of International Mathematics Olympiad (IMO) problems, which are easily accessible, and one consisting of Chinese National Team Training camp (CNT) problems, which are of similar difficulty but not as publicly accessible.
We conclude that there is no significant evidence to show that the model relies on memorizing problems and solutions.
- Score: 2.851415653352522
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Orion-1 model by OpenAI is claimed to have more robust logical reasoning capabilities than previous large language models. However, some suggest the excellence might be partially due to the model "memorizing" solutions, resulting in less satisfactory performance when prompted with problems not in the training data. We conduct a comparison experiment using two datasets: one consisting of International Mathematics Olympiad (IMO) problems, which is easily accessible; the other one consisting of Chinese National Team Training camp (CNT) problems, which have similar difficulty but are not as publicly accessible. We label the response for each problem and compare the performance between the two datasets. We conclude that there is no significant evidence to show that the model relies on memorizing problems and solutions. Also, we perform case studies to analyze some features of the model's responses.
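The comparison the abstract describes boils down to asking whether the labeled solve rates on the two problem sets differ significantly. As a minimal, hedged sketch of one standard way to test that, the snippet below runs a two-proportion z-test; the choice of test and all counts are illustrative assumptions, not the paper's actual statistics.

```python
# Illustrative two-proportion z-test for comparing solve rates on two problem
# sets (e.g., a public IMO set vs. a less accessible CNT set). The counts and
# sample sizes below are hypothetical placeholders, NOT numbers from the paper.
from math import sqrt, erf

def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int):
    """Return (z statistic, two-sided p-value) for H0: the two solve rates are equal."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)           # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))   # pooled standard error
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))    # two-sided normal tail
    return z, p_value

# Hypothetical example: 60/100 problems labeled "correct" on the public set
# vs. 52/100 on the less accessible set.
z, p = two_proportion_z_test(60, 100, 52, 100)
print(f"z = {z:.3f}, p = {p:.3f}")   # p > 0.05 here -> no significant difference
```

A chi-squared or Fisher exact test would serve the same purpose at sample sizes of this order.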
Related papers
- Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT? [59.418994222096885]
We conduct a detailed analysis of model performance on the AIME24 dataset.
We categorize questions into four tiers (Easy, Medium, Hard, and Extremely Hard).
We find that progression from the Easy to the Medium tier requires adopting an R1 reasoning style with minimal SFT (1K instances).
Extremely Hard (Exh)-level questions present a fundamentally different challenge: they require unconventional problem-solving skills.
arXiv Detail & Related papers (2025-04-16T03:39:38Z) - Some things to know about achieving artificial general intelligence [0.0]
Current and foreseeable GenAI models are not capable of achieving artificial general intelligence because they are burdened with anthropogenic debt.
They depend heavily on human input to provide well-structured problems, architecture, and training data.
They cast every problem as a language pattern learning problem and are thus not capable of the kind of autonomy needed to achieve artificial general intelligence.
arXiv Detail & Related papers (2025-02-10T20:10:26Z) - MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations [90.07275414500154]
We observe significant performance drops on MATH-P-Hard across various models.
We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills.
arXiv Detail & Related papers (2025-02-10T13:31:46Z) - Self-supervised Analogical Learning using Language Models [59.64260218737556]
We propose SAL, a self-supervised analogical learning framework.
SAL mimics the human analogy process and trains models to explicitly transfer high-quality symbolic solutions.
We show that the resulting models outperform base language models on a wide range of reasoning benchmarks.
arXiv Detail & Related papers (2025-02-03T02:31:26Z) - Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs [76.43407125275202]
o1-like models can emulate human-like long-time thinking during inference.
This paper presents the first comprehensive study on the prevalent issue of overthinking in these models.
We propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy.
arXiv Detail & Related papers (2024-12-30T18:55:12Z) - ProcessBench: Identifying Process Errors in Mathematical Reasoning [62.80402845414901]
We introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning.
ProcessBench consists of 3,400 test cases, primarily focused on competition- and Olympiad-level math problems.
We conduct extensive evaluation on ProcessBench, involving two types of models: process reward models (PRMs) and critic models.
arXiv Detail & Related papers (2024-12-09T15:11:40Z) - Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.40788744292739]
We propose a two-player paradigm that separates the roles of reasoning and critique models.
We first propose AutoMathCritique, an automated and scalable framework for collecting critique data.
We demonstrate that the critique models consistently improve the actor's (the reasoning model's) performance on difficult queries at test time.
arXiv Detail & Related papers (2024-11-25T17:11:54Z) - A Comparative Study on Reasoning Patterns of OpenAI's o1 Model [69.08287909042421]
We show that OpenAI's o1 model has achieved the best performance on most datasets.
We also provide a detailed analysis on several reasoning benchmarks.
arXiv Detail & Related papers (2024-10-17T15:09:03Z) - Reasoning Paths Optimization: Learning to Reason and Explore From Diverse Paths [69.39559168050923]
We introduce Reasoning Paths Optimization (RPO), which enables learning to reason and explore from diverse paths.
Our approach encourages favorable branches at each reasoning step while penalizing unfavorable ones, enhancing the model's overall problem-solving performance.
We focus on multi-step reasoning tasks, such as math word problems and science-based exam questions.
arXiv Detail & Related papers (2024-10-07T06:37:25Z) - DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving [15.815363023014248]
We propose Difficulty-Aware Rejection Tuning (DART), which allocates more sampling trials to difficult queries during the synthesis phase, enabling more extensive training on difficult samples.
We fine-tune various base models, ranging from 7B to 70B in size, on our datasets, resulting in a series of strong models called DART-MATH.
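To make "allocates more sampling trials to difficult queries" concrete, here is a minimal sketch of a difficulty-aware trial budget. The budget rule and the generate/is_correct helpers are hypothetical stand-ins, not the actual DART-MATH procedure.

```python
# Illustrative-only sketch of difficulty-aware sampling for data synthesis:
# harder queries get a larger trial budget so they still yield enough verified
# solutions. The budget rule and helper callables are hypothetical and are NOT
# the allocation scheme used by DART-MATH.
from typing import Callable, List

def synthesize_solutions(
    query: str,
    generate: Callable[[str], str],          # hypothetical: sample one candidate solution
    is_correct: Callable[[str, str], bool],  # hypothetical: verify a candidate against the answer key
    estimated_fail_rate: float,              # in [0, 1); higher means a harder query
    base_trials: int = 4,
    max_trials: int = 64,
) -> List[str]:
    """Collect verified solutions, scaling the number of attempts with difficulty."""
    budget = min(max_trials, round(base_trials / max(1.0 - estimated_fail_rate, 1e-3)))
    return [s for s in (generate(query) for _ in range(budget)) if is_correct(query, s)]
```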
arXiv Detail & Related papers (2024-06-18T07:14:02Z) - CHAMP: A Competition-level Dataset for Fine-Grained Analyses of LLMs' Mathematical Reasoning Capabilities [25.857946070979576]
Concept and Hint-Annotated Math Problems (CHAMP) consists of high school math competition problems annotated with concepts and hints.
This benchmark is difficult, with the best model only scoring 58.1% in standard settings.
We find that models often arrive at the correct final answer through wrong reasoning steps.
arXiv Detail & Related papers (2024-01-13T03:18:16Z) - Conic10K: A Challenging Math Problem Understanding and Reasoning Dataset [38.99073257782012]
We propose Conic10K, a challenging math problem dataset on conic sections in Chinese senior high school education.
Our dataset contains problems of varying reasoning depth, while requiring only knowledge of conic sections.
For each problem, we provide a high-quality formal representation, the reasoning steps, and the final solution.
arXiv Detail & Related papers (2023-11-09T02:58:17Z) - Information-Theoretic Measures of Dataset Difficulty [54.538766940287864]
Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans.
We propose an information-theoretic perspective, framing dataset difficulty as the absence of usable information.
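One standard way to make "absence of usable information" precise is the predictive V-information framing commonly associated with this line of work; the block below is an illustrative sketch of that definition, not a verbatim statement from the paper.

```latex
% Illustrative sketch of the usable-information view of difficulty
% (notation is assumed, not quoted from the paper).
% H_V(Y)     : best achievable expected log-loss on labels Y for model family V with the input withheld
% H_V(Y | X) : best achievable expected log-loss for V when the input X is provided
% The dataset is harder for V when X carries less V-usable information about Y:
\[
  I_{\mathcal{V}}(X \to Y) \;=\; H_{\mathcal{V}}(Y) \;-\; H_{\mathcal{V}}(Y \mid X)
\]
```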
arXiv Detail & Related papers (2021-10-16T00:21:42Z) - What's the best place for an AI conference, Vancouver or ______: Why completing comparative questions is difficult [22.04829832439774]
We study the ability of neural LMs to ask (not answer) reasonable questions.
We show that accuracy in this fill-in-the-blank task is well-correlated with human judgements of whether a question is reasonable.
arXiv Detail & Related papers (2021-04-05T14:56:09Z) - Decentralized Federated Learning Preserves Model and Data Privacy [77.454688257702]
We propose a fully decentralized approach that allows knowledge to be shared between trained models.
Students are trained on the output of their teachers via synthetically generated input data.
The results show that a previously untrained student model, trained on its teachers' output, reaches F1-scores comparable to the teacher's.
arXiv Detail & Related papers (2021-02-01T14:38:54Z) - SMART: A Situation Model for Algebra Story Problems via Attributed Grammar [74.1315776256292]
We introduce the concept of a situation model, which originates from psychology studies to represent the mental states of humans in problem-solving.
We show that the proposed model outperforms all previous neural solvers by a large margin while preserving much better interpretability.
arXiv Detail & Related papers (2020-12-27T21:03:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.