Examining the Potential and Pitfalls of ChatGPT in Science and
Engineering Problem-Solving
- URL: http://arxiv.org/abs/2310.08773v2
- Date: Sat, 28 Oct 2023 00:24:57 GMT
- Title: Examining the Potential and Pitfalls of ChatGPT in Science and
Engineering Problem-Solving
- Authors: Karen D. Wang, Eric Burkholder, Carl Wieman, Shima Salehi, Nick Haber
- Abstract summary: The study explores the capabilities of OpenAI's ChatGPT in solving different types of physics problems.
ChatGPT could successfully solve 62.5% of the well-specified problems, but its accuracy drops to 8.3% for under-specified problems.
- Score: 1.3628066756509705
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The study explores the capabilities of OpenAI's ChatGPT in solving different
types of physics problems. ChatGPT (with GPT-4) was queried to solve a total of
40 problems from a college-level engineering physics course. These problems
ranged from well-specified problems, where all data required for solving the
problem was provided, to under-specified, real-world problems where not all
necessary data were given. Our findings show that ChatGPT could successfully
solve 62.5% of the well-specified problems, but its accuracy drops to 8.3% for
under-specified problems. Analysis of the model's incorrect solutions revealed
three distinct failure modes: 1) failure to construct accurate models of the
physical world, 2) failure to make reasonable assumptions about missing data,
and 3) calculation errors. The study offers implications for how to leverage
LLM-augmented instructional materials to enhance STEM education. The insights
also contribute to the broader discourse on AI's strengths and limitations,
serving both educators aiming to leverage the technology and researchers
investigating human-AI collaboration frameworks for problem-solving and
decision-making.
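To make the evaluation setup concrete, here is a minimal sketch of the kind of query-and-grade loop the study describes, assuming the `openai` Python client (v1+); the problems, reference answers, and substring check below are hypothetical placeholders, since the study graded full solutions by hand.

```python
# A minimal sketch of the query-and-grade loop described above, using the
# openai Python client (v1+). The problems and reference answers are
# hypothetical placeholders; the study itself graded full solutions by
# hand rather than by string matching.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder problems keyed by type (the paper used 40 in total).
problems = {
    "well_specified": [
        ("A 2.0 kg cart accelerates at 3.0 m/s^2. What is the net force on it?",
         "6.0 N"),
    ],
    "under_specified": [
        ("Estimate the average force on a driver when a car hits a wall.",
         None),  # no single reference answer; the study graded these by rubric
    ],
}

for category, items in problems.items():
    correct = 0
    for question, reference in items:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": question}],
        )
        answer = response.choices[0].message.content
        # Naive stand-in for manual grading: exact substring match.
        if reference is not None and reference in answer:
            correct += 1
    print(f"{category}: {correct}/{len(items)} correct")
```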
Related papers
- Large Language Models and Mathematical Reasoning Failures [1.6114012813668932]
This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems.
We rigorously analyze both final answers and solution steps to identify reasoning failures.
We find that while newer models (e.g., o3-mini, deepseek-r1) achieve higher accuracy, all models exhibit errors in spatial reasoning, strategic planning, and arithmetic.
arXiv Detail & Related papers (2025-02-17T09:07:32Z)
- Diverse Inference and Verification for Advanced Reasoning [19.88677753421871]
Reasoning LLMs such as OpenAI o1, o3 and DeepSeek R1 have made significant progress in mathematics and coding.
We use a diverse inference approach that combines multiple models and methods at test time.
We find that verifying mathematics and code problems, and applying rejection sampling to other problems, is simple and effective; a minimal sketch of such a verify-and-reject loop appears after this list.
arXiv Detail & Related papers (2025-02-14T07:22:25Z)
- MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations [90.07275414500154]
We observe significant performance drops on MATH-P-Hard across various models.
We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills without assessing their applicability to the modified problems.
arXiv Detail & Related papers (2025-02-10T13:31:46Z)
- The Impossible Test: A 2024 Unsolvable Dataset and A Chance for an AGI Quiz [0.0]
We evaluate large language models' (LLMs) ability to acknowledge uncertainty on 675 fundamentally unsolvable problems.
The best models scored 62-68% accuracy at acknowledging that a problem's solution was unknown, across fields ranging from biology to philosophy and mathematics.
arXiv Detail & Related papers (2024-11-20T04:12:29Z)
- Reasoning Paths Optimization: Learning to Reason and Explore From Diverse Paths [69.39559168050923]
We introduce Reasoning Paths Optimization (RPO), which enables learning to reason and explore from diverse paths.
Our approach encourages favorable branches at each reasoning step while penalizing unfavorable ones, enhancing the model's overall problem-solving performance (a loose illustration of this branch-preference idea appears after this list).
We focus on multi-step reasoning tasks, such as math word problems and science-based exam questions.
arXiv Detail & Related papers (2024-10-07T06:37:25Z)
- Evaluation of OpenAI o1: Opportunities and Challenges of AGI [112.0812059747033]
o1-preview demonstrated remarkable capabilities, often achieving human-level or superior performance.
The model excelled in tasks requiring intricate reasoning and knowledge integration across various fields.
Overall results indicate significant progress towards artificial general intelligence.
arXiv Detail & Related papers (2024-09-27T06:57:00Z)
- SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories [55.161075901665946]
SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories.
Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub-problems derived from the expert set that focus on specific challenges, and 602 automatically generated problems for larger-scale development.
We show that state-of-the-art approaches struggle to solve these problems, with the best model (GPT-4o) solving only 16.3% of the end-to-end set and 46.1% of the scenarios.
arXiv Detail & Related papers (2024-09-11T17:37:48Z)
- OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI [73.75520820608232]
We introduce OlympicArena, which includes 11,163 bilingual problems across both text-only and interleaved text-image modalities.
These challenges encompass a wide range of disciplines spanning seven fields and 62 international Olympic competitions, rigorously examined for data leakage.
Our evaluations reveal that even advanced models like GPT-4o only achieve a 39.97% overall accuracy, illustrating current AI limitations in complex reasoning and multimodal integration.
arXiv Detail & Related papers (2024-06-18T16:20:53Z)
- Competition-Level Problems are Effective LLM Evaluators [121.15880285283116]
This paper aims to evaluate the reasoning capacities of large language models (LLMs) in solving recent programming problems in Codeforces.
We first provide a comprehensive evaluation of GPT-4's perceived zero-shot performance on this task, considering various aspects such as problems' release time, difficulties, and types of errors encountered.
Surprisingly, the perceived performance of GPT-4 has experienced a cliff-like decline on problems released after September 2021, consistently across all difficulties and types of problems.
arXiv Detail & Related papers (2023-12-04T18:58:57Z)
- Using Large Language Model to Solve and Explain Physics Word Problems Approaching Human Level [0.0]
A large language model (LLM) pre-trained on text can solve not only pure math word problems but also physics word problems.
Our work is the first research to focus on the automatic solving, explanation, and generation of physics word problems.
arXiv Detail & Related papers (2023-09-15T06:13:06Z)
- Extending the Frontier of ChatGPT: Code Generation and Debugging [0.0]
ChatGPT, developed by OpenAI, has ushered in a new era by utilizing artificial intelligence (AI) to tackle diverse problem domains.
This research paper delves into the efficacy of ChatGPT in solving programming problems, examining both the correctness and the efficiency of its solutions in terms of time and memory complexity.
The research reveals a commendable overall success rate of 71.875%, denoting the proportion of problems for which ChatGPT was able to provide correct solutions.
arXiv Detail & Related papers (2023-07-17T06:06:58Z)
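For the "Diverse Inference and Verification" entry above, here is a minimal sketch of test-time rejection sampling with a verifier. `sample_solution` and `verify` are hypothetical stand-ins: in a real pipeline, the first would call one or more LLMs and the second would, for example, execute generated code against tests or check a final math answer.

```python
# A minimal sketch of test-time rejection sampling with verification, in
# the spirit of the "Diverse Inference and Verification" entry above.
# sample_solution and verify are hypothetical stand-ins for model calls
# and answer checking.
import random
from collections import Counter

def sample_solution(problem: str) -> str:
    """Hypothetical sampler; replace with an actual model call."""
    return random.choice(["42", "41", "42", "error"])

def verify(problem: str, candidate: str) -> bool:
    """Hypothetical verifier, e.g. run unit tests or re-check the answer."""
    return candidate.isdigit()

def solve(problem: str, n_samples: int = 8) -> str | None:
    # Rejection sampling: draw candidates, keep only those that verify.
    verified = [c for c in (sample_solution(problem) for _ in range(n_samples))
                if verify(problem, c)]
    if not verified:
        return None
    # Among verified candidates, fall back to a majority vote.
    return Counter(verified).most_common(1)[0][0]

print(solve("What is 6 * 7?"))
```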
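For the "Reasoning Paths Optimization" entry, here is a loose illustration of the summarized idea, preferring favorable over unfavorable branches at each reasoning step, using a DPO-style pairwise logistic loss. This is a sketch of the general pattern under stated assumptions, not the paper's exact objective, and the log-probabilities are placeholder numbers.

```python
# A loose, hypothetical illustration of branch-preference training as
# summarized above: at each reasoning step, score the favorable branch
# above the unfavorable one with a pairwise logistic loss. The real RPO
# objective is defined in the paper; the log-probs here are placeholders
# that would come from the model being trained.
import math

def pairwise_branch_loss(logp_favorable: float, logp_unfavorable: float,
                         beta: float = 1.0) -> float:
    """-log sigmoid(beta * (logp_fav - logp_unfav)); lower when the
    favorable branch is more likely under the model."""
    margin = beta * (logp_favorable - logp_unfavorable)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Example: three reasoning steps, each with a (favorable, unfavorable)
# branch log-probability pair under the current model.
steps = [(-1.2, -2.5), (-0.8, -0.9), (-2.0, -1.5)]
total = sum(pairwise_branch_loss(f, u) for f, u in steps)
print(f"total branch-preference loss: {total:.3f}")
```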