Exploring the MIT Mathematics and EECS Curriculum Using Large Language Models
- URL: http://arxiv.org/abs/2306.08997v2
- Date: Sat, 24 Jun 2023 12:39:06 GMT
- Title: Exploring the MIT Mathematics and EECS Curriculum Using Large Language Models
- Authors: Sarah J. Zhang, Samuel Florin, Ariel N. Lee, Eamon Niknafs, Andrei
Marginean, Annie Wang, Keith Tyser, Zad Chin, Yann Hicke, Nikhil Singh,
Madeleine Udell, Yoon Kim, Tonio Buonassisi, Armando Solar-Lezama, Iddo Drori
- Abstract summary: We evaluate the ability of large language models to fulfill the graduation requirements for any MIT major in Mathematics and EECS.
Our results demonstrate that GPT-3.5 successfully solves a third of the entire MIT curriculum, while GPT-4, with prompt engineering, achieves a perfect solve rate on a test set excluding questions based on images.
- Score: 21.86774454216937
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We curate a comprehensive dataset of 4,550 questions and solutions from
problem sets, midterm exams, and final exams across all MIT Mathematics and
Electrical Engineering and Computer Science (EECS) courses required for
obtaining a degree. We evaluate the ability of large language models to fulfill
the graduation requirements for any MIT major in Mathematics and EECS. Our
results demonstrate that GPT-3.5 successfully solves a third of the entire MIT
curriculum, while GPT-4, with prompt engineering, achieves a perfect solve rate
on a test set excluding questions based on images. We fine-tune an open-source
large language model on this dataset. We employ GPT-4 to automatically grade
model responses, providing a detailed performance breakdown by course,
question, and answer type. By embedding questions in a low-dimensional space,
we explore the relationships between questions, topics, and classes and
discover which questions and classes are required for solving other questions
and classes through few-shot learning. Our analysis offers valuable insights
into course prerequisites and curriculum design, highlighting language models'
potential for learning and improving Mathematics and EECS education.
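The GPT-4 auto-grading step described in the abstract lends itself to a short sketch. Below is a minimal, hypothetical version of such a grading loop using the openai Python package (pre-1.0 API); the rubric prompt, binary grade scale, and zero-temperature setting are illustrative assumptions, not the authors' exact pipeline.

```python
# Hypothetical GPT-4-as-grader loop; assumes openai<1.0 and an
# OPENAI_API_KEY environment variable. Not the paper's exact prompts.
import openai

GRADING_PROMPT = """You are grading a student answer to an exam question.
Question: {question}
Reference solution: {solution}
Student answer: {answer}
Reply with a single word: CORRECT or INCORRECT."""

def grade(question: str, solution: str, answer: str) -> bool:
    """Ask GPT-4 to compare a model answer against the reference solution."""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,  # deterministic grading
        messages=[{
            "role": "user",
            "content": GRADING_PROMPT.format(
                question=question, solution=solution, answer=answer),
        }],
    )
    return response.choices[0].message.content.strip().upper() == "CORRECT"
```

A per-course performance breakdown then reduces to averaging grade() over each course's questions.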
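The question-embedding analysis can be sketched in the same spirit. The toy example below encodes a few questions with a sentence-transformers model and projects them to 2-D with t-SNE; the encoder and projection method are assumptions for illustration, since the abstract does not name the paper's exact tools.

```python
# Toy question-embedding visualization; the model choice and t-SNE are
# illustrative assumptions, not the paper's stated method.
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

questions = [
    "Compute the eigenvalues of a 2x2 symmetric matrix.",
    "Prove that the sum of two independent Gaussians is Gaussian.",
    "Write a recursive function that reverses a linked list.",
    "Analyze the time complexity of mergesort.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(questions)  # shape: (n_questions, 384)

# Perplexity must be smaller than the sample count for this toy set.
coords = TSNE(n_components=2, perplexity=2,
              random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), q in zip(coords, questions):
    plt.annotate(q[:30], (x, y), fontsize=7)
plt.title("Toy 2-D embedding of exam questions")
plt.show()
```

On the full dataset, nearby points would correspond to questions sharing topics or course material, which is what makes prerequisite structure visible in such a plot.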
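Finally, the few-shot prerequisite analysis amounts to testing whether solved exemplars from one class raise the solve rate on another. A hypothetical helper for assembling such prompts follows; the format and names are assumptions, not the paper's.

```python
# Hypothetical few-shot prompt assembly for prerequisite discovery.
def build_few_shot_prompt(exemplars, target_question):
    """Prepend solved (question, solution) pairs from a candidate
    prerequisite course before the target question."""
    parts = [f"Question: {q}\nSolution: {s}\n" for q, s in exemplars]
    parts.append(f"Question: {target_question}\nSolution:")
    return "\n".join(parts)
```

Comparing solve rates on course B with and without exemplars drawn from course A gives a rough signal of whether A functions as a prerequisite for B.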
Related papers
- Could ChatGPT get an Engineering Degree? Evaluating Higher Education Vulnerability to AI Assistants [175.9723801486487]
We evaluate whether two AI assistants, GPT-3.5 and GPT-4, can adequately answer assessment questions.
GPT-4 answers an average of 65.8% of questions correctly, and produces the correct answer under at least one prompting strategy for 85.1% of questions.
Our results call for revising program-level assessment design in higher education in light of advances in generative AI.
arXiv Detail & Related papers (2024-08-07T12:11:49Z)
- SceMQA: A Scientific College Entrance Level Multimodal Question Answering Benchmark [42.91902601376494]
The paper introduces SceMQA, a novel benchmark for scientific multimodal question answering at the college entrance level.
SceMQA focuses on core science subjects including Mathematics, Physics, Chemistry, and Biology.
It features a blend of multiple-choice and free-response formats, ensuring a comprehensive evaluation of AI models' abilities.
arXiv Detail & Related papers (2024-02-06T19:16:55Z)
- MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts [170.01089233942594]
MathVista is a benchmark designed to combine challenges from diverse mathematical and visual tasks.
The best-performing GPT-4V model achieves an overall accuracy of 49.9%, substantially outperforming Bard, the second-best performer, by 15.1%.
GPT-4V still falls short of human performance by 10.4%, as it often struggles to understand complex figures and perform rigorous reasoning.
arXiv Detail & Related papers (2023-10-03T17:57:24Z)
- CMATH: Can Your Language Model Pass Chinese Elementary School Math Test? [15.53530547827583]
We present the Chinese Elementary School Math Word Problems dataset, comprising 1.7k elementary school-level math word problems with detailed annotations.
This dataset aims to provide a benchmark tool for assessing the abilities of popular large language models (LLMs).
We evaluate a variety of popular LLMs, including both commercial and open-source options, and discover that only GPT-4 achieves success (accuracy $\geq$ 60%) across all six elementary school grades.
arXiv Detail & Related papers (2023-06-29T02:19:50Z)
- Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission Exams [4.2706617195518195]
This study explores the capabilities of Language Models (LMs) in tackling high-stakes multiple-choice tests, using Brazil's national university admission exam (the ENEM).
The exam poses challenging tasks for LMs, since its questions may span multiple fields of knowledge.
The best-performing model, GPT-4 with Chain-of-Thought prompts, achieved an accuracy of 87%, surpassing GPT-3.5 by 11 points.
arXiv Detail & Related papers (2023-03-29T20:10:13Z)
- Lila: A Unified Benchmark for Mathematical Reasoning [59.97570380432861]
LILA is a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions.
We construct the benchmark by extending 20 existing datasets, collecting task instructions and solutions in the form of Python programs (a toy example of this format appears below).
We introduce BHASKARA, a general-purpose mathematical reasoning model trained on LILA.
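As a toy illustration of the solutions-as-Python-programs format, a LILA-style entry pairs a word problem with a program whose return value is the answer. The problem below is invented, not drawn from LILA.

```python
# Invented LILA-style example: the solution is an executable program.
def solution():
    # Problem: "A shelf holds 3 rows of 12 books; 7 are removed.
    # How many books remain?"
    total = 3 * 12  # books before removal
    return total - 7

print(solution())  # 29
```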
arXiv Detail & Related papers (2022-10-31T17:41:26Z)
- Knowledge-in-Context: Towards Knowledgeable Semi-Parametric Language Models [58.42146641102329]
We develop Knowledge-in-Context (KiC), a novel semi-parametric language model architecture.
KiC empowers a parametric text-to-text language model with a knowledge-rich external memory.
As a knowledge-rich semi-parametric language model, KiC needs only a much smaller parametric part to achieve superior zero-shot performance on unseen tasks.
arXiv Detail & Related papers (2022-10-28T23:18:43Z)
- Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [124.16250115608604]
We present Science Question Answering (ScienceQA), a new benchmark consisting of 21k multimodal multiple-choice questions spanning a diverse set of science topics, with answers annotated with corresponding lectures and explanations.
We show that generating explanations as a chain of thought improves question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA.
Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data.
arXiv Detail & Related papers (2022-09-20T07:04:24Z)
- Solving Quantitative Reasoning Problems with Language Models [53.53969870599973]
We introduce Minerva, a large language model pretrained on general natural language data and further trained on technical content.
The model achieves state-of-the-art performance on technical benchmarks without the use of external tools.
We also evaluate our model on over two hundred undergraduate-level problems in physics, biology, chemistry, economics, and other sciences.
arXiv Detail & Related papers (2022-06-29T18:54:49Z)
- From Human Days to Machine Seconds: Automatically Answering and Generating Machine Learning Final Exams [10.25071232250652]
A final exam in machine learning at a top institution such as MIT, Harvard, or Cornell typically takes faculty days to write, and students hours to solve.
We demonstrate that large language models pass machine learning finals at a human level, on finals available online after the models were trained, and automatically generate new human-quality final exam questions in seconds.
arXiv Detail & Related papers (2022-06-11T06:38:06Z)