Evaluating Language Models for Mathematics through Interactions
- URL: http://arxiv.org/abs/2306.01694v2
- Date: Sun, 5 Nov 2023 19:12:20 GMT
- Title: Evaluating Language Models for Mathematics through Interactions
- Authors: Katherine M. Collins and Albert Q. Jiang and Simon Frieder and Lionel
Wong and Miri Zilka and Umang Bhatt and Thomas Lukasiewicz and Yuhuai Wu and
Joshua B. Tenenbaum and William Hart and Timothy Gowers and Wenda Li and
Adrian Weller and Mateja Jamnik
- Abstract summary: We introduce CheckMate, a prototype platform for humans to interact with and evaluate large language models (LLMs).
We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics.
We derive a taxonomy of human behaviours and uncover that despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness.
- Score: 116.67206980096513
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There is much excitement about the opportunity to harness the power of large
language models (LLMs) when building problem-solving assistants. However, the
standard methodology of evaluating LLMs relies on static pairs of inputs and
outputs, and is insufficient for making an informed decision about which LLMs
can be sensibly used, and in which assistive settings. Static assessment
fails to account for the essential interactive element in LLM deployment, and
therefore limits how we understand language model capabilities. We introduce
CheckMate, an adaptable prototype platform for humans to interact with and
evaluate LLMs. We conduct a study with CheckMate to evaluate three language
models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving
undergraduate-level mathematics, with a mixed cohort of participants from
undergraduate students to professors of mathematics. We release the resulting
interaction and rating dataset, MathConverse. By analysing MathConverse, we
derive a taxonomy of human behaviours and uncover that despite a generally
positive correlation, there are notable instances of divergence between
correctness and perceived helpfulness in LLM generations, amongst other
findings. Further, we garner a more granular understanding of GPT-4
mathematical problem-solving through a series of case studies, contributed by
expert mathematicians. We conclude with actionable takeaways for ML
practitioners and mathematicians: models that communicate uncertainty, respond
well to user corrections, and are more interpretable and concise may constitute
better assistants. Interactive evaluation is a promising way to navigate the
capability of these models; humans should be aware of language models'
algebraic fallibility and discern where they are appropriate to use.
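The core of this methodology is an interactive loop: a participant works through a problem with a model and rates each response as the conversation unfolds. Below is a minimal sketch of such a CheckMate-style session in Python; the query_model callable, the 0-6 rating prompts, and the record fields are illustrative assumptions, not the released CheckMate implementation or the MathConverse schema.

```python
# A minimal sketch of a CheckMate-style interactive evaluation session.
# Assumptions: `query_model` is any chat-completion callable; the rating scales
# and record fields are illustrative, not the actual CheckMate/MathConverse schema.
import json
from dataclasses import asdict, dataclass, field
from typing import Callable, Dict, List


@dataclass
class InteractionRecord:
    problem: str
    turns: List[Dict] = field(default_factory=list)  # prompt, response, ratings


def run_session(problem: str,
                query_model: Callable[[List[Dict]], str],
                ask: Callable[[str], str] = input) -> InteractionRecord:
    """Collect a multi-turn interaction plus per-turn human ratings."""
    record = InteractionRecord(problem=problem)
    history: List[Dict] = [{"role": "user", "content": problem}]
    while True:
        response = query_model(history)  # model's next assistance step
        history.append({"role": "assistant", "content": response})
        print(f"\nAssistant:\n{response}\n")
        # Ordinal ratings (0-6 here is illustrative; the actual questionnaires may differ).
        correctness = int(ask("Rate mathematical correctness (0-6): "))
        helpfulness = int(ask("Rate perceived helpfulness (0-6): "))
        record.turns.append({"prompt": history[-2]["content"],
                             "response": response,
                             "correctness": correctness,
                             "helpfulness": helpfulness})
        follow_up = ask("Next message (empty to finish): ").strip()
        if not follow_up:
            return record
        history.append({"role": "user", "content": follow_up})


if __name__ == "__main__":
    # Stub model so the sketch runs without any API access.
    demo_model = lambda history: "Consider applying the triangle inequality here."
    session = run_session("Prove that |a + b| <= |a| + |b| for all real a, b.", demo_model)
    print(json.dumps(asdict(session), indent=2))
```

Logging per-turn ratings alongside the full conversation history is what makes it possible to study where correctness and perceived helpfulness diverge, as the paper reports.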
Related papers
- Assessing the Emergent Symbolic Reasoning Abilities of Llama Large Language Models [47.129504708849446]
Large Language Models (LLMs) achieve impressive performance in a wide range of tasks.
LLMs show emergent abilities in mathematical reasoning benchmarks.
We evaluate three models of the Llama 2 family on different symbolic reasoning tasks.
arXiv Detail & Related papers (2024-06-05T12:22:43Z)
- MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions [58.57255822646756]
This paper introduces MathChat, a benchmark designed to evaluate large language models (LLMs) across a broader spectrum of mathematical tasks.
We evaluate the performance of various SOTA LLMs on the MathChat benchmark, and we observe that while these models excel in single-turn question answering, they significantly underperform in more complex scenarios.
We develop MathChat sync, a synthetic dialogue-based math dataset for LLM finetuning, focusing on improving models' interaction and instruction-following capabilities in conversations.
arXiv Detail & Related papers (2024-05-29T18:45:55Z)
- ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline [42.61538071832468]
Large language models (LLMs) have shown excellent mastering of human language, but still struggle in real-world applications that require mathematical problem-solving.
We tailor the Self-Critique pipeline, which addresses the challenge in the feedback learning stage of LLM alignment.
arXiv Detail & Related papers (2024-04-03T17:51:18Z)
- GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
One essential and frequently observed piece of evidence is that when math questions are slightly changed, LLMs can behave incorrectly.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations (a minimal probe of this kind is sketched after this list).
arXiv Detail & Related papers (2024-02-29T15:26:14Z)
- Towards Understanding Counseling Conversations: Domain Knowledge and Large Language Models [22.588557390720236]
This paper proposes a systematic approach to examine the efficacy of domain knowledge and large language models (LLMs) in better representing counseling conversations.
We empirically show that state-of-the-art language models such as Transformer-based models and GPT models fail to predict the conversation outcome.
arXiv Detail & Related papers (2024-02-22T01:02:37Z)
- Large Language Models for Mathematicians [53.27302720305432]
Large language models (LLMs) have received immense interest for their general-purpose language understanding and, in particular, their ability to generate high-quality text or computer code.
In this note, we discuss to what extent they can aid professional mathematicians.
arXiv Detail & Related papers (2023-12-07T18:59:29Z)
- CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large Language Models for Data Annotation [94.59630161324013]
We propose CoAnnotating, a novel paradigm for Human-LLM co-annotation of unstructured texts at scale.
Our empirical study on different datasets shows CoAnnotating to be an effective means of allocating work, with up to a 21% performance improvement over a random baseline.
arXiv Detail & Related papers (2023-10-24T08:56:49Z)
- Democratizing Reasoning Ability: Tailored Learning from Large Language Model [97.4921006089966]
We propose a tailored learning approach to distill such reasoning ability into smaller LMs.
We exploit the potential of LLM as a reasoning teacher by building an interactive multi-round learning paradigm.
To exploit the reasoning potential of the smaller LM, we propose self-reflection learning to motivate the student to learn from self-made mistakes.
arXiv Detail & Related papers (2023-10-20T07:50:10Z)
- Do LLMs Understand Social Knowledge? Evaluating the Sociability of Large Language Models with SocKET Benchmark [14.922083834969323]
Large language models (LLMs) have been shown to perform well at a variety of syntactic, discourse, and reasoning tasks.
We introduce a new theory-driven benchmark, SocKET, that contains 58 NLP tasks testing social knowledge.
arXiv Detail & Related papers (2023-05-24T09:21:06Z)
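The GSM-Plus entry above motivates a simple robustness probe: renumber or rephrase a question and inspect whether the model's answers remain sensible. The sketch below (referenced from that entry) is a minimal illustration in Python; the solve callable and the naive numeric-rewrite perturbation are assumptions for demonstration, whereas GSM-Plus itself uses a curated taxonomy of question variations, and each perturbed variant needs its own recomputed ground-truth answer for checking.

```python
# A GSM-Plus-inspired robustness probe (illustrative): perturb the numbers in a
# math word problem and collect the model's answer on each variant. A reference
# solver or a human then checks each answer against its recomputed ground truth.
# `solve` and the naive numeric rewrite are assumptions, not the GSM-Plus method.
import random
import re
from typing import Callable, Dict, List, Tuple


def perturb_numbers(question: str, rng: random.Random) -> Tuple[str, Dict[int, int]]:
    """Replace each integer in the question with a nearby value."""
    mapping: Dict[int, int] = {}

    def repl(match: re.Match) -> str:
        original = int(match.group())
        mapping[original] = original + rng.randint(1, 5)
        return str(mapping[original])

    return re.sub(r"\d+", repl, question), mapping


def robustness_probe(question: str,
                     solve: Callable[[str], str],
                     n_variants: int = 5,
                     seed: int = 0) -> List[Dict]:
    """Ask the model the original question and several perturbed variants."""
    rng = random.Random(seed)
    results: List[Dict] = [{"question": question, "answer": solve(question), "variant": False}]
    for _ in range(n_variants):
        variant, mapping = perturb_numbers(question, rng)
        results.append({"question": variant, "answer": solve(variant),
                        "variant": True, "number_mapping": mapping})
    return results


if __name__ == "__main__":
    solve = lambda q: "42"  # stub; swap in an LLM call to use this for real
    for row in robustness_probe("Alice has 12 apples and buys 7 more. How many apples now?", solve):
        print(row)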
This list is automatically generated from the titles and abstracts of the papers in this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.