Who's the Best Detective? LLMs vs. MLs in Detecting Incoherent Fourth
Grade Math Answers
- URL: http://arxiv.org/abs/2304.11257v1
- Date: Fri, 21 Apr 2023 21:25:30 GMT
- Title: Who's the Best Detective? LLMs vs. MLs in Detecting Incoherent Fourth
Grade Math Answers
- Authors: Felipe Urrutia and Roberto Araya
- Abstract summary: We analyze the responses of fourth graders in mathematics using three Large Language Models (LLMs).
We found that LLMs perform worse than Machine Learning (ML) classifiers in detecting incoherent answers.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Written answers to open-ended questions can have a higher long-term effect on
learning than multiple-choice questions. However, it is critical that teachers
review the answers immediately and ask students to redo those that are incoherent.
This can be a difficult and time-consuming task for teachers. A possible solution
is to automate the detection of incoherent answers, for instance by automating the
review with Large Language Models (LLMs). In this paper, we analyze
the responses of fourth graders in mathematics using three LLMs: GPT-3, BLOOM,
and YOU. We used them with zero, one, two, three, and four shots. We compared
their performance with the results of several classifiers trained with Machine
Learning (ML). We found that the LLMs perform worse than the ML classifiers in
detecting incoherent answers. The difficulty seems to reside in recursive questions
that contain both questions and answers, and in responses from students with typical
fourth-grader misspellings. Upon closer examination, we found that the ChatGPT
model faces the same challenges.
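To make the comparison concrete, below is a minimal sketch of the two detection strategies the abstract describes: few-shot prompting an LLM versus a classical ML text classifier. The prompt wording, the example questions and answers, the training data, and the helper names are hypothetical placeholders, not the authors' actual prompts, features, or classifiers.
```python
# Sketch of the two approaches compared in the paper: (1) few-shot prompting
# an LLM to flag incoherent answers, and (2) a classical ML text classifier.
# All texts, labels, and prompt wording below are hypothetical placeholders.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# --- (1) k-shot prompt construction for an LLM coherence check ------------
FEW_SHOT_EXAMPLES = [  # hypothetical (question, answer, label) triples
    ("How many apples are in 3 bags of 4 apples?", "12 apples", "coherent"),
    ("How many apples are in 3 bags of 4 apples?", "i like dogs", "incoherent"),
]

def build_prompt(question: str, answer: str, k: int = 2) -> str:
    """Build a zero/few-shot prompt asking the model to label an answer."""
    lines = ["Decide if the student's answer is coherent with the question.",
             "Reply with exactly one word: coherent or incoherent.", ""]
    for q, a, label in FEW_SHOT_EXAMPLES[:k]:          # k = 0 gives a zero-shot prompt
        lines += [f"Question: {q}", f"Answer: {a}", f"Label: {label}", ""]
    lines += [f"Question: {question}", f"Answer: {answer}", "Label:"]
    return "\n".join(lines)

# In the paper this prompt would be sent to GPT-3, BLOOM, or YOU; here we
# only print it, since the API clients are outside the scope of this sketch.
print(build_prompt("What is 7 + 5?", "twelv"))          # misspelled but coherent

# --- (2) a simple ML baseline: TF-IDF features + logistic regression ------
train_texts = ["12 apples", "i like dogs", "7 plus 5 is 12", "asdf qwer"]
train_labels = [1, 0, 1, 0]                             # 1 = coherent, 0 = incoherent

clf = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                    LogisticRegression())
clf.fit(train_texts, train_labels)
print(clf.predict(["twelv"]))
```
With k = 0 the helper produces a zero-shot prompt, mirroring the zero- to four-shot settings reported in the abstract; the character n-gram features in the ML baseline are one common way to cope with the kind of misspellings the paper highlights.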
Related papers
- Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above [14.5781090243416]
Multiple choice question answering (MCQA) is popular for LLM evaluation due to its simplicity and human-like testing.
We first reveal flaws in MCQA's format, as it struggles to: 1) test generation/subjectivity; 2) match LLM use cases; and 3) fully test knowledge.
arXiv Detail & Related papers (2025-02-19T22:11:52Z)
- Comparison of Large Language Models for Generating Contextually Relevant Questions [6.080820450677854]
GPT-3.5, Llama 2-Chat 13B, and T5 XXL are compared in their ability to create questions from university slide text without fine-tuning.
Results indicate that GPT-3.5 and Llama 2-Chat 13B outperform T5 XXL by a small margin, particularly in terms of clarity and question-answer alignment.
arXiv Detail & Related papers (2024-07-30T06:23:59Z)
- Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange [25.419977967846144]
Large Language Models (LLMs) have demonstrated exceptional capabilities in various natural language tasks.
This paper explores the current limitations of LLMs in navigating complex mathematical problem-solving.
arXiv Detail & Related papers (2024-03-30T12:48:31Z)
- MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? [99.0305256706604]
We introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs.
We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources.
This approach allows MathVerse to comprehensively assess whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning.
arXiv Detail & Related papers (2024-03-21T17:59:50Z)
- Benchmarking Hallucination in Large Language Models based on Unanswerable Math Word Problem [58.3723958800254]
Large language models (LLMs) are highly effective in various natural language processing (NLP) tasks.
However, they are susceptible to producing unreliable conjectures in ambiguous contexts, a phenomenon known as hallucination.
This paper presents a new method for evaluating LLM hallucination in Question Answering (QA) based on unanswerable math word problems (MWPs).
arXiv Detail & Related papers (2024-03-06T09:06:34Z)
- GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
However, a frequently observed problem is that when math questions are slightly changed, LLMs can behave incorrectly.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z)
- The Earth is Flat? Unveiling Factual Errors in Large Language Models [89.94270049334479]
Large Language Models (LLMs) like ChatGPT are used in various applications due to their extensive knowledge from pre-training and fine-tuning.
Despite this, they are prone to generating factual and commonsense errors, raising concerns in critical areas like healthcare, journalism, and education.
We introduce a novel, automatic testing framework, FactChecker, aimed at uncovering factual inaccuracies in LLMs.
arXiv Detail & Related papers (2024-01-01T14:02:27Z)
- Improving Zero-shot Visual Question Answering via Large Language Models with Reasoning Question Prompts [22.669502403623166]
We present Reasoning Question Prompts for VQA tasks, which can further activate the potential of Large Language Models.
We generate self-contained questions as reasoning question prompts via an unsupervised question editing module.
Each reasoning question prompt clearly indicates the intent of the original question.
Then, the candidate answers, together with their confidence scores, are fed into the LLMs.
arXiv Detail & Related papers (2023-11-15T15:40:46Z)
- SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning [55.76083560152823]
SelfCheck is a general-purpose zero-shot verification schema for recognizing errors in step-by-step reasoning.
We test SelfCheck on three datasets (GSM8K, MathQA, and MATH) and find that it successfully recognizes errors and, in turn, increases final answer accuracies.
arXiv Detail & Related papers (2023-08-01T10:31:36Z)
- Exploring the Responses of Large Language Models to Beginner Programmers' Help Requests [1.8260333137469122]
We assess how good large language models (LLMs) are at identifying issues in problematic code that students request help on.
We collected a sample of help requests and code from an online programming course.
arXiv Detail & Related papers (2023-06-09T07:19:43Z)
- Large Language Models are Better Reasoners with Self-Verification [48.534270563880845]
Large language models (LLMs) have shown strong reasoning ability in several natural language processing tasks.
LLMs with chain of thought (CoT) prompting require multi-step prompting and multi-token prediction, which is highly sensitive to individual mistakes.
We propose and prove that LLMs also have similar self-verification abilities.
arXiv Detail & Related papers (2022-12-19T15:51:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.