Related papers: LLaMa-SciQ: An Educational Chatbot for Answering Science MCQ

LLaMa-SciQ: An Educational Chatbot for Answering Science MCQ

URL: http://arxiv.org/abs/2409.16779v1
Date: Wed, 25 Sep 2024 09:41:46 GMT
Title: LLaMa-SciQ: An Educational Chatbot for Answering Science MCQ
Authors: Marc-Antoine Allard, Matin Ansaripour, Maria Yuffa, Paul Teiletche,
Abstract summary: Large Language Models (LLMs) often struggle with tasks requiring mathematical reasoning, particularly multiple-choice questions (MCQs) We developed LLaMa-SciQ to assist college students in solving and understanding MCQs in STEM fields. For mathematical reasoning, LLaMa-SciQ achieved 74.5% accuracy on the GSM8k dataset and 30% on the MATH dataset.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) often struggle with tasks requiring mathematical reasoning, particularly multiple-choice questions (MCQs). To address this issue, we developed LLaMa-SciQ, an educational chatbot designed to assist college students in solving and understanding MCQs in STEM fields. We begin by fine-tuning and aligning the models to human preferences. After comparing the performance of Mistral-7B and LLaMa-8B, we selected the latter as the base model due to its higher evaluation accuracy. To further enhance accuracy, we implement Retrieval-Augmented Generation (RAG) and apply quantization to compress the model, reducing inference time and increasing accessibility for students. For mathematical reasoning, LLaMa-SciQ achieved 74.5% accuracy on the GSM8k dataset and 30% on the MATH dataset. However, RAG does not improve performance and even reduces it, likely due to retriever issues or the model's unfamiliarity with context. Despite this, the quantized model shows only a 5% loss in performance, demonstrating significant efficiency improvements.

Related papers

From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics [79.81905350372067]
We study gap through contextual mathematical reasoning.<n>We introduce ContextMATH, a benchmark that repurposes AIME and MATH-500 problems into two contextual settings.<n>Open-source models decline by 13 and 34 points on SG and CS, while proprietary models drop by 13 and 20.
arXiv Detail & Related papers (2026-01-30T14:56:04Z)
†DAGGER: Distractor-Aware Graph Generation for Executable Reasoning in Math Problems [1.2310602580215997]
Chain-of-Thought (CoT) prompting is widely adopted for mathematical problem solving, but its behavior under irrelevant context remains underexplored.<n>DisTRACTMATH-BN is a benchmark that augments MGSM and MSVAMP with semantically coherent but computationally irrelevant information.<n>DAGGER reformulates mathematical problem solving as executable computational graph generation with explicit modeling of distractor nodes.
arXiv Detail & Related papers (2026-01-11T10:51:03Z)
Think Right: Learning to Mitigate Under-Over Thinking via Adaptive, Attentive Compression [68.69801176669843]
We propose an online post-training RL method that prunes redundant steps and estimates difficulty.<n> TRAAC (Think Right with Adaptive, Attentive Compression) achieves an average absolute accuracy gain of 8.4%.<n>Although our models are trained on math datasets, they show accuracy and efficiency gains on out-of-distribution non-math datasets.
arXiv Detail & Related papers (2025-10-02T02:00:20Z)
Discriminative Policy Optimization for Token-Level Reward Models [55.98642069903191]
Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs)<n>Q-RM explicitly learns token-level Q-functions from preference data without relying on fine-grained annotations.<n>Reinforcement learning with Q-RM significantly enhances training efficiency, achieving convergence 12 times faster than ORM on GSM8K and 11 times faster than step-level PRM on MATH.
arXiv Detail & Related papers (2025-05-29T11:40:34Z)
InfiJanice: Joint Analysis and In-situ Correction Engine for Quantization-Induced Math Degradation in Large Language Models [39.257022875813284]
Large Language Models (LLMs) have demonstrated impressive performance on complex reasoning benchmarks such as GSM8K, MATH, and AIME.<n>Model quantization has emerged as a promising approach to reduce memory footprint and inference latency.<n>We show that quantization can degrade mathematical reasoning accuracy by up to 69.81%.
arXiv Detail & Related papers (2025-05-16T12:11:40Z)
Sample, Don't Search: Rethinking Test-Time Alignment for Language Models [55.2480439325792]
We introduce QAlign, a new test-time alignment approach. As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access.
arXiv Detail & Related papers (2025-04-04T00:41:40Z)
MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations [90.07275414500154]
We observe significant performance drops on MATH-P-Hard across various models. We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills.
arXiv Detail & Related papers (2025-02-10T13:31:46Z)
SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights [89.56181323849512]
We propose SuperCorrect, a framework that uses a large teacher model to supervise and correct both the reasoning and reflection processes of a smaller student model. In the first stage, we extract hierarchical high-level and detailed thought templates from the teacher model to guide the student model in eliciting more fine-grained reasoning thoughts. In the second stage, we introduce cross-model collaborative direct preference optimization (DPO) to enhance the self-correction abilities of the student model.
arXiv Detail & Related papers (2024-10-11T17:25:52Z)
Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On [55.449818944278526]
We introduce the Skywork-Math model series, supervised fine-tuned (SFT) on common 7B language models. Skywork-Math 7B has achieved impressive accuracies of 51.2% on the competition-level MATH benchmark. We provide several practical takeaways to enhance math reasoning abilities in LLMs for both research and industry applications.
arXiv Detail & Related papers (2024-07-11T09:56:51Z)
MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models. It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths. It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps. We show that ReasonEval achieves state-of-the-art performance on human-labeled datasets. We observe that ReasonEval can play a significant role in data selection.
arXiv Detail & Related papers (2024-04-08T17:18:04Z)
Common 7B Language Models Already Possess Strong Math Capabilities [61.61442513067561]
This paper shows that the LLaMA-2 7B model with common pre-training already exhibits strong mathematical abilities. The potential for extensive scaling is constrained by the scarcity of publicly available math questions.
arXiv Detail & Related papers (2024-03-07T18:00:40Z)
MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs [38.127313175508746]
MathGenie is a novel method for generating diverse and reliable math problems from a small-scale problem-solution dataset. Various pretrained models, ranging from 7B to 70B, are trained on the newly curated data to test the effectiveness of the proposed augmentation technique. MathGenieLM-InternLM2 achieves an accuracy of 87.7% on GSM8K and 55.7% on MATH, securing the best overall score among open-source language models.
arXiv Detail & Related papers (2024-02-26T07:17:25Z)
Augmenting Math Word Problems via Iterative Question Composing [8.186291374940595]
We introduce the MMIQC dataset, comprising a mixture of processed web data and synthetic question-response pairs. Qwen-72B-MMIQC achieves a 45.0% accuracy, exceeding the previous open-source state-of-the-art by 8.2%.
arXiv Detail & Related papers (2024-01-17T06:48:16Z)
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models [75.29595679428105]
We investigate how the pre-training loss, supervised data amount, and augmented data amount influence the reasoning performances of a supervised LLM. We find that rejection samples from multiple models push LLaMA-7B to an accuracy of 49.3% on GSM8K which outperforms the supervised fine-tuning (SFT) accuracy of 35.9% significantly.
arXiv Detail & Related papers (2023-08-03T15:34:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.