Related papers: Language Models as Science Tutors

Language Models as Science Tutors

URL: http://arxiv.org/abs/2402.11111v2
Date: Sun, 21 Jul 2024 11:11:49 GMT
Title: Language Models as Science Tutors
Authors: Alexis Chevalier, Jiayi Geng, Alexander Wettig, Howard Chen, Sebastian Mizera, Toni Annala, Max Jameson Aragon, Arturo Rodríguez Fanlo, Simon Frieder, Simon Machado, Akshara Prabhakar, Ellie Thieu, Jiachen T. Wang, Zirui Wang, Xindi Wu, Mengzhou Xia, Wenhan Xia, Jiatong Yu, Jun-Jie Zhu, Zhiyong Jason Ren, Sanjeev Arora, Danqi Chen,
Abstract summary: We introduce TutorEval and TutorChat to measure real-life usability of LMs as scientific assistants. We show that fine-tuning base models with existing dialogue datasets leads to poor performance on TutorEval. We use TutorChat to fine-tune Llemma models with 7B and 34B parameters. These LM tutors specialized in math have a 32K-token context window, and they excel at TutorEval while performing strongly on GSM8K and MATH.
Score: 79.73256703631492
License: http://creativecommons.org/licenses/by/4.0/
Abstract: NLP has recently made exciting progress toward training language models (LMs) with strong scientific problem-solving skills. However, model development has not focused on real-life use-cases of LMs for science, including applications in education that require processing long scientific documents. To address this, we introduce TutorEval and TutorChat. TutorEval is a diverse question-answering benchmark consisting of questions about long chapters from STEM textbooks, written by experts. TutorEval helps measure real-life usability of LMs as scientific assistants, and it is the first benchmark combining long contexts, free-form generation, and multi-disciplinary scientific knowledge. Moreover, we show that fine-tuning base models with existing dialogue datasets leads to poor performance on TutorEval. Therefore, we create TutorChat, a dataset of 80,000 long synthetic dialogues about textbooks. We use TutorChat to fine-tune Llemma models with 7B and 34B parameters. These LM tutors specialized in math have a 32K-token context window, and they excel at TutorEval while performing strongly on GSM8K and MATH. Our datasets build on open-source materials, and we release our models, data, and evaluations.

Related papers

Transforming Mentorship: An AI Powered Chatbot Approach to University Guidance [0.0]
This paper presents an AI-powered bot that will serve as a mentor for students of BRAC University.<n>The main component is a data ingestion pipeline that efficiently processes and updates information from diverse sources.<n>The generated text was found to be semantically highly relevant, with a BERTScore of 0.831 and a METEOR score of 0.809.
arXiv Detail & Related papers (2025-11-06T08:24:52Z)
Who Gets Cited Most? Benchmarking Long-Context Language Models on Scientific Articles [81.89404347890662]
SciTrek is a novel question-answering benchmark designed to evaluate the long-context reasoning capabilities of large language models (LLMs) using scientific articles.<n>Our analysis reveals systematic shortcomings in models' abilities to perform basic numerical operations and accurately locate specific information in long contexts.
arXiv Detail & Related papers (2025-09-25T11:36:09Z)
MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors [76.1634959528817]
We present MathTutorBench, an open-source benchmark for holistic tutoring model evaluation. MathTutorBench contains datasets and metrics that broadly cover tutor abilities as defined by learning sciences research in dialog-based teaching. We evaluate a wide set of closed- and open-weight models and find that subject expertise, indicated by solving ability, does not immediately translate to good teaching.
arXiv Detail & Related papers (2025-02-26T08:43:47Z)
MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
This dataset includes figures such as schematic diagrams, simulated images, macroscopic/microscopic photos, and experimental visualizations. We developed benchmarks for scientific figure captioning and multiple-choice questions, evaluating six proprietary and over ten open-source models. The dataset and benchmarks will be released to support further research.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
Cross-Data Knowledge Graph Construction for LLM-enabled Educational Question-Answering System: A Case Study at HCMUT [2.8000537365271367]
Large language models (LLMs) have emerged as a vibrant research topic. LLMs face challenges in remembering events, incorporating new information, and addressing domain-specific issues or hallucinations. This article proposes a method for automatically constructing a Knowledge Graph from multiple data sources.
arXiv Detail & Related papers (2024-04-14T16:34:31Z)
AutoTutor meets Large Language Models: A Language Model Tutor with Rich Pedagogy and Guardrails [43.19453208130667]
Large Language Models (LLMs) have found several use cases in education, ranging from automatic question generation to essay evaluation. In this paper, we explore the potential of using Large Language Models (LLMs) to author Intelligent Tutoring Systems. We create a sample end-to-end tutoring system named MWPTutor, which uses LLMs to fill in the state space of a pre-defined finite state transducer.
arXiv Detail & Related papers (2024-02-14T14:53:56Z)
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [139.69207791947738]
Dolma is a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials. We document Dolma, including its design principles, details about its construction, and a summary of its contents. We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices.
arXiv Detail & Related papers (2024-01-31T20:29:50Z)
Code Soliloquies for Accurate Calculations in Large Language Models [22.1024285108075]
High-quality conversational datasets are crucial for the successful development of Intelligent Tutoring Systems. These datasets are generated using advanced GPT-4 models. Our design orchestrates a mock conversation where both student and tutorbot roles are simulated by GPT-4. Our approach notably enhances the quality of synthetic conversation datasets, especially for subjects that are calculation-intensive.
arXiv Detail & Related papers (2023-09-21T15:16:58Z)
Simultaneous Machine Translation with Large Language Models [51.470478122113356]
We investigate the possibility of applying Large Language Models to SimulMT tasks. We conducted experiments using the textttLlama2-7b-chat model on nine different languages from the MUST-C dataset. The results show that LLM outperforms dedicated MT models in terms of BLEU and LAAL metrics.
arXiv Detail & Related papers (2023-09-13T04:06:47Z)
Evaluating Language Models for Mathematics through Interactions [116.67206980096513]
We introduce CheckMate, a prototype platform for humans to interact with and evaluate large language models (LLMs) We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics. We derive a taxonomy of human behaviours and uncover that despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness.
arXiv Detail & Related papers (2023-06-02T17:12:25Z)
MathDial: A Dialogue Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems [74.73881579517055]
We propose a framework to generate such dialogues by pairing human teachers with a Large Language Model prompted to represent common student errors. We describe how we use this framework to collect MathDial, a dataset of 3k one-to-one teacher-student tutoring dialogues.
arXiv Detail & Related papers (2023-05-23T21:44:56Z)
Context Matters: A Strategy to Pre-train Language Model for Science Education [4.053049694533914]
BERT-based language models have shown significant superiority over traditional NLP models in various language-related tasks. The language used by students is different from the language in journals and Wikipedia, which are training sources of BERT. Our study confirms the effectiveness of continual pre-training on domain-specific data in the education domain.
arXiv Detail & Related papers (2023-01-27T23:50:16Z)
Natural Language to Code Generation in Interactive Data Science Notebooks [35.621936471322385]
We build ARCADE, a benchmark of 1082 code generation problems using the pandas data analysis framework in data science notebooks. We develop PaChiNCo, a 62B code language model (LM) for Python computational notebooks, which significantly outperforms public code LMs.
arXiv Detail & Related papers (2022-12-19T05:06:00Z)
Few-Shot Bot: Prompt-Based Learning for Dialogue Systems [58.27337673451943]
Learning to converse using only a few examples is a great challenge in conversational AI. The current best conversational models are either good chit-chatters (e.g., BlenderBot) or goal-oriented systems (e.g., MinTL) We propose prompt-based few-shot learning which does not require gradient-based fine-tuning but instead uses a few examples as the only source of learning.
arXiv Detail & Related papers (2021-10-15T14:36:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.