"Did my figure do justice to the answer?" : Towards Multimodal Short Answer Grading with Feedback (MMSAF)
- URL: http://arxiv.org/abs/2412.19755v2
- Date: Sat, 15 Feb 2025 21:52:23 GMT
- Title: "Did my figure do justice to the answer?" : Towards Multimodal Short Answer Grading with Feedback (MMSAF)
- Authors: Pritam Sil, Bhaskaran Raman, Pushpak Bhattacharyya
- Abstract summary: We propose the Multimodal Short Answer Grading with Feedback problem along with a dataset of 2197 data points.
Our evaluations of existing Large Language Models (LLMs) on this dataset achieved an overall accuracy of 55% on the Level of Correctness labels.
As per human experts, Pixtral was more aligned with human judgement and values for biology, and ChatGPT for physics and chemistry.
- Score: 36.74896284581596
- License:
- Abstract: Assessments play a vital role in a student's learning process by providing feedback on the student's proficiency level in a subject. Assessments often make use of short answer questions, yet such questions are difficult to grade at a large scale. Moreover, these questions often require students to draw supporting diagrams along with their textual explanations, which promotes multimodal literacy and aligns with competency-based questions that demand deeper cognitive processing from students. However, the existing literature does not deal with the automatic grading of such answers. To bridge this gap, we propose the Multimodal Short Answer Grading with Feedback (MMSAF) problem along with a dataset of 2197 data points. Additionally, we provide an automated framework for generating such datasets. Our evaluations of existing Large Language Models (LLMs) on this dataset achieved an overall accuracy of 55% on the Level of Correctness labels and 75% on the Image Relevance labels. As per human experts, Pixtral was more aligned with human judgement and values for biology, and ChatGPT for physics and chemistry, with scores of 4 or more out of 5 on most parameters.
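To make the evaluation protocol concrete, here is a minimal sketch (not the authors' code) of how predictions on an MMSAF-style dataset could be scored against the two label types reported above. The field names and label values are assumptions for illustration; the dataset's actual schema is not specified in the abstract.

```python
# Minimal sketch of scoring an MMSAF-style evaluation run.
# Field and label names below are hypothetical, not the paper's actual schema.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MMSAFExample:
    question: str
    reference_answer: str
    student_answer_text: str
    student_diagram_path: Optional[str]  # path to the supporting figure, if any
    correctness_label: str               # e.g. "correct" / "partially correct" / "incorrect"
    image_relevance_label: str           # e.g. "relevant" / "irrelevant"


def accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of exact label matches."""
    assert len(gold) == len(pred) and gold, "gold and pred must be equal-length, non-empty"
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def evaluate(examples: List[MMSAFExample],
             pred_correctness: List[str],
             pred_relevance: List[str]) -> dict:
    """Report the two headline metrics from the abstract:
    accuracy on Level of Correctness and on Image Relevance labels."""
    return {
        "correctness_accuracy": accuracy(
            [e.correctness_label for e in examples], pred_correctness),
        "image_relevance_accuracy": accuracy(
            [e.image_relevance_label for e in examples], pred_relevance),
    }
```

Under this sketch, a run matching the paper's headline numbers would return approximately 0.55 for correctness accuracy and 0.75 for image-relevance accuracy.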
Related papers
- Benchmarking large language models for materials synthesis: the case of atomic layer deposition [0.07528462379265576]
We introduce an open-ended question benchmark, ALDbench, to evaluate the performance of large language models (LLMs) in materials synthesis.
Our benchmark comprises questions whose difficulty ranges from graduate level to that of a domain expert current with the state of the art in the field.
arXiv Detail & Related papers (2024-12-13T05:10:29Z) - How to Engage Your Readers? Generating Guiding Questions to Promote Active Reading [60.19226384241482]
We introduce GuidingQ, a dataset of 10K in-text questions from textbooks and scientific articles.
We explore various approaches to generate such questions using language models.
We conduct a human study to understand the implication of such questions on reading comprehension.
arXiv Detail & Related papers (2024-07-19T13:42:56Z) - SyllabusQA: A Course Logistics Question Answering Dataset [45.90423821963144]
We introduce SyllabusQA, an open-source dataset with 63 real course syllabi covering 36 majors, containing 5,078 open-ended course logistics-related question-answer pairs.
We benchmark several strong baselines on this task, from large language model prompting to retrieval-augmented generation.
We find that although automated approaches perform close to humans on traditional metrics of textual similarity, a significant gap remains between them and humans in terms of fact precision.
arXiv Detail & Related papers (2024-03-03T03:01:14Z) - ExpertQA: Expert-Curated Questions and Attributed Answers [51.68314045809179]
We conduct human evaluation of responses from a few representative systems along various axes of attribution and factuality.
We collect expert-curated questions from 484 participants across 32 fields of study, and then ask the same experts to evaluate generated responses to their own questions.
The output of our analysis is ExpertQA, a high-quality long-form QA dataset with 2177 questions spanning 32 fields, along with verified answers and attributions for claims in the answers.
arXiv Detail & Related papers (2023-09-14T16:54:34Z) - Covering Uncommon Ground: Gap-Focused Question Generation for Answer Assessment [75.59538732476346]
We focus on the problem of generating such gap-focused questions (GFQs) automatically.
We define the task, highlight key desired aspects of a good GFQ, and propose a model that satisfies these.
arXiv Detail & Related papers (2023-07-06T22:21:42Z) - Discourse Comprehension: A Question Answering Framework to Represent Sentence Connections [35.005593397252746]
A key challenge in building and evaluating models for discourse comprehension is the lack of annotated data.
This paper presents a novel paradigm that enables scalable data collection targeting the comprehension of news documents.
The resulting corpus, DCQA, consists of 22,430 question-answer pairs across 607 English documents.
arXiv Detail & Related papers (2021-11-01T04:50:26Z) - A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers [66.11048565324468]
We present a dataset of 5,049 questions over 1,585 Natural Language Processing papers.
Each question is written by an NLP practitioner who read only the title and abstract of the corresponding paper, and the question seeks information present in the full text.
We find that existing models that do well on other QA tasks do not perform well on answering these questions, underperforming humans by at least 27 F1 points when answering them from entire papers.
arXiv Detail & Related papers (2021-05-07T00:12:34Z) - Inquisitive Question Generation for High Level Text Comprehension [60.21497846332531]
We introduce INQUISITIVE, a dataset of 19K questions that are elicited while a person is reading through a document.
We show that readers engage in a series of pragmatic strategies to seek information.
We evaluate question generation models based on GPT-2 and show that our model is able to generate reasonable questions.
arXiv Detail & Related papers (2020-10-04T19:03:39Z) - R2DE: a NLP approach to estimating IRT parameters of newly generated questions [3.364554138758565]
R2DE is a model capable of assessing newly generated multiple-choice questions by looking at the text of the question.
In particular, it can estimate the difficulty and the discrimination of each question, the standard Item Response Theory (IRT) item parameters; a brief sketch of these follows this list.
arXiv Detail & Related papers (2020-01-21T14:31:01Z)
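For context on the R2DE entry above: the "difficulty" and "discrimination" it estimates are item parameters of an Item Response Theory model. The two-parameter logistic (2PL) model below is the standard formulation of such a model, given here only as background; it is not necessarily R2DE's exact parameterization.

```latex
% Two-parameter logistic (2PL) IRT model: probability that a student with
% ability \theta answers an item correctly, given the item's discrimination a
% and difficulty b. Standard formulation; R2DE's exact parameterization may differ.
\[
  P(\mathrm{correct} \mid \theta) = \frac{1}{1 + e^{-a(\theta - b)}}
\]
```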