Interactive Sketchpad: A Multimodal Tutoring System for Collaborative, Visual Problem-Solving
- URL: http://arxiv.org/abs/2503.16434v2
- Date: Wed, 02 Apr 2025 01:03:51 GMT
- Title: Interactive Sketchpad: A Multimodal Tutoring System for Collaborative, Visual Problem-Solving
- Authors: Steven-Shine Chen, Jimin Lee, Paul Pu Liang
- Abstract summary: This paper introduces Interactive Sketchpad, a tutoring system that combines language-based explanations with interactive visualizations to enhance learning. User studies conducted on math problems in geometry, calculus, and trigonometry demonstrate that Interactive Sketchpad leads to improved task comprehension, problem-solving accuracy, and engagement levels.
- Score: 25.22658210339668
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans have long relied on visual aids like sketches and diagrams to support reasoning and problem-solving. Visual tools, like auxiliary lines in geometry or graphs in calculus, are essential for understanding complex ideas. However, many tutoring systems remain text-based, providing feedback only through natural language. Leveraging recent advances in Large Multimodal Models (LMMs), this paper introduces Interactive Sketchpad, a tutoring system that combines language-based explanations with interactive visualizations to enhance learning. Built on a pre-trained LMM, Interactive Sketchpad is fine-tuned to provide step-by-step guidance in both text and visuals, enabling natural multimodal interaction with the student. Accurate and robust diagrams are generated by incorporating code execution into the reasoning process. User studies conducted on math problems such as geometry, calculus, and trigonometry demonstrate that Interactive Sketchpad leads to improved task comprehension, problem-solving accuracy, and engagement levels, highlighting its potential for transforming educational technologies. All code is available at: https://stevenshinechen.github.io/interactivesketchpad/.
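A rough sketch may help make the abstract's key mechanism concrete: diagrams are generated by executing code inside the reasoning loop rather than asking the model to paint pixels. Everything below is an assumption for illustration; the function names, the hand-written stand-in for model output, and the single-turn flow are hypothetical, not the authors' pipeline (which is available at the repository linked above).

```python
# Illustrative sketch: one tutoring turn that pairs a text hint with a
# diagram produced by executing model-generated plotting code.
import matplotlib
matplotlib.use("Agg")  # render off-screen; the saved image goes to the student
import matplotlib.pyplot as plt
import numpy as np

def render_diagram(code: str, out_path: str = "hint.png") -> str:
    """Execute model-generated plotting code and save the resulting figure.

    Executing real code (instead of having the model draw pixels directly)
    is what keeps the diagram accurate and reproducible.
    """
    namespace = {"plt": plt, "np": np}
    exec(code, namespace)  # assumption: code is sandboxed/validated upstream
    plt.savefig(out_path)
    plt.close("all")
    return out_path

# Hand-written stand-in for what a fine-tuned LMM might emit for a
# calculus question such as "Where is f(x) = x^3 - 3x increasing?"
generated_code = """
x = np.linspace(-3, 3, 300)
plt.plot(x, x**3 - 3*x, label="f(x)")
plt.plot(x, 3*x**2 - 3, "--", label="f'(x)")
plt.axhline(0, color="gray", linewidth=0.5)
plt.legend()
plt.title("f increases where f' > 0")
"""
image_path = render_diagram(generated_code)
hint_text = "Look where the dashed derivative curve sits above zero."
# The (hint_text, image_path) pair forms one multimodal tutoring turn.
```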
Related papers
- MathMistake Checker: A Comprehensive Demonstration for Step-by-Step Math Problem Mistake Finding by Prompt-Guided LLMs [13.756898876556455]
We propose a novel system, MathMistake Checker, to automate step-by-step mistake finding in mathematical problems with lengthy answers. The system aims to simplify grading, increase efficiency, and enhance learning experiences from a pedagogical perspective.
arXiv Detail & Related papers (2025-03-06T10:19:01Z) - Prompt Programming: A Platform for Dialogue-based Computational Problem Solving with Generative AI Models [22.339868419855904]
Students increasingly rely on generative AI tools for programming assistance, often without formal instruction or guidance.
This highlights a need to teach students how to effectively interact with AI models.
We developed a novel platform for prompt programming that enables authentic dialogue-based interactions.
arXiv Detail & Related papers (2025-03-06T09:56:07Z) - VISTA: Visual Integrated System for Tailored Automation in Math Problem Generation Using LLM [0.5383910843560784]
This paper introduces a novel multi-agent framework that leverages Large Language Models (LLMs) to automate the creation of complex mathematical visualizations alongside coherent problem text.
Our approach not only simplifies the generation of precise visual aids but also aligns these aids with the problem's core mathematical concepts, improving both problem creation and assessment.
arXiv Detail & Related papers (2024-11-08T09:15:56Z) - Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models [139.9581209765338]
Sketchpad is a framework that gives multimodal LMs a visual sketchpad and tools to draw on it.
It enables LMs to draw with lines, boxes, marks, etc., which is closer to human sketching and better facilitates reasoning.
Sketchpad substantially improves performance on all tasks over strong base models that cannot sketch. A hypothetical sketch of such a tool interface appears after this list.
arXiv Detail & Related papers (2024-06-13T17:59:31Z) - Describe-then-Reason: Improving Multimodal Mathematical Reasoning through Visual Comprehension Training [24.989732666940153]
Open-source multimodal large language models (MLLMs) excel in various tasks involving textual and visual inputs.
However, they still struggle with complex multimodal mathematical reasoning, lagging behind proprietary models like GPT-4V(ision) and Gemini-Pro.
We propose VCAR, a two-step training pipeline that emphasizes visual reasoning training in addition to mathematical learning.
arXiv Detail & Related papers (2024-04-22T21:59:35Z) - Visual Programming: Compositional visual reasoning without training [24.729624386851388]
VISPROG is a neuro-symbolic approach to solving complex and compositional visual tasks.
It uses the in-context learning ability of large language models to generate Python-like modular programs.
arXiv Detail & Related papers (2022-11-18T18:50:09Z) - Multimodal Lecture Presentations Dataset: Understanding Multimodality in Educational Slides [57.86931911522967]
We test the capabilities of machine learning models in multimodal understanding of educational content.
Our dataset contains aligned slides and spoken language, covering 180+ hours of video and 9,000+ slides from 10 lecturers across various subjects.
We introduce PolyViLT, a multimodal transformer trained with a multi-instance learning loss that is more effective than current approaches.
arXiv Detail & Related papers (2022-08-17T05:30:18Z) - mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z) - I Know What You Draw: Learning Grasp Detection Conditioned on a Few Freehand Sketches [74.63313641583602]
We propose a method to generate a potential grasp configuration relevant to the sketch-depicted objects.
Our model is trained and tested end-to-end, making it easy to implement in real-world applications.
arXiv Detail & Related papers (2022-05-09T04:23:36Z) - SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning [61.57887011165744]
Multimodal Transformers have made great progress on the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z) - Learning Adaptive Language Interfaces through Decomposition [89.21937539950966]
We introduce a neural semantic parsing system that learns new high-level abstractions through decomposition.
Users interactively teach the system by breaking down high-level utterances describing novel behavior into low-level steps.
arXiv Detail & Related papers (2020-10-11T08:27:07Z)
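The Visual Sketchpad entry above describes giving an LM a canvas plus drawing tools. Below is a minimal, hypothetical sketch of such a tool interface; the class and method names are invented for illustration and do not reflect that paper's actual toolset.

```python
# Hypothetical sketch of a "sketchpad plus tools" interface: the model's
# visual chain of thought becomes a sequence of discrete tool calls.
from dataclasses import dataclass, field

@dataclass
class Sketchpad:
    """A canvas the model edits through tool calls (names are illustrative)."""
    strokes: list = field(default_factory=list)

    def draw_line(self, x0: float, y0: float, x1: float, y1: float) -> None:
        self.strokes.append(("line", x0, y0, x1, y1))

    def draw_box(self, x0: float, y0: float, x1: float, y1: float) -> None:
        self.strokes.append(("box", x0, y0, x1, y1))

    def mark(self, x: float, y: float, label: str) -> None:
        self.strokes.append(("mark", x, y, label))

# E.g., adding an auxiliary line in a geometry problem, then annotating
# the point to reason about before re-inspecting the canvas:
pad = Sketchpad()
pad.draw_line(0, 0, 4, 3)
pad.mark(2, 1.5, "midpoint M")
print(pad.strokes)
```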
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.