Towards a Multimodal Document-grounded Conversational AI System for Education
- URL: http://arxiv.org/abs/2504.13884v1
- Date: Fri, 04 Apr 2025 00:04:19 GMT
- Title: Towards a Multimodal Document-grounded Conversational AI System for Education
- Authors: Karan Taneja, Anjali Singh, Ashok K. Goel
- Abstract summary: We present MuDoC, a Multimodal Document-grounded Conversational AI system based on GPT-4o. Its interface allows verification of AI-generated content through seamless navigation to the source. Our findings indicate that both visuals and verifiability of content foster learner engagement and trust; however, no significant impact on performance was observed.
- Score: 5.228830802958218
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimedia learning using text and images has been shown to improve learning outcomes compared to text-only instruction. But conversational AI systems in education predominantly rely on text-based interactions, while multimodal conversations for multimedia learning remain unexplored. Moreover, deploying conversational AI in learning contexts requires grounding in reliable sources and verifiability to create trust. We present MuDoC, a Multimodal Document-grounded Conversational AI system based on GPT-4o that leverages both text and visuals from documents to generate responses with interleaved text and images. Its interface allows verification of AI-generated content through seamless navigation to the source. We compare MuDoC to a text-only system to explore differences in learner engagement, trust in the AI system, and performance on problem-solving tasks. Our findings indicate that both visuals and verifiability of content enhance learner engagement and foster trust; however, no significant impact on performance was observed. We draw upon theories from the cognitive and learning sciences to interpret the findings, derive implications, and outline future directions for the development of multimodal conversational AI systems in education.
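To make the pipeline concrete, below is a minimal sketch of one document-grounded, multimodal Q&A turn in the style the abstract describes. The retriever, chunk schema, and prompt format are illustrative assumptions, not MuDoC's actual implementation; only the GPT-4o chat-completions call reflects a real API.

```python
# A minimal sketch (not MuDoC's actual code) of one document-grounded,
# multimodal Q&A turn: retrieve text/figure chunks from the source
# document, then ask GPT-4o to answer with page citations and figure
# placeholders that a UI could render and link back to the source.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Toy in-memory "index"; a real system would embed and search the document.
CHUNKS = [
    {"kind": "text", "page": 12, "content": "An agent perceives its environment through sensors..."},
    {"kind": "figure", "page": 13, "content": "Diagram of the agent-environment interaction loop."},
]

def retrieve_chunks(query: str, k: int = 2) -> list[dict]:
    """Hypothetical retriever: rank chunks by naive keyword overlap."""
    def score(c):
        return sum(w in c["content"].lower() for w in query.lower().split())
    return sorted(CHUNKS, key=score, reverse=True)[:k]

def answer(query: str) -> str:
    context = "\n".join(
        f"[{c['kind']} p.{c['page']}] {c['content']}" for c in retrieve_chunks(query)
    )
    prompt = (
        "Answer strictly from the sources below. Cite pages like (p.12) and "
        "insert <figure page=N> where a figure would help, so the interface "
        "can render the image inline and link back to the source page.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content
```

Emitting machine-readable placeholders rather than raw images is one simple way an interface can interleave figures with text and support click-through verification of AI-generated content.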
Related papers
- MuDoC: An Interactive Multimodal Document-grounded Conversational AI System [4.7191037525744735]
Building a multimodal document-grounded AI system to interact with long documents remains a challenge. We present an interactive conversational AI agent 'MuDoC' based on GPT-4o to generate document-grounded responses with interleaved text and figures.
arXiv Detail & Related papers (2025-02-14T01:05:51Z)
- TECO: Improving Multimodal Intent Recognition with Text Enhancement through Commonsense Knowledge Extraction [0.0]
This paper proposes a Text Enhancement with CommOnsense Knowledge Extractor (TECO) to address these challenges. We begin by extracting relations from both generated and retrieved knowledge to enrich the contextual information in the text modality. We then align and integrate visual and acoustic representations with these enhanced text features to form a cohesive multimodal representation.
arXiv Detail & Related papers (2024-12-11T16:38:48Z)
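As a rough illustration of the alignment step described in the entry above, the sketch below projects acoustic and visual features into the text space and lets the knowledge-enhanced text tokens attend to them. The dimensions and fusion strategy are assumptions, not TECO's reported architecture.

```python
# A generic fusion sketch, not TECO's reported architecture: project
# acoustic and visual features into the text space, then let the
# knowledge-enhanced text tokens attend to them. Dimensions are assumed.
import torch
import torch.nn as nn

class SimpleModalityFusion(nn.Module):
    def __init__(self, d_text=768, d_audio=128, d_video=512):
        super().__init__()
        self.audio_proj = nn.Linear(d_audio, d_text)
        self.video_proj = nn.Linear(d_video, d_text)
        self.attn = nn.MultiheadAttention(d_text, num_heads=8, batch_first=True)

    def forward(self, text, audio, video):
        # text:  (B, T, d_text) knowledge-enhanced text features
        # audio: (B, Ta, d_audio), video: (B, Tv, d_video)
        ctx = torch.cat([self.audio_proj(audio), self.video_proj(video)], dim=1)
        fused, _ = self.attn(query=text, key=ctx, value=ctx)
        return text + fused  # residual: a cohesive multimodal representation
```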
- Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection [72.36017150922504]
We propose a multi-modal contextual knowledge distillation framework, MMC-Det, to transfer the learned contextual knowledge from a teacher fusion transformer to a student detector.
Diverse multi-modal masked language modeling is realized by imposing an object divergence constraint on traditional multi-modal masked language modeling (MLM).
arXiv Detail & Related papers (2023-08-30T08:33:13Z)
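The core transfer step in a teacher-to-student framework like the one above is a distillation loss between teacher and student features. Below is a generic temperature-softened KL sketch; the exact MMC-Det objective (including its object divergence constraint) is not reproduced here.

```python
# A generic softened-KL distillation loss; the exact MMC-Det objective
# (including its object divergence constraint) is not reproduced here.
import torch.nn.functional as F

def distill_loss(student_feats, teacher_feats, temperature=2.0):
    """KL divergence between temperature-softened student/teacher features."""
    s = F.log_softmax(student_feats / temperature, dim=-1)
    t = F.softmax(teacher_feats / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature**2
```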
- Channel-aware Decoupling Network for Multi-turn Dialogue Comprehension [81.47133615169203]
We propose compositional learning for holistic interaction across utterances beyond the sequential contextualization from PrLMs.
We employ domain-adaptive training strategies to help the model adapt to the dialogue domains.
Experimental results show that our method substantially boosts the strong PrLM baselines in four public benchmark datasets.
arXiv Detail & Related papers (2023-01-10T13:18:25Z)
- Vision+X: A Survey on Multimodal Learning in the Light of Data [64.03266872103835]
Multimodal machine learning that incorporates data from various sources has become an increasingly popular research area.
We analyze the commonness and uniqueness of each data format, mainly covering vision, audio, text, and motion.
We investigate the existing literature on multimodal learning from both the representation learning and downstream application levels.
arXiv Detail & Related papers (2022-10-05T13:14:57Z)
- Multimodal Lecture Presentations Dataset: Understanding Multimodality in Educational Slides [57.86931911522967]
We test the capabilities of machine learning models in multimodal understanding of educational content.
Our dataset contains aligned slides and spoken language across 180+ hours of video and 9,000+ slides, with 10 lecturers from various subjects.
We introduce PolyViLT, a multimodal transformer trained with a multi-instance learning loss that is more effective than current approaches.
arXiv Detail & Related papers (2022-08-17T05:30:18Z)
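A multi-instance learning loss of the kind mentioned above credits a clip for matching any positive segment within its bag of candidates. The sketch below follows the common MIL-NCE pattern and is not PolyViLT's exact training objective.

```python
# A multiple-instance contrastive loss in the spirit of MIL-NCE; this is
# a common pattern, not PolyViLT's exact training objective.
import torch

def mil_nce_loss(video_emb, text_embs, pos_mask):
    """
    video_emb: (B, D)    one embedding per video clip
    text_embs: (B, N, D) a "bag" of N candidate text segments per clip
    pos_mask:  (B, N)    1.0 where a segment is a positive instance
    """
    sims = torch.einsum("bd,bnd->bn", video_emb, text_embs).exp()  # (B, N)
    pos = (sims * pos_mask).sum(dim=1)       # credit any positive in the bag
    return -(pos / sims.sum(dim=1)).clamp_min(1e-8).log().mean()
```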
- Multimodal Dialog Systems with Dual Knowledge-enhanced Generative Pretrained Language Model [63.461030694700014]
We propose a novel dual knowledge-enhanced generative pretrained language model for multimodal task-oriented dialog systems (DKMD).
The proposed DKMD consists of three key components: dual knowledge selection, dual knowledge-enhanced context learning, and knowledge-enhanced response generation.
Experiments on a public dataset verify the superiority of the proposed DKMD over state-of-the-art competitors.
arXiv Detail & Related papers (2022-07-16T13:02:54Z)
- Enabling Harmonious Human-Machine Interaction with Visual-Context Augmented Dialogue System: A Review [40.49926141538684]
The Visual Context Augmented Dialogue System (VAD) has the potential to communicate with humans by perceiving and understanding multimodal information, and can thereby generate engaging and context-aware responses.
arXiv Detail & Related papers (2022-07-02T09:31:37Z)
- Knowledge Augmented BERT Mutual Network in Multi-turn Spoken Dialogues [6.4144180888492075]
We propose to equip a BERT-based joint model with a knowledge attention module to mutually leverage dialogue contexts between two SLU tasks.
A gating mechanism is further utilized to filter out irrelevant knowledge triples and avoid distracting the model's comprehension.
Experimental results on two complicated multi-turn dialogue datasets demonstrate the effectiveness of mutually modeling the two SLU tasks with filtered knowledge and dialogue contexts.
arXiv Detail & Related papers (2022-02-23T04:03:35Z)
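The gating mechanism described above can be illustrated with a standard sigmoid gate over retrieved triple embeddings; the formulation below is a common pattern, not necessarily the paper's exact equations.

```python
# A standard sigmoid knowledge gate; a common pattern, not necessarily
# the paper's exact formulation of its gating mechanism.
import torch
import torch.nn as nn

class KnowledgeGate(nn.Module):
    def __init__(self, d_model=768):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, 1)

    def forward(self, utterance, triples):
        # utterance: (B, D) dialogue-context vector
        # triples:   (B, K, D) embeddings of K retrieved knowledge triples
        ctx = utterance.unsqueeze(1).expand_as(triples)
        g = torch.sigmoid(self.gate(torch.cat([ctx, triples], dim=-1)))
        return (g * triples).sum(dim=1)  # irrelevant triples are gated toward 0
```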
- Contrastive Representation Learning: A Framework and Review [2.7393821783237184]
The origins of Contrastive Learning date as far back as the 1990s, and its development has spanned many fields.
We propose a general Contrastive Representation Learning framework that simplifies and unifies many different contrastive learning methods.
Examples of how contrastive learning has been applied in computer vision, natural language processing, audio processing, and other domains, as well as in Reinforcement Learning, are also presented.
arXiv Detail & Related papers (2020-10-10T22:46:25Z)
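A canonical instance of the framework this survey unifies is the InfoNCE objective; the sketch below is the standard paired-views formulation, not code from the paper.

```python
# The standard paired-views InfoNCE loss, one canonical instance of the
# contrastive framework; not code from the paper.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (B, D) embeddings of two views; row i of z1 pairs with row i of z2."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature  # (B, B) cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)  # positives lie on the diagonal
```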
- Improving Machine Reading Comprehension with Contextualized Commonsense Knowledge [62.46091695615262]
We aim to extract commonsense knowledge to improve machine reading comprehension.
We propose to represent relations implicitly by situating structured knowledge in a context.
We employ a teacher-student paradigm to inject multiple types of contextualized knowledge into a student machine reader.
arXiv Detail & Related papers (2020-09-12T17:20:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.