MuDoC: An Interactive Multimodal Document-grounded Conversational AI System
- URL: http://arxiv.org/abs/2502.09843v1
- Date: Fri, 14 Feb 2025 01:05:51 GMT
- Title: MuDoC: An Interactive Multimodal Document-grounded Conversational AI System
- Authors: Karan Taneja, Ashok K. Goel
- Abstract summary: Building a multimodal document-grounded AI system to interact with long documents remains a challenge. We present an interactive conversational AI agent 'MuDoC' based on GPT-4o to generate document-grounded responses with interleaved text and figures.
- Score: 4.7191037525744735
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal AI is an important step towards building effective tools that leverage multiple modalities in human-AI communication. Building a multimodal document-grounded AI system to interact with long documents remains a challenge. Our work aims to fill the research gap of directly leveraging grounded visuals from documents, alongside their textual content, for response generation. We present an interactive conversational AI agent 'MuDoC' based on GPT-4o that generates document-grounded responses with interleaved text and figures. MuDoC's intelligent textbook interface promotes trustworthiness and enables verification of system responses by allowing instant navigation to the source text and figures in the documents. We also discuss qualitative observations based on MuDoC responses, highlighting its strengths and limitations.
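The abstract gives no implementation detail beyond the GPT-4o backbone, so the following is only a minimal sketch, under stated assumptions, of how a document-grounded response with interleaved text and figures might be produced. The prompt format, the [pN]/[figN] citation markers, and the assumption that relevant passages and figure captions are retrieved upstream are illustrative choices, not MuDoC's actual design.

```python
# Minimal illustrative sketch, not MuDoC's code: the abstract only states that
# GPT-4o generates document-grounded responses with interleaved text and figures.
# Retrieval of passages and figure captions is assumed to happen upstream.
from openai import OpenAI

client = OpenAI()

def answer(query: str, passages: list[str], figure_captions: list[str]) -> str:
    """Ask GPT-4o to answer from retrieved passages and to cite figures by id,
    so a UI layer can interleave the images and link back to the source pages."""
    context = "\n".join(f"[p{i}] {p}" for i, p in enumerate(passages))
    figures = "\n".join(f"[fig{i}] {c}" for i, c in enumerate(figure_captions))
    prompt = (
        "Answer the question using only the passages and figures below. "
        "Cite passages as [pN] and insert figures as [figN] where they help.\n\n"
        f"Passages:\n{context}\n\nFigures:\n{figures}\n\nQuestion: {query}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

A front end could then replace each [figN] marker with the corresponding image and link each [pN] citation back to its page, which approximates the verification affordance the abstract describes.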
Related papers
- A Desideratum for Conversational Agents: Capabilities, Challenges, and Future Directions [51.96890647837277]
Large Language Models (LLMs) have propelled conversational AI from traditional dialogue systems into sophisticated agents capable of autonomous actions, contextual awareness, and multi-turn interactions with users.
This survey paper presents a desideratum for next-generation Conversational Agents - what has been achieved, what challenges persist, and what must be done for more scalable systems that approach human-level intelligence.
arXiv Detail & Related papers (2025-04-07T21:01:25Z) - Towards a Multimodal Document-grounded Conversational AI System for Education [5.228830802958218]
We present MuDoC, a Multimodal Document-grounded Conversational AI system based on GPT-4o.
Its interface allows verification of AI-generated content through seamless navigation to the source.
Our findings indicate that both visuals and verifiability of content foster learner engagement and trust; however, no significant impact on performance was observed.
arXiv Detail & Related papers (2025-04-04T00:04:19Z) - VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation [100.06122876025063]
This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings. We propose VisDoMRAG, a novel multimodal Retrieval Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG.
arXiv Detail & Related papers (2024-12-14T06:24:55Z) - Unified Multimodal Interleaved Document Representation for Retrieval [57.65409208879344]
We propose a method that holistically embeds documents interleaved with multiple modalities. We merge the representations of segmented passages into one single document representation (a toy version of this pooling step is sketched after this list). We show that our approach substantially outperforms relevant baselines.
arXiv Detail & Related papers (2024-10-03T17:49:09Z) - Documentation Practices of Artificial Intelligence [0.5937476291232799]
We provide an overview of prevailing trends, persistent issues, and the interplay of factors influencing AI documentation.
Our examination of key characteristics such as scope, target audiences, support for multimodality, and level of automation highlights a shift towards more holistic, engaging, and automated documentation.
arXiv Detail & Related papers (2024-06-26T08:33:52Z) - KamerRaad: Enhancing Information Retrieval in Belgian National Politics through Hierarchical Summarization and Conversational Interfaces [55.00702535694059]
KamerRaad is an AI tool that leverages large language models to help citizens interactively engage with Belgian political information.
The tool extracts and concisely summarizes key excerpts from parliamentary proceedings, which users can then explore interactively through generative AI.
arXiv Detail & Related papers (2024-04-22T15:01:39Z) - FCC: Fusing Conversation History and Candidate Provenance for Contextual Response Ranking in Dialogue Systems [53.89014188309486]
We present a flexible neural framework that can integrate contextual information from multiple channels.
We evaluate our model on the MSDialog dataset widely used for evaluating conversational response ranking tasks.
arXiv Detail & Related papers (2023-03-31T23:58:28Z) - Enabling Harmonious Human-Machine Interaction with Visual-Context Augmented Dialogue System: A Review [40.49926141538684]
The Visual Context Augmented Dialogue System (VAD) has the potential to communicate with humans by perceiving and understanding multimodal information, and to generate engaging and context-aware responses.
arXiv Detail & Related papers (2022-07-02T09:31:37Z) - End-to-end Spoken Conversational Question Answering: Task, Dataset and Model [92.18621726802726]
In spoken question answering, the systems are designed to answer questions from contiguous text spans within the related speech transcripts.
We propose a new Spoken Conversational Question Answering task (SCQA), aiming at enabling the systems to model complex dialogue flows.
Our main objective is to build a system that handles conversational questions based on audio recordings, and to explore the plausibility of providing systems with additional cues from different modalities during information gathering.
arXiv Detail & Related papers (2022-04-29T17:56:59Z) - DIALKI: Knowledge Identification in Conversational Systems through Dialogue-Document Contextualization [41.21012318918167]
We introduce a knowledge identification model that leverages the document structure to provide dialogue-contextualized passage encodings.
We demonstrate the effectiveness of our model on two document-grounded conversational datasets.
arXiv Detail & Related papers (2021-09-10T05:40:37Z) - Exploring Recurrent, Memory and Attention Based Architectures for Scoring Interactional Aspects of Human-Machine Text Dialog [9.209192502526285]
This paper builds on previous work in this direction to investigate multiple neural architectures.
We conduct experiments on a conversational database of text dialogs from human learners interacting with a cloud-based dialog system.
We find that fusion of multiple architectures performs competently on our automated scoring task relative to expert inter-rater agreements.
arXiv Detail & Related papers (2020-05-20T03:23:00Z)
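The 'Unified Multimodal Interleaved Document Representation for Retrieval' entry above merges segment-level representations into a single document vector. Below is a minimal sketch of that pooling idea, assuming off-the-shelf sentence embeddings, mean pooling, and figure captions standing in for true image features; the paper's actual encoder and merging strategy are not specified in the summary.

```python
# Toy illustration only: embed interleaved segments (text plus figure captions
# as a stand-in for image features) and mean-pool them into one document vector.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def document_embedding(segments: list[str]) -> np.ndarray:
    """Encode each interleaved segment and mean-pool into a single document vector."""
    seg_vectors = model.encode(segments, normalize_embeddings=True)
    doc_vector = seg_vectors.mean(axis=0)
    return doc_vector / np.linalg.norm(doc_vector)  # renormalize for cosine search

# Usage: rank pooled document vectors against a query by cosine similarity.
docs = [
    ["Intro to neural networks.", "Figure 1: a three-layer perceptron."],
    ["Chapter on retrieval.", "Figure 2: a dense retriever pipeline."],
]
doc_matrix = np.stack([document_embedding(d) for d in docs])
query_vec = model.encode(["How does a perceptron work?"], normalize_embeddings=True)[0]
print((doc_matrix @ query_vec).argmax())  # index of the best-matching document
```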