MuDoC: An Interactive Multimodal Document-grounded Conversational AI System
- URL: http://arxiv.org/abs/2502.09843v1
- Date: Fri, 14 Feb 2025 01:05:51 GMT
- Title: MuDoC: An Interactive Multimodal Document-grounded Conversational AI System
- Authors: Karan Taneja, Ashok K. Goel,
- Abstract summary: Building a multimodal document-grounded AI system to interact with long documents remains a challenge.
We present an interactive conversational AI agent 'MuDoC' based on GPT-4o to generate document-grounded responses with interleaved text and figures.
- Score: 4.7191037525744735
- License:
- Abstract: Multimodal AI is an important step towards building effective tools to leverage multiple modalities in human-AI communication. Building a multimodal document-grounded AI system to interact with long documents remains a challenge. Our work aims to fill the research gap of directly leveraging grounded visuals from documents alongside textual content in documents for response generation. We present an interactive conversational AI agent 'MuDoC' based on GPT-4o to generate document-grounded responses with interleaved text and figures. MuDoC's intelligent textbook interface promotes trustworthiness and enables verification of system responses by allowing instant navigation to source text and figures in the documents. We also discuss qualitative observations based on MuDoC responses highlighting its strengths and limitations.
Related papers
- VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation [100.06122876025063]
This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings.
We propose VisDoMRAG, a novel multimodal Retrieval Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG.
arXiv Detail & Related papers (2024-12-14T06:24:55Z) - Unified Multimodal Interleaved Document Representation for Retrieval [57.65409208879344]
We propose a method that holistically embeds documents interleaved with multiple modalities.
We merge the representations of segmented passages into one single document representation.
We show that our approach substantially outperforms relevant baselines.
arXiv Detail & Related papers (2024-10-03T17:49:09Z) - Documentation Practices of Artificial Intelligence [0.5937476291232799]
We provide an overview of prevailing trends, persistent issues, and the interplay of factors influencing the documentation.
Our examination of key characteristics such as scope, target audiences, support for multimodality, and level of automation highlights a shift towards a more holistic, engaging, and automated documentation.
arXiv Detail & Related papers (2024-06-26T08:33:52Z) - KamerRaad: Enhancing Information Retrieval in Belgian National Politics through Hierarchical Summarization and Conversational Interfaces [55.00702535694059]
KamerRaad is an AI tool that leverages large language models to help citizens interactively engage with Belgian political information.
The tool extracts and concisely summarizes key excerpts from parliamentary proceedings, followed by the potential for interaction based on generative AI.
arXiv Detail & Related papers (2024-04-22T15:01:39Z) - FCC: Fusing Conversation History and Candidate Provenance for Contextual
Response Ranking in Dialogue Systems [53.89014188309486]
We present a flexible neural framework that can integrate contextual information from multiple channels.
We evaluate our model on the MSDialog dataset widely used for evaluating conversational response ranking tasks.
arXiv Detail & Related papers (2023-03-31T23:58:28Z) - Enabling Harmonious Human-Machine Interaction with Visual-Context
Augmented Dialogue System: A Review [40.49926141538684]
Visual Context Augmented Dialogue System (VAD) has the potential to communicate with humans by perceiving and understanding multimodal information.
VAD possesses the potential to generate engaging and context-aware responses.
arXiv Detail & Related papers (2022-07-02T09:31:37Z) - End-to-end Spoken Conversational Question Answering: Task, Dataset and
Model [92.18621726802726]
In spoken question answering, the systems are designed to answer questions from contiguous text spans within the related speech transcripts.
We propose a new Spoken Conversational Question Answering task (SCQA), aiming at enabling the systems to model complex dialogue flows.
Our main objective is to build the system to deal with conversational questions based on the audio recordings, and to explore the plausibility of providing more cues from different modalities with systems in information gathering.
arXiv Detail & Related papers (2022-04-29T17:56:59Z) - DIALKI: Knowledge Identification in Conversational Systems through
Dialogue-Document Contextualization [41.21012318918167]
We introduce a knowledge identification model that leverages the document structure to provide dialogue-contextualized passage encodings.
We demonstrate the effectiveness of our model on two document-grounded conversational datasets.
arXiv Detail & Related papers (2021-09-10T05:40:37Z) - Multi-View Sequence-to-Sequence Models with Conversational Structure for
Abstractive Dialogue Summarization [72.54873655114844]
Text summarization is one of the most challenging and interesting problems in NLP.
This work proposes a multi-view sequence-to-sequence model by first extracting conversational structures of unstructured daily chats from different views to represent conversations.
Experiments on a large-scale dialogue summarization corpus demonstrated that our methods significantly outperformed previous state-of-the-art models via both automatic evaluations and human judgment.
arXiv Detail & Related papers (2020-10-04T20:12:44Z) - Exploring Recurrent, Memory and Attention Based Architectures for
Scoring Interactional Aspects of Human-Machine Text Dialog [9.209192502526285]
This paper builds on previous work in this direction to investigate multiple neural architectures.
We conduct experiments on a conversational database of text dialogs from human learners interacting with a cloud-based dialog system.
We find that fusion of multiple architectures performs competently on our automated scoring task relative to expert inter-rater agreements.
arXiv Detail & Related papers (2020-05-20T03:23:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.