Building Multimodal AI Chatbots
- URL: http://arxiv.org/abs/2305.03512v1
- Date: Fri, 21 Apr 2023 16:43:54 GMT
- Title: Building Multimodal AI Chatbots
- Authors: Min Young Lee
- Abstract summary: This work aims to create a multimodal AI system that chats with humans and shares relevant photos.
It proposes two multimodal deep learning models: an image retriever that understands texts and a response generator that understands images.
The two models are trained and evaluated on PhotoChat, an open-domain dialogue dataset in which a photo is shared in each session.
- Score: 2.1987180245567246
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work aims to create a multimodal AI system that chats with humans and
shares relevant photos. While earlier works were limited to dialogues about
specific objects or scenes within images, recent works have incorporated images
into open-domain dialogues. However, their response generators are unimodal,
accepting text input but no image input, thus prone to generating responses
contradictory to the images shared in the dialogue. Therefore, this work
proposes a complete chatbot system using two multimodal deep learning models:
an image retriever that understands texts and a response generator that
understands images. The image retriever, implemented by ViT and BERT, selects
the most relevant image given the dialogue history and a database of images.
The response generator, implemented by ViT and GPT-2/DialoGPT, generates an
appropriate response given the dialogue history and the most recently retrieved
image. The two models are trained and evaluated on PhotoChat, an open-domain
dialogue dataset in which a photo is shared in each session. In automatic
evaluation, the proposed image retriever outperforms existing baselines VSE++
and SCAN with Recall@1/5/10 of 0.1/0.3/0.4 and MRR of 0.2 when ranking 1,000
images. The proposed response generator also surpasses the baseline Divter with
PPL of 16.9, BLEU-1/2 of 0.13/0.03, and Distinct-1/2 of 0.97/0.86, reducing PPL
by 42.8 and improving BLEU-1/2 by 0.07/0.02. In human
evaluation with a Likert scale of 1-5, the complete multimodal chatbot system
receives higher image-groundedness of 4.3 and engagingness of 4.3, along with
competitive fluency of 4.1, coherence of 3.9, and humanness of 3.1, when
compared to other chatbot variants. The source code is available at:
https://github.com/minniie/multimodal_chat.git.
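The system above is a retrieve-then-generate pipeline: a ViT+BERT dual encoder ranks a photo database against the dialogue history, and a ViT+GPT-2/DialoGPT generator conditions on the retrieved photo. Below is a minimal sketch of the retriever's interface, assuming the bert-base-uncased and google/vit-base-patch16-224-in21k checkpoints, [CLS] pooling, and cosine-similarity scoring; the paper's exact projection layers, fusion, and training objective may differ.

```python
# Sketch of a ViT+BERT dual-encoder image retriever: BERT embeds the
# dialogue history, ViT embeds candidate images, and candidates are
# ranked by cosine similarity. Checkpoints and pooling are illustrative
# assumptions, not the paper's exact configuration.
import torch
from PIL import Image
from transformers import BertTokenizer, BertModel, ViTImageProcessor, ViTModel

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased").to(device).eval()
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k").to(device).eval()

@torch.no_grad()
def encode_dialogue(turns: list[str]) -> torch.Tensor:
    # Join the dialogue history into one sequence; use the [CLS] embedding.
    text = " [SEP] ".join(turns)
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt").to(device)
    return text_encoder(**inputs).last_hidden_state[:, 0]  # (1, 768)

@torch.no_grad()
def encode_images(images: list[Image.Image]) -> torch.Tensor:
    inputs = image_processor(images=images, return_tensors="pt").to(device)
    return image_encoder(**inputs).last_hidden_state[:, 0]  # (N, 768)

def rank_images(turns: list[str], images: list[Image.Image]) -> torch.Tensor:
    # Cosine similarity between the dialogue embedding and each image.
    q = torch.nn.functional.normalize(encode_dialogue(turns), dim=-1)
    g = torch.nn.functional.normalize(encode_images(images), dim=-1)
    scores = (q @ g.T).squeeze(0)           # (N,)
    return scores.argsort(descending=True)  # image indices, best first
```

In practice the two encoders would be fine-tuned contrastively on PhotoChat so that dialogue and image embeddings share an embedding space; without that training, the similarities above only illustrate the interface.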
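The retrieval results above are reported as Recall@1/5/10 and MRR over a 1,000-image ranking. Both metrics follow directly from the rank the retriever assigns to each session's ground-truth photo; a small sketch:

```python
# Recall@K: fraction of queries whose ground-truth image lands in the
# top K. MRR: mean of 1 / rank of the ground-truth image.
def recall_at_k(ranks: list[int], k: int) -> float:
    """ranks holds the 1-based rank of the ground-truth image per query."""
    return sum(r <= k for r in ranks) / len(ranks)

def mean_reciprocal_rank(ranks: list[int]) -> float:
    return sum(1.0 / r for r in ranks) / len(ranks)

# Example: three queries whose true images were ranked 1st, 4th, and 12th.
ranks = [1, 4, 12]
print(recall_at_k(ranks, 5))        # 0.666... (2 of 3 in the top 5)
print(mean_reciprocal_rank(ranks))  # (1 + 0.25 + 0.0833) / 3 ~= 0.444
```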
Related papers
- Self-Directed Turing Test for Large Language Models [56.64615470513102]
The Turing test examines whether AIs can exhibit human-like behaviour in natural language conversations.
Traditional Turing tests adopt a rigid dialogue format in which each participant sends only one message per turn.
This paper proposes the Self-Directed Turing Test, which extends the original test with a burst dialogue format.
arXiv Detail & Related papers (2024-08-19T09:57:28Z)
- BI-MDRG: Bridging Image History in Multimodal Dialogue Response Generation [21.052101309555464]
Multimodal Dialogue Response Generation (MDRG) is a recently proposed task where the model needs to generate responses in texts, images, or a blend of both.
Previous work relies on text as an intermediary for both the model's image input and image output rather than adopting an end-to-end approach.
We propose BI-MDRG, which bridges the response generation path so that image history information is used to make text responses more relevant to the image content.
arXiv Detail & Related papers (2024-08-12T05:22:42Z)
- Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation [55.043492250775294]
We introduce a novel Face-to-Face spoken dialogue model.
It processes audio-visual speech from user input and generates audio-visual speech as the response.
We also introduce MultiDialog, the first large-scale multimodal spoken dialogue corpus.
arXiv Detail & Related papers (2024-06-12T04:48:36Z)
- MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets [29.737965533532577]
Multimodal Augmented Generative Images Dialogues (MAGID) is a framework to augment text-only dialogues with diverse and high-quality images.
Our results show that MAGID is comparable to or better than baselines, with significant improvements in human evaluation.
arXiv Detail & Related papers (2024-03-05T18:31:28Z)
- Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models [60.81438804824749]
Multimodal instruction-following models extend capabilities by integrating both text and images.
Existing models such as MiniGPT-4 and LLaVA face challenges in maintaining dialogue coherence in scenarios involving multiple images.
We introduce SparklesDialogue, the first machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions.
We then present SparklesChat, a multimodal instruction-following model for open-ended dialogues across multiple images.
arXiv Detail & Related papers (2023-08-31T05:15:27Z)
- Chatting Makes Perfect: Chat-based Image Retrieval [25.452015862927766]
ChatIR is a chat-based image retrieval system that engages in a conversation with the user to elicit information.
Large Language Models are used to generate follow-up questions to an initial image description.
Our system is capable of retrieving the target image from a pool of 50K images with a success rate of over 78% after 5 dialogue rounds.
arXiv Detail & Related papers (2023-05-31T17:38:08Z)
- Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text [130.89493542553151]
In-context vision and language models like Flamingo support arbitrarily interleaved sequences of images and text as input.
To support this interface, pretraining occurs over web corpora that similarly contain interleaved images+text.
We release Multimodal C4, an augmentation of the popular text-only C4 corpus with images interleaved.
arXiv Detail & Related papers (2023-04-14T06:17:46Z)
- TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering [86.38098280689027]
We introduce an automatic evaluation metric that measures the faithfulness of a generated image to its text input via visual question answering (VQA); a minimal sketch of this idea appears after this list.
We present a comprehensive evaluation of existing text-to-image models using a benchmark consisting of 4K diverse text inputs and 25K questions across 12 categories (object, counting, etc.).
arXiv Detail & Related papers (2023-03-21T14:41:02Z)
- Pchatbot: A Large-Scale Dataset for Personalized Chatbot [49.16746174238548]
We introduce Pchatbot, a large-scale dialogue dataset that contains two subsets collected from Weibo and Judicial forums respectively.
To adapt the raw dataset to dialogue systems, we carefully normalize it via processes such as anonymization.
The scale of Pchatbot is significantly larger than that of existing Chinese datasets, which might benefit data-driven models.
arXiv Detail & Related papers (2020-09-28T12:49:07Z)
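The TIFA entry above scores a generated image by asking a VQA model questions derived from the text prompt and counting matching answers. The sketch below illustrates that idea with hand-written question-answer pairs and the off-the-shelf Salesforce/blip-vqa-base checkpoint; TIFA itself generates the questions automatically with a language model, so this approximates the interface rather than the published pipeline.

```python
# VQA-based faithfulness scoring in the spirit of TIFA: ask a VQA model
# questions about the generated image and report the fraction answered
# as expected. The QA pairs here are hand-written for illustration.
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

def faithfulness_score(image: Image.Image, qa_pairs: list[tuple[str, str]]) -> float:
    """Fraction of questions the VQA model answers as expected."""
    correct = 0
    for question, expected in qa_pairs:
        inputs = processor(image, question, return_tensors="pt")
        out = model.generate(**inputs)
        answer = processor.decode(out[0], skip_special_tokens=True)
        correct += answer.strip().lower() == expected.lower()
    return correct / len(qa_pairs)

# Prompt: "two dogs playing on a beach"
image = Image.open("generated.png")  # hypothetical text-to-image output
qa = [("how many dogs are there?", "2"), ("where are the dogs?", "beach")]
print(faithfulness_score(image, qa))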
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.