Building Multimodal AI Chatbots
- URL: http://arxiv.org/abs/2305.03512v1
- Date: Fri, 21 Apr 2023 16:43:54 GMT
- Title: Building Multimodal AI Chatbots
- Authors: Min Young Lee
- Abstract summary: This work aims to create a multimodal AI system that chats with humans and shares relevant photos.
It proposes two multimodal deep learning models: an image retriever that understands texts and a response generator that understands images.
The two models are trained and evaluated on PhotoChat, an open-domain dialogue dataset in which a photo is shared in each session.
- Score: 2.1987180245567246
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work aims to create a multimodal AI system that chats with humans and
shares relevant photos. While earlier works were limited to dialogues about
specific objects or scenes within images, recent works have incorporated images
into open-domain dialogues. However, their response generators are unimodal,
accepting text input but no image input, thus prone to generating responses
contradictory to the images shared in the dialogue. Therefore, this work
proposes a complete chatbot system using two multimodal deep learning models:
an image retriever that understands texts and a response generator that
understands images. The image retriever, implemented by ViT and BERT, selects
the most relevant image given the dialogue history and a database of images.
The response generator, implemented by ViT and GPT-2/DialoGPT, generates an
appropriate response given the dialogue history and the most recently retrieved
image. The two models are trained and evaluated on PhotoChat, an open-domain
dialogue dataset in which a photo is shared in each session. In automatic
evaluation, the proposed image retriever outperforms existing baselines VSE++
and SCAN with Recall@1/5/10 of 0.1/0.3/0.4 and MRR of 0.2 when ranking 1,000
images. The proposed response generator also surpasses the baseline Divter with
PPL of 16.9, BLEU-1/2 of 0.13/0.03, and Distinct-1/2 of 0.97/0.86, reducing PPL
by 42.8 and improving BLEU-1/2 by 0.07/0.02. In human
evaluation with a Likert scale of 1-5, the complete multimodal chatbot system
receives higher image-groundedness of 4.3 and engagingness of 4.3, along with
competitive fluency of 4.1, coherence of 3.9, and humanness of 3.1, when
compared to other chatbot variants. The source code is available at:
https://github.com/minniie/multimodal_chat.git.
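The automatic evaluation above ranks 1,000 candidate images per dialogue and reports Recall@1/5/10 and MRR. A minimal, self-contained sketch of those two metrics follows; the similarity scores and function names here are illustrative stand-ins (in the paper, scores would come from the ViT/BERT dual-encoder retriever):

```python
# Sketch of retrieval metrics (Recall@K, MRR) over per-query similarity scores.
# The scores below are toy values; any scorer mapping (dialogue, image) pairs
# to similarities fits this interface.

def rank_of_target(scores, target_idx):
    """1-based rank of the target image when candidates are sorted by score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order.index(target_idx) + 1

def recall_at_k(ranks, k):
    """Fraction of queries whose target image appears in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Toy example: 4 dialogue queries, each ranking 5 candidate images.
ranks = [
    rank_of_target([0.9, 0.1, 0.3, 0.2, 0.5], target_idx=0),  # rank 1
    rank_of_target([0.4, 0.8, 0.6, 0.1, 0.2], target_idx=2),  # rank 2
    rank_of_target([0.1, 0.2, 0.3, 0.4, 0.9], target_idx=0),  # rank 5
    rank_of_target([0.3, 0.7, 0.2, 0.9, 0.1], target_idx=3),  # rank 1
]
print(recall_at_k(ranks, 1))                     # → 0.5
print(round(mean_reciprocal_rank(ranks), 3))     # → 0.675
```

In the paper's setting the candidate pool is 1,000 images rather than 5, but the metric definitions are the same.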
Related papers
- Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation [55.043492250775294]
We introduce a novel Face-to-Face spoken dialogue model.
It processes audio-visual speech from user input and generates audio-visual speech as the response.
We also introduce MultiDialog, the first large-scale multimodal spoken dialogue corpus.
arXiv Detail & Related papers (2024-06-12T04:48:36Z) - MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets [30.72744231027204]
Development of multimodal interactive systems is hindered by the lack of rich, multimodal (text, images) conversational data.
We introduce Multimodal Augmented Generative Images Dialogues (MAGID) to augment text-only dialogues with diverse and high-quality images.
arXiv Detail & Related papers (2024-03-05T18:31:28Z) - Compress & Align: Curating Image-Text Data with Human Knowledge [36.34714164235438]
This paper introduces a novel algorithm, rooted in human knowledge, to compress web-crawled image-text datasets to a compact and high-quality form.
A reward model on the annotated dataset internalizes the nuanced human understanding of image-text alignment.
Experiments demonstrate that we are able to secure (or even improve) model performance by compressing the image-text datasets up to 90%.
arXiv Detail & Related papers (2023-12-11T05:57:09Z) - Towards Better Multi-modal Keyphrase Generation via Visual Entity Enhancement and Multi-granularity Image Noise Filtering [79.44443231700201]
Multi-modal keyphrase generation aims to produce a set of keyphrases that represent the core points of the input text-image pair.
The input text and image are often not perfectly matched, and thus the image may introduce noise into the model.
We propose a novel multi-modal keyphrase generation model, which not only enriches the model input with external knowledge, but also effectively filters image noise.
arXiv Detail & Related papers (2023-09-09T09:41:36Z) - Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models [64.43988773982852]
We present SparklesChat, a multimodal instruction-following model for open-ended dialogues across multiple images.
To support the training, we introduce SparklesDialogue, the first machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions.
Our experiments validate the effectiveness of SparklesChat in understanding and reasoning across multiple images and dialogue turns.
arXiv Detail & Related papers (2023-08-31T05:15:27Z) - Chatting Makes Perfect: Chat-based Image Retrieval [25.452015862927766]
ChatIR is a chat-based image retrieval system that engages in a conversation with the user to elicit information.
Large Language Models are used to generate follow-up questions to an initial image description.
Our system is capable of retrieving the target image from a pool of 50K images with over 78% success rate after 5 dialogue rounds.
arXiv Detail & Related papers (2023-05-31T17:38:08Z) - Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text [130.89493542553151]
In-context vision and language models like Flamingo support arbitrarily interleaved sequences of images and text as input.
To support this interface, pretraining occurs over web corpora that similarly contain interleaved images+text.
We release Multimodal C4, an augmentation of the popular text-only C4 corpus with images interleaved.
arXiv Detail & Related papers (2023-04-14T06:17:46Z) - TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation
with Question Answering [86.38098280689027]
We introduce an automatic evaluation metric that measures the faithfulness of a generated image to its text input via visual question answering (VQA).
We present a comprehensive evaluation of existing text-to-image models using a benchmark consisting of 4K diverse text inputs and 25K questions across 12 categories (object, counting, etc.).
arXiv Detail & Related papers (2023-03-21T14:41:02Z) - Pchatbot: A Large-Scale Dataset for Personalized Chatbot [49.16746174238548]
We introduce Pchatbot, a large-scale dialogue dataset that contains two subsets collected from Weibo and Judicial forums respectively.
To adapt the raw data to dialogue systems, we carefully normalize it via processes such as anonymization.
The scale of Pchatbot is significantly larger than existing Chinese datasets, which may benefit data-driven dialogue models.
arXiv Detail & Related papers (2020-09-28T12:49:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.