ZRIGF: An Innovative Multimodal Framework for Zero-Resource
Image-Grounded Dialogue Generation
- URL: http://arxiv.org/abs/2308.00400v2
- Date: Wed, 2 Aug 2023 13:53:30 GMT
- Title: ZRIGF: An Innovative Multimodal Framework for Zero-Resource
Image-Grounded Dialogue Generation
- Authors: Bo Zhang, Jian Wang, Hui Ma, Bo Xu, and Hongfei Lin
- Abstract summary: ZRIGF implements a two-stage learning strategy, comprising contrastive pre-training and generative pre-training.
Comprehensive experiments conducted on both text-based and image-grounded dialogue datasets demonstrate ZRIGF's efficacy in generating contextually pertinent and informative responses.
- Score: 17.310200022696016
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-grounded dialogue systems benefit greatly from integrating visual
information, resulting in high-quality response generation. However, current
models struggle to effectively utilize such information in zero-resource
scenarios, mainly due to the disparity between image and text modalities. To
overcome this challenge, we propose an innovative multimodal framework, called
ZRIGF, which assimilates image-grounded information for dialogue generation in
zero-resource situations. ZRIGF implements a two-stage learning strategy,
comprising contrastive pre-training and generative pre-training. Contrastive
pre-training includes a text-image matching module that maps images and texts
into a unified encoded vector space, along with a text-assisted masked image
modeling module that preserves pre-training visual features and fosters further
multimodal feature alignment. Generative pre-training employs a multimodal
fusion module and an information transfer module to produce insightful
responses based on harmonized multimodal representations. Comprehensive
experiments conducted on both text-based and image-grounded dialogue datasets
demonstrate ZRIGF's efficacy in generating contextually pertinent and
informative responses. Furthermore, we adopt a fully zero-resource scenario in
the image-grounded dialogue dataset to demonstrate our framework's robust
generalization capabilities in novel domains. The code is available at
https://github.com/zhangbo-nlp/ZRIGF.
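The text-image matching step described in the abstract maps paired texts and images into one vector space. A common way to train such a mapping is a symmetric contrastive (InfoNCE-style) loss, sketched below. This is a minimal illustration only: the abstract does not state the exact objective ZRIGF uses, and the function names, the temperature value, and the toy embeddings are all assumptions made for this example, not taken from the paper or its repository.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    top = max(row)
    log_sum = top + math.log(sum(math.exp(x - top) for x in row))
    return [x - log_sum for x in row]


def contrastive_matching_loss(text_vecs, image_vecs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Assumption: text_vecs[i] and image_vecs[i] describe the same example,
    and every other pairing in the batch is treated as a negative.
    """
    texts = [l2_normalize(v) for v in text_vecs]
    images = [l2_normalize(v) for v in image_vecs]
    n = len(texts)
    # sim[i][j]: temperature-scaled cosine similarity of text i and image j.
    sim = [
        [sum(a * b for a, b in zip(t, im)) / temperature for im in images]
        for t in texts
    ]
    # Text-to-image direction: text i should rank image i highest.
    t2i = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Image-to-text direction: image j should rank text j highest.
    i2t = -sum(
        log_softmax([sim[i][j] for i in range(n)])[j] for j in range(n)
    ) / n
    return (t2i + i2t) / 2


# Toy embeddings (hypothetical): correctly paired vs. deliberately swapped.
texts = [[1.0, 0.0], [0.0, 1.0]]
images = [[0.9, 0.1], [0.1, 0.9]]
print("aligned pairs:", contrastive_matching_loss(texts, images))
print("swapped pairs:", contrastive_matching_loss(texts, images[::-1]))
```

Correctly paired embeddings should give a lower loss than swapped ones, which is the signal that pulls matching texts and images together in the shared space.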
Related papers
- BI-MDRG: Bridging Image History in Multimodal Dialogue Response Generation [21.052101309555464]
Multimodal Dialogue Response Generation (MDRG) is a recently proposed task where the model needs to generate responses in texts, images, or a blend of both.
Previous work relies on the text modality as an intermediary step for both the image input and output of the model rather than adopting an end-to-end approach.
We propose BI-MDRG that bridges the response generation path such that the image history information is utilized for enhanced relevance of text responses to the image content.
arXiv Detail & Related papers (2024-08-12T05:22:42Z)
- Advanced Multimodal Deep Learning Architecture for Image-Text Matching [33.8315200009152]
Image-text matching is a key multimodal task that aims to model the semantic association between images and text as a matching relationship.
We introduce an advanced multimodal deep learning architecture, which combines the high-level abstract representation ability of deep neural networks for visual information with the advantages of natural language processing models for text semantic understanding.
Experiments show that compared with existing image-text matching models, the optimized new model has significantly improved performance on a series of benchmark data sets.
arXiv Detail & Related papers (2024-06-13T08:32:24Z)
- Unified Text-to-Image Generation and Retrieval [96.72318842152148]
We propose a unified framework in the context of Multimodal Large Language Models (MLLMs).
We first explore the intrinsic discriminative abilities of MLLMs and introduce a generative retrieval method to perform retrieval in a training-free manner.
We then unify generation and retrieval in an autoregressive generation way and propose an autonomous decision module to choose the best-matched one between generated and retrieved images.
arXiv Detail & Related papers (2024-06-09T15:00:28Z)
- TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding [91.30065932213758]
Large Multimodal Models (LMMs) have sparked a surge in research aimed at harnessing their remarkable reasoning abilities.
We propose TextCoT, a novel Chain-of-Thought framework for text-rich image understanding.
Our method is free of extra training, offering immediate plug-and-play functionality.
arXiv Detail & Related papers (2024-04-15T13:54:35Z)
- MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets [29.737965533532577]
Multimodal Augmented Generative Images Dialogues (MAGID) is a framework to augment text-only dialogues with diverse and high-quality images.
Our results show that MAGID is comparable to or better than baselines, with significant improvements in human evaluation.
arXiv Detail & Related papers (2024-03-05T18:31:28Z)
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM model that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
- UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion [36.06457895469353]
UNIMO-G is a conditional diffusion framework that operates on multimodal prompts with interleaved textual and visual inputs.
It excels in both text-to-image generation and zero-shot subject-driven synthesis.
arXiv Detail & Related papers (2024-01-24T11:36:44Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval based framework (MoRe).
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve related knowledge of the input text and image in the knowledge corpus respectively.
arXiv Detail & Related papers (2022-12-03T13:11:32Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.