Knowledge Pursuit Prompting for Zero-Shot Multimodal Synthesis
- URL: http://arxiv.org/abs/2311.17898v2
- Date: Thu, 30 Nov 2023 18:59:01 GMT
- Title: Knowledge Pursuit Prompting for Zero-Shot Multimodal Synthesis
- Authors: Jinqi Luo, Kwan Ho Ryan Chan, Dimitris Dimos, René Vidal
- Abstract summary: Hallucinations and unfaithful synthesis due to inaccurate prompts with insufficient semantic details are widely observed in multimodal generative models.
We propose Knowledge Pursuit Prompting (KPP), a zero-shot framework that iteratively incorporates external knowledge to help generators produce reliable visual content.
KPP is capable of generating faithful and semantically rich content across diverse visual domains, offering a promising solution to improve multimodal generative models.
- Score: 6.215536001787723
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hallucinations and unfaithful synthesis due to inaccurate prompts with
insufficient semantic details are widely observed in multimodal generative
models. A prevalent strategy to align multiple modalities is to fine-tune the
generator with a large number of annotated text-image pairs. However, such a
procedure is labor-consuming and resource-draining. The key question we ask is:
can we enhance the quality and faithfulness of text-driven generative models
beyond extensive text-image pair annotations? To address this question, we
propose Knowledge Pursuit Prompting (KPP), a zero-shot framework that
iteratively incorporates external knowledge to help generators produce reliable
visual content. Instead of training generators to handle generic prompts, KPP
employs a recursive knowledge query process to gather informative external
facts from the knowledge base, instructs a language model to compress the
acquired knowledge for prompt refinement, and utilizes text-driven generators
for visual synthesis. The entire process is zero-shot, without accessing the
architectures and parameters of generative models. We evaluate the framework
across multiple text-driven generative tasks (image, 3D rendering, and video)
on datasets of different domains. We further demonstrate the extensibility and
adaptability of KPP through varying foundation model bases and instructions.
Our results show that KPP is capable of generating faithful and semantically
rich content across diverse visual domains, offering a promising solution to
improve multimodal generative models.
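The three stages named in the abstract (recursive knowledge query, knowledge compression, text-driven synthesis) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the authors' implementation: the function names (`pursue_knowledge`, `compress`, `kpp_sketch`) are hypothetical, word overlap stands in for whatever retrieval criterion the paper uses, simple truncation stands in for the language-model compression step, and the generator is any user-supplied callable that maps a prompt string to an output.

```python
from typing import Callable, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase bag of words with trailing punctuation stripped."""
    return {w.strip(".,").lower() for w in text.split()}


def pursue_knowledge(prompt: str, knowledge_base: List[str],
                     max_rounds: int = 3) -> List[str]:
    """Toy iterative retrieval (assumed criterion: word overlap).

    Each round selects the unused fact sharing the most words with the
    current context (prompt plus facts gathered so far), then adds that
    fact to the context before the next query.
    """
    context = tokenize(prompt)
    remaining = list(knowledge_base)
    chosen: List[str] = []
    for _ in range(max_rounds):
        scored = [(len(context & tokenize(fact)), fact) for fact in remaining]
        best_score, best_fact = max(scored, default=(0, ""))
        if best_score == 0:  # nothing relevant left in the knowledge base
            break
        chosen.append(best_fact)
        remaining.remove(best_fact)
        context |= tokenize(best_fact)
    return chosen


def compress(prompt: str, facts: List[str], max_words: int = 40) -> str:
    """Placeholder for the language-model compression step.

    Here the prompt and facts are concatenated and truncated; a real
    system would instruct a language model to rewrite them instead.
    """
    words = " ".join([prompt] + facts).split()
    return " ".join(words[:max_words])


def kpp_sketch(prompt: str, knowledge_base: List[str],
               generator: Callable[[str], str], max_rounds: int = 3) -> str:
    """Query, compress, then call the generator as a black box."""
    facts = pursue_knowledge(prompt, knowledge_base, max_rounds)
    refined_prompt = compress(prompt, facts)
    return generator(refined_prompt)
```

The generator is only ever called with a string, which mirrors the abstract's point that the process needs no access to the generator's architecture or parameters. For a quick check, `kpp_sketch("a photo of a kingfisher", facts, lambda p: p)` returns the refined prompt itself, so the effect of the retrieval and compression steps can be inspected before wiring in a real text-to-image, 3D, or video model.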
Related papers
- Multi-Prompts Learning with Cross-Modal Alignment for Attribute-based Person Re-Identification [18.01407937934588]
We present a new framework called Multi-Prompts ReID (MP-ReID) based on prompt learning and language models.
MP-ReID learns to hallucinate diverse, informative, and promptable sentences for describing the query images.
Explicit prompts are obtained by ensembling generation models, such as ChatGPT and VQA models.
arXiv Detail & Related papers (2023-12-28T03:00:19Z)
- Few-shot Action Recognition with Captioning Foundation Models [61.40271046233581]
CapFSAR is a framework to exploit knowledge of multimodal models without manually annotating text.
A Transformer-based visual-text aggregation module is further designed to incorporate cross-modal and temporal complementary information.
Experiments on multiple standard few-shot benchmarks demonstrate that the proposed CapFSAR performs favorably against existing methods.
arXiv Detail & Related papers (2023-10-16T07:08:39Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal Pre-trained Knowledge [44.31783230767321]
We propose a plug-and-play framework, i.e. CapEnrich, to complement the generic image descriptions with more semantic details.
Our method significantly improves the descriptiveness and diversity of generated sentences for web images.
arXiv Detail & Related papers (2022-11-17T06:55:49Z)
- An Overview on Controllable Text Generation via Variational Auto-Encoders [15.97186478109836]
Recent advances in neural-based generative modeling have reignited the hopes of having computer systems capable of conversing with humans.
Latent variable models (LVM) such as variational auto-encoders (VAEs) are designed to characterize the distributional pattern of textual data.
This overview introduces existing generation schemes and the problems associated with text variational auto-encoders, and reviews several applications of controllable generation.
arXiv Detail & Related papers (2022-11-15T07:36:11Z)
- MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text [58.655375327681774]
We propose the first Multimodal Retrieval-Augmented Transformer (MuRAG).
MuRAG accesses an external non-parametric multimodal memory to augment language generation.
Our results show that MuRAG achieves state-of-the-art accuracy, outperforming existing models by 10-20% absolute on both datasets.
arXiv Detail & Related papers (2022-10-06T13:58:03Z)
- Generate rather than Retrieve: Large Language Models are Strong Context Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer.
arXiv Detail & Related papers (2022-09-21T01:30:59Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- WAVPROMPT: Towards Few-Shot Spoken Language Understanding with Frozen Language Models [57.557319372969495]
Large-scale auto-regressive language models pretrained on massive text have demonstrated their impressive ability to perform new natural language tasks.
Recent studies further show that such a few-shot learning ability can be extended to the text-image setting by training an encoder to encode the images into embeddings.
We propose a novel speech understanding framework, WavPrompt, where we finetune a wav2vec model to generate a sequence of audio embeddings understood by the language model.
arXiv Detail & Related papers (2022-03-29T19:08:55Z)
- External Knowledge Augmented Text Visual Question Answering [0.6445605125467573]
We propose a framework to extract, filter, and encode knowledge atop a standard multimodal transformer for vision language understanding tasks.
We generate results comparable to the state-of-the-art on two publicly available datasets.
arXiv Detail & Related papers (2021-08-22T13:21:58Z)
- Knowledge Graph-Augmented Abstractive Summarization with Semantic-Driven Cloze Reward [42.925345819778656]
We present ASGARD, a novel framework for Abstractive Summarization with Graph-Augmentation and semantic-driven RewarD.
We propose the use of dual encoders (a sequential document encoder and a graph-structured encoder) to maintain the global context and local characteristics of entities.
Results show that our models produce significantly higher ROUGE scores than a variant without knowledge graph as input on both New York Times and CNN/Daily Mail datasets.
arXiv Detail & Related papers (2020-05-03T18:23:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.