ZeroGen: Zero-shot Multimodal Controllable Text Generation with Multiple
Oracles
- URL: http://arxiv.org/abs/2306.16649v1
- Date: Thu, 29 Jun 2023 03:22:43 GMT
- Title: ZeroGen: Zero-shot Multimodal Controllable Text Generation with Multiple
Oracles
- Authors: Haoqin Tu, Bowen Yang, Xianfeng Zhao
- Abstract summary: We propose a new paradigm of zero-shot controllable text generation with multimodal signals (ZeroGen).
ZeroGen leverages controls of text and image successively from token-level to sentence-level and maps them into a unified probability space at decoding.
We show that ZeroGen not only outperforms its counterparts on captioning tasks by a large margin but also shows great potential in multimodal news generation with a higher degree of control.
- Score: 29.460712493470453
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatically generating textual content with desired attributes is an
ambitious task that people have long pursued. Existing works have made steady
progress in incorporating unimodal controls into language models (LMs),
whereas how to generate controllable sentences with multimodal signals and high
efficiency remains an open question. To tackle this challenge, we propose a new
paradigm of zero-shot controllable text generation with multimodal signals
(\textsc{ZeroGen}). Specifically, \textsc{ZeroGen} leverages controls of text
and image successively from token-level to sentence-level and maps them into a
unified probability space at decoding, which customizes the LM outputs by
weighted addition without extra training. To achieve better inter-modal
trade-offs, we further introduce an effective dynamic weighting mechanism to
regulate all control weights. Moreover, we conduct extensive experiments to
probe the in-depth versus in-width relationship between signals from distinct
modalities. Encouraging empirical results on three downstream tasks
show that \textsc{ZeroGen} not only outperforms its counterparts on captioning
tasks by a large margin but also shows great potential in multimodal news
generation with a higher degree of control. Our code will be released at
https://github.com/ImKeTT/ZeroGen.
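What follows is a minimal sketch, in PyTorch, of the decoding-time idea described in the abstract: next-token scores from the base LM, a token-level text oracle, and a sentence-level image oracle are mapped into one log-probability space and combined by weighted addition, with a simple confidence-based dynamic weighting. The function names (fuse_controls, dynamic_weights) and the confidence inputs are illustrative assumptions, not the authors' released API.

```python
# Minimal sketch (not the authors' released code) of decoding-time control
# fusion by weighted addition in a unified log-probability space.
import torch
import torch.nn.functional as F


def dynamic_weights(text_conf: float, image_conf: float, base: float = 1.0):
    """Hypothetical dynamic weighting: split a total budget between the two
    controls in proportion to their confidences so neither modality dominates."""
    total = text_conf + image_conf + 1e-8
    return base * text_conf / total, base * image_conf / total


def fuse_controls(lm_logits: torch.Tensor,
                  text_scores: torch.Tensor,
                  image_scores: torch.Tensor,
                  w_text: float,
                  w_image: float) -> torch.Tensor:
    """Combine per-token scores from the base LM, a token-level text control,
    and a sentence-level image control (all of shape [vocab]) by weighted
    addition of their log-probabilities; no extra training is involved."""
    return (F.log_softmax(lm_logits, dim=-1)
            + w_text * F.log_softmax(text_scores, dim=-1)
            + w_image * F.log_softmax(image_scores, dim=-1))


# Usage at one decoding step (random tensors stand in for real model outputs):
vocab = 50257
lm_logits = torch.randn(vocab)      # from the frozen base LM
text_scores = torch.randn(vocab)    # e.g. keyword/topic relevance per token
image_scores = torch.randn(vocab)   # e.g. image-text similarity of candidate continuations
w_t, w_i = dynamic_weights(text_conf=0.7, image_conf=0.3)
probs = F.softmax(fuse_controls(lm_logits, text_scores, image_scores, w_t, w_i), dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
```

At each step the fused scores are renormalised by the softmax, so under these assumptions the controls only re-rank the LM's candidates rather than requiring any fine-tuning.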
Related papers
- AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation [24.07613591217345]
Linguistic control enables effective content creation, but struggles with fine-grained control over image generation.
AnyControl develops a novel Multi-Control framework that extracts a unified multi-modal embedding to guide the generation process.
This approach enables a holistic understanding of user inputs, and produces high-quality, faithful results under versatile control signals.
arXiv Detail & Related papers (2024-06-27T07:40:59Z)
- Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing [17.92378239787507]
We present a decoder-only Discrete Multimodal Language Model (DMLM).
DMLM can be flexibly applied to multiple tasks (ASR, T2S, S2TT, etc.) and modalities (text, speech, vision).
Our results show that DMLM benefits significantly, across multiple tasks and datasets, from a combination of supervised and unsupervised training.
arXiv Detail & Related papers (2024-06-04T20:08:25Z)
- From Text to Pixel: Advancing Long-Context Understanding in MLLMs [70.78454154014989]
We introduce SEEKER, a multimodal large language model designed to tackle this issue.
SEEKER aims to optimize the compact encoding of long text by compressing the text sequence into the visual pixel space via images.
Our experiments on six long-context multimodal tasks demonstrate that SEEKER can leverage fewer image tokens to convey the same amount of textual information compared with the OCR-based approach.
arXiv Detail & Related papers (2024-05-23T06:17:23Z)
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
- Prompt Highlighter: Interactive Control for Multi-Modal LLMs [50.830448437285355]
This study targets a critical aspect of multi-modal LLM (LLM & VLM) inference: explicit controllable text generation.
We introduce a novel inference method, Prompt Highlighter, which enables users to highlight specific prompt spans to interactively control the focus during generation.
We find that, during inference, guiding the models with highlighted tokens through the attention weights leads to more desired outputs (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2023-12-07T13:53:29Z)
- TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models [69.49978333446538]
TEAL is an approach to treat the input from any modality as a token sequence.
It embeds the token sequence into a joint embedding space with a learnable embedding matrix.
Experiments show that TEAL achieves substantial improvements in multi-modal understanding.
arXiv Detail & Related papers (2023-11-08T10:34:16Z)
- TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved multimodal instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z)
- Plug-and-Blend: A Framework for Controllable Story Generation with Blended Control Codes [11.053902512072813]
We describe a controllable language generation framework, Plug-and-Blend, that allows a human user to input multiple control codes (topics).
In the context of automated story generation, this gives a human user loose or fine-grained control over the topics and the transitions between them.
A human participant evaluation shows that the generated stories observably transition between two topics.
arXiv Detail & Related papers (2021-03-23T03:15:14Z)
- VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z)
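Referring back to the Prompt Highlighter entry above, the following is a generic, hedged sketch of attention-level guidance: a positive bias is added to the attention logits of user-highlighted key positions, steering generation toward the marked prompt span. The function highlighted_attention, the boost parameter, and the single-head setting are illustrative assumptions, not that paper's implementation.

```python
# Generic illustration of attention-guided prompt highlighting
# (an assumption-laden sketch, not the Prompt Highlighter implementation).
import torch
import torch.nn.functional as F


def highlighted_attention(q: torch.Tensor,
                          k: torch.Tensor,
                          v: torch.Tensor,
                          highlight_mask: torch.Tensor,
                          boost: float = 2.0) -> torch.Tensor:
    """Single-head scaled dot-product attention with a positive bias added to
    the attention logits of highlighted key positions (q, k, v: [seq, dim];
    highlight_mask: [seq] boolean marking the user-highlighted prompt span)."""
    scores = q @ k.t() / k.shape[-1] ** 0.5            # [seq, seq] attention logits
    scores = scores + boost * highlight_mask.float()   # bias toward highlighted keys
    return F.softmax(scores, dim=-1) @ v


# Usage with random tensors; positions 2..4 stand in for the highlighted span.
seq, dim = 8, 16
q, k, v = (torch.randn(seq, dim) for _ in range(3))
mask = torch.zeros(seq, dtype=torch.bool)
mask[2:5] = True
out = highlighted_attention(q, k, v, mask)             # [seq, dim]
```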
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.