On Advances in Text Generation from Images Beyond Captioning: A Case
Study in Self-Rationalization
- URL: http://arxiv.org/abs/2205.11686v1
- Date: Tue, 24 May 2022 00:52:40 GMT
- Title: On Advances in Text Generation from Images Beyond Captioning: A Case
Study in Self-Rationalization
- Authors: Shruti Palaskar, Akshita Bhagia, Yonatan Bisk, Florian Metze, Alan W
Black and Ana Marasovic
- Abstract summary: We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
- Score: 89.94078728495423
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Integrating vision and language has gained notable attention following the
success of pretrained language models. Despite that, only a fraction of emerging
multimodal models is suitable for text generation conditioned on images. This
minority is typically developed and evaluated for image captioning, a text
generation task conditioned solely on images, with the goal of describing what is
explicitly visible in an image. In this paper, we take a step back and ask: How
do these models work for more complex generative tasks, conditioned on both
text and images? Are models based on joint multimodal pretraining, visually
adapted pretrained language models, or models that combine these two
approaches, more promising for such tasks? We address these questions in the
context of self-rationalization (jointly generating task labels/answers and
free-text explanations) of three tasks: (i) visual question answering in VQA-X,
(ii) visual commonsense reasoning in VCR, and (iii) visual-textual entailment
in E-SNLI-VE. We show that recent advances in each modality, CLIP image
representations and scaling of language models, do not consistently improve
multimodal self-rationalization of tasks with multimodal inputs. We also
observe that no model type works universally the best across tasks/datasets and
finetuning data sizes. Our findings call for a backbone modelling approach that
can be built on to advance text generation from images and text beyond image
captioning.
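To make the self-rationalization setup concrete, below is a minimal sketch of one way to condition an encoder-decoder language model on both an image and text and train it to generate a task label and a free-text explanation jointly. It assumes CLIP features prepended to a T5 encoder as a single projected visual token and an illustrative prompt/target format; these are assumptions for exposition, not the exact models or formats evaluated in the paper.

```python
# Hedged sketch: multimodal self-rationalization framed as text-to-text
# generation, with one CLIP "visual token" prepended to a T5 encoder.
# Model choices and the prompt/target format are illustrative assumptions.
import torch
from PIL import Image
from transformers import (CLIPModel, CLIPProcessor,
                          T5ForConditionalGeneration, T5TokenizerFast)

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
t5 = T5ForConditionalGeneration.from_pretrained("t5-base")
tok = T5TokenizerFast.from_pretrained("t5-base")

# Project the pooled CLIP image embedding (512-d) into T5's embedding space.
proj = torch.nn.Linear(clip.config.projection_dim, t5.config.d_model)

image = Image.new("RGB", (224, 224))           # placeholder image
question = "Is the man wearing a helmet?"
# Self-rationalization target: answer and free-text explanation, generated jointly.
target = "answer: yes explanation: he is riding a motorcycle on a highway."

with torch.no_grad():
    pixels = clip_proc(images=image, return_tensors="pt").pixel_values
    img_feat = clip.get_image_features(pixel_values=pixels)    # (1, 512)

enc = tok("vqa question: " + question, return_tensors="pt")
text_embeds = t5.get_input_embeddings()(enc.input_ids)         # (1, L, 768)
visual_token = proj(img_feat).unsqueeze(1)                     # (1, 1, 768)
inputs_embeds = torch.cat([visual_token, text_embeds], dim=1)
attn = torch.cat([torch.ones(1, 1, dtype=enc.attention_mask.dtype),
                  enc.attention_mask], dim=1)

labels = tok(target, return_tensors="pt").input_ids
labels[labels == tok.pad_token_id] = -100      # ignore padding in the loss

out = t5(inputs_embeds=inputs_embeds, attention_mask=attn, labels=labels)
out.loss.backward()                            # gradient for one fine-tuning step
```

At inference time the same model would decode the full "answer: ... explanation: ..." string, so the label and its rationale come from a single generation pass.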
Related papers
- Instruct-Imagen: Image Generation with Multi-modal Instruction [90.04481955523514]
instruct-imagen is a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks.
We introduce *multi-modal instruction* for image generation, a task representation articulating a range of generation intents with precision.
Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain.
arXiv Detail & Related papers (2024-01-03T19:31:58Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs (a minimal sketch of this quantization idea appears after this list).
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
- Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models that outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
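As a rough illustration of the language-quantization idea summarized in the Language Quantized AutoEncoders entry above, the sketch below maps stand-in image patch features to their nearest neighbors in a frozen BERT embedding table, so an image is represented as a sequence of text tokens. The patch features, similarity measure, and model choice are assumptions for illustration; the full LQAE method additionally trains an image encoder and decoder around this quantization step.

```python
# Hedged sketch of the core language-quantization idea: map image feature
# vectors to their nearest tokens in a frozen language model's embedding
# table, so an image becomes a sequence of text tokens. The encoder producing
# the patch features and the choice of BERT are illustrative assumptions.
import torch
from transformers import BertModel, BertTokenizerFast

bert = BertModel.from_pretrained("bert-base-uncased")
tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
token_table = bert.get_input_embeddings().weight            # (vocab, 768)

# Stand-in for patch features from some image encoder (hypothetical values).
patch_feats = torch.randn(49, 768)                          # 7x7 grid of patches

# Quantize each patch to its nearest token embedding (cosine similarity).
patches = torch.nn.functional.normalize(patch_feats, dim=-1)
table = torch.nn.functional.normalize(token_table, dim=-1)
token_ids = (patches @ table.T).argmax(dim=-1)              # (49,)

# The image is now expressed in BERT's vocabulary; similar images should map
# to similar token clusters, without any aligned text-image supervision.
print(tok.convert_ids_to_tokens(token_ids.tolist())[:10])
```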
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including all listed details) and is not responsible for any consequences of its use.