Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic
- URL: http://arxiv.org/abs/2111.14447v1
- Date: Mon, 29 Nov 2021 11:01:49 GMT
- Title: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic
- Authors: Yoad Tewel, Yoav Shalev, Idan Schwartz, Lior Wolf
- Abstract summary: Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
- Score: 72.60554897161948
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent text-to-image matching models apply contrastive learning to large
corpora of uncurated pairs of images and sentences. While such models can
provide a powerful score for matching and subsequent zero-shot tasks, they are
not capable of generating a caption given an image. In this work, we repurpose
such models to generate a descriptive text given an image at inference time,
without any further training or tuning. This is done by combining the
visual-semantic model with a large language model, benefiting from the
knowledge in both web-scale models. The resulting captions are much less
restrictive than those obtained by supervised captioning methods. Moreover, as
a zero-shot learning method, it is extremely flexible, and we demonstrate its
ability to perform image arithmetic in which the inputs can be either images or
text and the output is a sentence. This enables novel high-level vision
capabilities such as comparing two images or solving visual analogy tests.
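The two pretrained models are coupled only at inference: the language model proposes text, and the visual-semantic model scores how well that text matches the image. The paper achieves this by steering the language model's cached context with CLIP gradients at every decoding step; the sketch below is a deliberately simplified variant that samples candidate captions from GPT-2 and re-ranks them by CLIP image-text similarity. It assumes Hugging Face checkpoints ("openai/clip-vit-base-patch32", "gpt2"); the prompt and file path are illustrative placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
lm = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
tok = GPT2Tokenizer.from_pretrained("gpt2")

image = Image.open("photo.jpg")  # placeholder path
prompt = "Image of a"            # short captioning prompt

# Sample diverse candidate continuations from the language model.
ids = tok(prompt, return_tensors="pt").input_ids.to(device)
with torch.no_grad():
    seqs = lm.generate(ids, do_sample=True, top_p=0.9, max_new_tokens=12,
                       num_return_sequences=32, pad_token_id=tok.eos_token_id)
captions = [tok.decode(s, skip_special_tokens=True) for s in seqs]

# Score every candidate against the image with CLIP; keep the best match.
batch = proc(text=captions, images=image, return_tensors="pt",
             padding=True, truncation=True).to(device)
with torch.no_grad():
    sims = clip(**batch).logits_per_image[0]  # one similarity per caption
print(captions[sims.argmax().item()])
```

Unlike the paper's gradient-based steering, re-ranking cannot invent tokens CLIP would prefer; it only filters what GPT-2 already proposes, which is why it is a sketch rather than a reproduction.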
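The image arithmetic works because CLIP embeds images and sentences in one shared space, so inputs can be combined with plain vector addition and subtraction and the result used as the target for generation. Below is a second self-contained sketch, using a leader-and-flag analogy of the kind shown in the paper; the file names and the hand-written candidates (which the language model would normally supply) are placeholders.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_image(path: str) -> torch.Tensor:
    """Unit-norm CLIP embedding of one image."""
    px = proc(images=Image.open(path), return_tensors="pt")
    with torch.no_grad():
        v = clip.get_image_features(**px)
    return F.normalize(v, dim=-1)

# "Obama is to the USA flag as ? is to the German flag", expressed as
# arithmetic in the shared embedding space (file names are placeholders).
target = F.normalize(
    embed_image("obama.jpg")
    - embed_image("usa_flag.jpg")
    + embed_image("germany_flag.jpg"),
    dim=-1,
)

candidates = ["Angela Merkel", "Barack Obama", "the Eiffel Tower"]
tx = proc(text=candidates, return_tensors="pt", padding=True)
with torch.no_grad():
    t = F.normalize(clip.get_text_features(**tx), dim=-1)
scores = (t @ target.T).squeeze(-1)        # cosine similarity to the target
print(candidates[scores.argmax().item()])  # best-matching candidate
```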
Related papers
- Conditional Text-to-Image Generation with Reference Guidance [81.99538302576302]
This paper explores conditioning diffusion models on additional reference images that provide visual guidance for the particular subjects to generate.
We develop several small-scale expert plugins that efficiently endow a Stable Diffusion model with the capability to take different references.
Our expert plugins achieve superior results to existing methods on all tasks, each containing only 28.55M trainable parameters.
arXiv Detail & Related papers (2024-11-22T21:38:51Z)
- See or Guess: Counterfactually Regularized Image Captioning [32.82695612178604]
We present a generic image captioning framework that employs causal inference to make existing models more capable of interventional tasks, and counterfactually explainable.
Our method effectively reduces hallucinations and improves the model's faithfulness to images, demonstrating high portability across both small-scale and large-scale image-to-text models.
arXiv Detail & Related papers (2024-08-29T17:59:57Z)
- Hypernymy Understanding Evaluation of Text-to-Image Models via WordNet Hierarchy [12.82992353036576]
We measure the capability of popular text-to-image models to understand hypernymy, or the "is-a" relation between words.
We show how our metrics can provide a better understanding of the individual strengths and weaknesses of popular text-to-image models.
arXiv Detail & Related papers (2023-10-13T16:53:25Z)
- ITI-GEN: Inclusive Text-to-Image Generation [56.72212367905351]
This study investigates inclusive text-to-image generative models that generate images based on human-written prompts.
We show that, for some attributes, images can represent concepts more expressively than text.
We propose a novel approach, ITI-GEN, that leverages readily available reference images for Inclusive Text-to-Image GENeration.
arXiv Detail & Related papers (2023-09-11T15:54:30Z)
- Discriminative Class Tokens for Text-to-Image Diffusion Models [107.98436819341592]
We propose a non-invasive fine-tuning technique that capitalizes on the expressive potential of free-form text.
Our method is fast compared to prior fine-tuning methods and does not require a collection of in-class images.
We evaluate our method extensively, showing that the generated images (i) are more accurate and of higher quality than those of standard diffusion models, (ii) can be used to augment training data in a low-resource setting, and (iii) reveal information about the data used to train the guiding classifier.
arXiv Detail & Related papers (2023-03-30T05:25:20Z)
- I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision [32.49636188029509]
We produce models using only text training data on four representative tasks.
We find these models perform close to models trained on images.
We showcase a variety of stylistic image captioning models that are trained using no image data and no human-curated language data.
arXiv Detail & Related papers (2022-11-17T18:52:19Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality (CLIP image representations and the scaling of language models) do not consistently improve multimodal self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Caption Enriched Samples for Improving Hateful Memes Detection [78.5136090997431]
The Hateful Memes challenge demonstrates the difficulty of determining whether a meme is hateful or not.
Neither unimodal language models nor multimodal vision-language models reach human-level performance.
arXiv Detail & Related papers (2021-09-22T10:57:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.