Visual Conceptual Blending with Large-scale Language and Vision Models
- URL: http://arxiv.org/abs/2106.14127v1
- Date: Sun, 27 Jun 2021 02:48:39 GMT
- Title: Visual Conceptual Blending with Large-scale Language and Vision Models
- Authors: Songwei Ge and Devi Parikh
- Abstract summary: Given an arbitrary object, we identify a relevant object and generate a single-sentence description of the blend of the two using a language model.
We then generate a visual depiction of the blend using a text-based image generation model.
- Score: 54.251383721475655
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We ask the question: to what extent can recent large-scale language and image
generation models blend visual concepts? Given an arbitrary object, we identify
a relevant object and generate a single-sentence description of the blend of
the two using a language model. We then generate a visual depiction of the
blend using a text-based image generation model. Quantitative and qualitative
evaluations demonstrate the superiority of language models over classical
methods for conceptual blending, and of recent large-scale image generation
models over prior models for the visual depiction.
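As a rough illustration of the two-stage pipeline described in the abstract, the sketch below pairs an off-the-shelf language model with an off-the-shelf text-to-image model. The specific models (GPT-2, Stable Diffusion), the prompt wording, and the example object pair are stand-ins chosen for this sketch, not choices taken from the paper, and the paper's automatic related-object selection step is omitted.

```python
# Illustrative sketch only: GPT-2 and Stable Diffusion are stand-ins for the
# models used in the paper; the prompt wording is made up for this example.
import torch
from transformers import pipeline
from diffusers import StableDiffusionPipeline

def describe_blend(seed_object: str, related_object: str) -> str:
    """Ask a language model for a one-sentence description of the blend."""
    lm = pipeline("text-generation", model="gpt2")
    prompt = (f"Describe in one sentence an object that blends "
              f"a {seed_object} with a {related_object}:")
    text = lm(prompt, max_new_tokens=40, do_sample=True)[0]["generated_text"]
    continuation = text[len(prompt):]          # keep only the model's continuation
    return continuation.split(".")[0].strip() + "."

def depict_blend(description: str):
    """Render the blend description with a text-to-image model."""
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    return pipe(description).images[0]

# The paper also identifies the related object automatically; here it is given.
sentence = describe_blend("teapot", "snail")
depict_blend(sentence).save("blend.png")
```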
Related papers
- Elucidating the design space of language models for image generation [13.96798987912677]
We show that image tokens exhibit greater randomness compared to text tokens, which presents challenges when training with token prediction.
Our analysis also reveals that while all models successfully grasp the importance of local information in image generation, smaller models struggle to capture the global context.
Our work is the first to analyze the optimization behavior of language models in vision generation, and we believe it can inspire more effective designs when applying LMs to other domains.
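One minimal way to probe the token-randomness claim, assuming tokenized text and discrete image-token sequences are already available, is to compare their empirical unigram entropies; the snippet below is an illustrative sketch, not the paper's analysis.

```python
# Sketch (not from the paper): compare the empirical unigram entropy of text
# tokens vs. discrete image tokens as a crude proxy for token "randomness".
import math
from collections import Counter

def unigram_entropy(tokens):
    """Shannon entropy (bits/token) of the empirical token distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# `text_tokens` and `image_tokens` are assumed to be lists of integer ids,
# e.g. from a BPE tokenizer and a VQ image tokenizer; placeholder data here.
text_tokens = [3, 7, 7, 12, 3, 99, 7]
image_tokens = [501, 88, 903, 14, 250, 777, 6]
print("text:", unigram_entropy(text_tokens), "bits/token")
print("image:", unigram_entropy(image_tokens), "bits/token")
```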
arXiv Detail & Related papers (2024-10-21T17:57:04Z) - OCC-MLLM:Empowering Multimodal Large Language Model For the Understanding of Occluded Objects [2.850097504458451]
We introduce a novel multimodal model that applies a newly designed visual encoder to understand occluded objects in RGB images.
We also introduce a large-scale visual-language pair dataset for training large-scale visual-language multimodal models.
arXiv Detail & Related papers (2024-10-02T06:14:49Z) - Reading Is Believing: Revisiting Language Bottleneck Models for Image Classification [4.1205832766381985]
We revisit language bottleneck models as an approach to ensuring the explainability of deep learning models for image classification.
We experimentally show that a language bottleneck model that combines a modern image captioner with a pre-trained language model can achieve image classification accuracy that exceeds that of black-box models.
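A language bottleneck classifier of this kind can be sketched with off-the-shelf components: reduce the image to a caption, then classify only the caption with a text model. The model names below are illustrative choices, not necessarily those used in the paper.

```python
# Sketch of a language-bottleneck classifier: only the caption (text) reaches
# the classifier, which keeps the decision path human-readable.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def bottleneck_classify(image_path: str, labels: list[str]) -> str:
    caption = captioner(image_path)[0]["generated_text"]  # the "bottleneck"
    result = classifier(caption, candidate_labels=labels)
    return result["labels"][0]  # highest-scoring label

print(bottleneck_classify("photo.jpg", ["dog", "cat", "bird"]))
```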
arXiv Detail & Related papers (2024-06-22T10:49:34Z) - Bridging Different Language Models and Generative Vision Models for
Text-to-Image Generation [12.024554708901514]
We propose LaVi-Bridge, a pipeline that enables the integration of diverse pre-trained language models and generative vision models for text-to-image generation.
Our pipeline is compatible with various language models and generative vision models, accommodating different structures.
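The sketch below illustrates the general bridging idea with a small trainable adapter that maps language-model hidden states to the context width a vision generator's cross-attention expects; it is a generic illustration under that assumption, not the LaVi-Bridge implementation.

```python
# Generic bridge sketch (not LaVi-Bridge itself): project hidden states of an
# arbitrary frozen language model into a vision generator's context dimension.
import torch
import torch.nn as nn

class TextBridge(nn.Module):
    def __init__(self, lm_dim: int, vision_ctx_dim: int):
        super().__init__()
        # Small trainable adapter; the frozen LM and vision model stay untouched.
        self.proj = nn.Sequential(
            nn.Linear(lm_dim, vision_ctx_dim),
            nn.GELU(),
            nn.Linear(vision_ctx_dim, vision_ctx_dim),
        )

    def forward(self, lm_hidden: torch.Tensor) -> torch.Tensor:
        # lm_hidden: (batch, seq_len, lm_dim) from any text encoder.
        return self.proj(lm_hidden)  # (batch, seq_len, vision_ctx_dim)

# Example: adapt 4096-dim LM features to a 768-dim cross-attention context.
bridge = TextBridge(lm_dim=4096, vision_ctx_dim=768)
print(bridge(torch.randn(2, 77, 4096)).shape)  # torch.Size([2, 77, 768])
```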
arXiv Detail & Related papers (2024-03-12T17:50:11Z) - A Vision Check-up for Language Models [61.852026871772914]
We show how a preliminary visual representation learning system can be trained using models of text.
Experiments on self-supervised visual representation learning highlight the potential to train vision models capable of making semantic assessments of natural images.
arXiv Detail & Related papers (2024-01-03T18:09:33Z) - Multilingual Conceptual Coverage in Text-to-Image Models [98.80343331645626]
"Conceptual Coverage Across Languages" (CoCo-CroLa) is a technique for benchmarking the degree to which any generative text-to-image system provides multilingual parity to its training language in terms of tangible nouns.
For each model we can assess "conceptual coverage" of a given target language relative to a source language by comparing the population of images generated for a series of tangible nouns in the source language to the population of images generated for each noun under translation in the target language.
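A simplified version of this comparison can be sketched as follows, assuming the image populations have already been generated for each noun and using CLIP image embeddings as a stand-in representation; the benchmark's actual scoring may differ.

```python
# Sketch: score how well a concept survives translation by comparing the two
# generated image populations in CLIP embedding space.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def coverage_score(source_paths, target_paths) -> float:
    """Mean cosine similarity between the two image populations; a low score
    suggests the concept degrades under translation."""
    src, tgt = embed(source_paths), embed(target_paths)
    return (src @ tgt.T).mean().item()

# Hypothetical file names for one noun generated in two languages:
# coverage_score(["en_dog_0.png", "en_dog_1.png"], ["es_perro_0.png", "es_perro_1.png"])
```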
arXiv Detail & Related papers (2023-06-02T17:59:09Z) - Visual Clues: Bridging Vision and Language Foundations for Image
Paragraph Captioning [78.07495777674747]
We argue that by using visual clues to bridge large pretrained vision foundation models and language models, image paragraph captioning can be done without any extra cross-modal training.
Thanks to the strong zero-shot capability of foundation models, we start by constructing a rich semantic representation of the image.
We then use a large language model to produce a series of comprehensive descriptions of the visual content, which are verified by the vision model to select the candidate that aligns best with the image.
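The generate-then-verify loop might be sketched as below, with placeholder functions for clue extraction and LLM candidate generation, and CLIP standing in as the verifier; none of these component choices are taken from the paper.

```python
# Sketch of describe-then-verify: an LLM proposes candidate paragraphs from
# extracted visual clues, and a CLIP-style model picks the best match.
# `extract_clues` and `propose_paragraphs` are placeholders, not real APIs.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_best(image_path: str, candidates: list[str]) -> str:
    image = Image.open(image_path).convert("RGB")
    # Note: CLIP's text encoder truncates long paragraphs to 77 tokens.
    inputs = processor(text=candidates, images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (1, num_candidates)
    return candidates[logits.argmax().item()]

# clues = extract_clues(image)            # tags/objects/captions (placeholder)
# candidates = propose_paragraphs(clues)  # LLM generations (placeholder)
# best = select_best("photo.jpg", candidates)
```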
arXiv Detail & Related papers (2022-06-03T22:33:09Z) - On Advances in Text Generation from Images Beyond Captioning: A Case
Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, namely CLIP image representations and the scaling of language models, do not consistently improve multimodal self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z) - DALL-Eval: Probing the Reasoning Skills and Social Biases of
Text-to-Image Generation Models [73.12069620086311]
We investigate the visual reasoning capabilities and social biases of text-to-image models.
First, we measure three visual reasoning skills: object recognition, object counting, and spatial relation understanding.
Second, we assess the gender and skin tone biases by measuring the gender/skin tone distribution of generated images.
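A simplified stand-in for the counting probe could look like the sketch below: generate images for a prompt that names a count, detect objects with an off-the-shelf detector, and score how often the detected count matches. The models, threshold, and prompt template are illustrative, not the paper's protocol.

```python
# Illustrative counting probe: models, threshold, and prompt are stand-ins.
import torch
from transformers import pipeline
from diffusers import StableDiffusionPipeline

detector = pipeline("object-detection", model="facebook/detr-resnet-50")
generator = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def counting_accuracy(obj: str, target: int, trials: int = 8) -> float:
    prompt = f"a photo of {target} {obj}s"  # naive pluralization for the sketch
    hits = 0
    for _ in range(trials):
        image = generator(prompt).images[0]
        detections = detector(image)
        # Count confident detections whose label matches the prompted object.
        count = sum(1 for d in detections if d["label"] == obj and d["score"] > 0.7)
        hits += int(count == target)
    return hits / trials

print(counting_accuracy("apple", 3))
```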
arXiv Detail & Related papers (2022-02-08T18:36:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.