Related papers: Generative Visual Communication in the Era of Vision-Language Models

URL: http://arxiv.org/abs/2411.18727v1
Date: Wed, 27 Nov 2024 20:04:31 GMT
Title: Generative Visual Communication in the Era of Vision-Language Models
Authors: Yael Vinker
Abstract summary: In today's visually saturated world, effective design demands an understanding of graphic design principles. This dissertation explores how recent advancements in vision-language models can be leveraged to automate the creation of effective visual communication designs.
Score: 9.229067992381763
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Visual communication, dating back to prehistoric cave paintings, is the use of visual elements to convey ideas and information. In today's visually saturated world, effective design demands an understanding of graphic design principles, visual storytelling, human psychology, and the ability to distill complex information into clear visuals. This dissertation explores how recent advancements in vision-language models (VLMs) can be leveraged to automate the creation of effective visual communication designs. Although generative models have made great progress in generating images from text, they still struggle to simplify complex ideas into clear, abstract visuals and are constrained by pixel-based outputs, which lack flexibility for many design tasks. To address these challenges, we constrain the models' operational space and introduce task-specific regularizations. We explore various aspects of visual communication, namely, sketches and visual abstraction, typography, animation, and visual inspiration.

Related papers

Chatting with Images for Introspective Visual Thinking [50.7747647794877]
"Chatting with images" is a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions. ViLaVT achieves strong and consistent improvements on complex multi-image and video-based spatial reasoning tasks.
arXiv Detail & Related papers (2026-02-11T17:42:37Z)
Visual Planning: Let's Think Only with Images [30.67065689757505]
We argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations, independent of text. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions.
arXiv Detail & Related papers (2025-05-16T16:17:22Z)
Natural Language Generation from Visual Events: Challenges and Future Directions [8.058451580903123]
We argue that any NLG task dealing with sequences of images or frames is an instance of the broader, more general problem of modeling the intricate relationships between visual events unfolding over time. We consider five seemingly different tasks, which we argue are compelling instances of this broader multimodal problem. We claim that improving language-and-vision models' understanding of visual events is both timely and essential, given their growing applications.
arXiv Detail & Related papers (2025-02-18T16:48:18Z)
What Makes a Maze Look Like a Maze? [92.80800000328277]
We introduce Deep Schema Grounding (DSG), a framework that leverages explicit structured representations of visual abstractions for grounding and reasoning. At the core of DSG are schemas--dependency graph descriptions of abstract concepts that decompose them into more primitive-level symbols. We show that DSG significantly improves the abstract visual reasoning performance of vision-language models.
arXiv Detail & Related papers (2024-09-12T16:41:47Z)
Using Left and Right Brains Together: Towards Vision and Language Planning [95.47128850991815]
We introduce a novel vision-language planning framework to perform concurrent visual and language planning for tasks with inputs of any form. We evaluate the effectiveness of our framework across vision-language tasks, vision-only tasks, and language-only tasks.
arXiv Detail & Related papers (2024-02-16T09:46:20Z)
A Vision Check-up for Language Models [61.852026871772914]
We show how a preliminary visual representation learning system can be trained using models of text. Experiments on self-supervised visual representation learning highlight the potential to train vision models capable of making semantic assessments of natural images.
arXiv Detail & Related papers (2024-01-03T18:09:33Z)
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding [55.65727739645824]
Chat-UniVi is a Unified Vision-language model capable of comprehending and engaging in conversations involving images and videos. We employ a set of dynamic visual tokens to uniformly represent images and videos. We leverage a multi-scale representation, enabling the model to perceive both high-level semantic concepts and low-level visual details.
arXiv Detail & Related papers (2023-11-14T10:11:36Z)
Text-to-Image Generation for Abstract Concepts [76.32278151607763]
We propose a framework of Text-to-Image generation for Abstract Concepts (TIAC). The abstract concept is clarified into a clear intent with a detailed definition to avoid ambiguity. The concept-dependent form is retrieved from an LLM-extracted form pattern set.
arXiv Detail & Related papers (2023-09-26T02:22:39Z)
Look, Remember and Reason: Grounded reasoning in videos with language models [5.3445140425713245]
Multi-modal language models (LM) have recently shown promising performance in high-level reasoning tasks on videos. We propose training an LM end-to-end on low-level surrogate tasks, including object detection, re-identification, and tracking, to endow the model with the required low-level visual capabilities. We demonstrate the effectiveness of our framework on diverse visual reasoning tasks from the ACRE, CATER, Something-Else and STAR datasets.
arXiv Detail & Related papers (2023-06-30T16:31:14Z)
Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models [24.456117679941816]
Contrastive Reading Model (Cream) is a novel neural architecture designed to enhance the language-image understanding capability of Large Language Models (LLMs). Our approach bridges the gap between vision and language understanding, paving the way for the development of more sophisticated Document Intelligence Assistants.
arXiv Detail & Related papers (2023-05-24T11:59:13Z)
Vision-Language Models in Remote Sensing: Current Progress and Future Trends [25.017685538386548]
Vision-language models enable reasoning about images and their associated textual descriptions, allowing for a deeper understanding of the underlying semantics. Vision-language models can go beyond visual recognition of RS images, model semantic relationships, and generate natural language descriptions of the image. This paper provides a comprehensive review of the research on vision-language models in remote sensing.
arXiv Detail & Related papers (2023-05-09T19:17:07Z)
GAMR: A Guided Attention Model for (visual) Reasoning [7.919213739992465]
Humans continue to outperform modern AI systems in their ability to flexibly parse and understand complex visual scenes. We present a novel module for visual reasoning, the Guided Attention Model for (visual) Reasoning (GAMR). GAMR posits that the brain solves complex visual reasoning problems dynamically via sequences of attention shifts to select and route task-relevant visual information into memory.
arXiv Detail & Related papers (2022-06-10T07:52:06Z)
K-LITE: Learning Transferable Visual Models with External Knowledge [242.3887854728843]
K-LITE (Knowledge-augmented Language-Image Training and Evaluation) is a strategy to leverage external knowledge to build transferable visual systems. In training, it enriches entities in natural language with WordNet and Wiktionary knowledge. In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts.
arXiv Detail & Related papers (2022-04-20T04:47:01Z)
Enabling Robots to Draw and Tell: Towards Visually Grounded Multimodal Description Generation [1.52292571922932]
Socially competent robots should be equipped with the ability to perceive the world that surrounds them and communicate about it in a human-like manner. Representative skills that exhibit such ability include generating image descriptions and visually grounded referring expressions. We propose to model the task of generating natural language together with free-hand sketches/hand gestures to describe visual scenes and real life objects.
arXiv Detail & Related papers (2021-01-14T23:40:23Z)

This list is automatically generated from the titles and abstracts of the papers on this site.