Impression-CLIP: Contrastive Shape-Impression Embedding for Fonts
- URL: http://arxiv.org/abs/2402.16350v1
- Date: Mon, 26 Feb 2024 07:07:18 GMT
- Title: Impression-CLIP: Contrastive Shape-Impression Embedding for Fonts
- Authors: Yugo Kubota, Daichi Haraguchi, Seiichi Uchida
- Abstract summary: We propose Impression-CLIP, a novel machine-learning model based on CLIP (Contrastive Language-Image Pre-training).
In our experiment, we perform cross-modal retrieval between fonts and impressions through co-embedding.
The results indicate that Impression-CLIP achieves better retrieval accuracy than the state-of-the-art method.
- Score: 7.542892664684078
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fonts convey different impressions to readers. These impressions often come
from the font shapes. However, the correlation between fonts and their
impressions is weak and unstable because impressions are subjective. To capture
such weak and unstable cross-modal correlation between font shapes and their
impressions, we propose Impression-CLIP, a novel machine-learning model based
on CLIP (Contrastive Language-Image Pre-training). With this CLIP-based model,
font image features and their impression features are pulled closer, while font
image features and unrelated impression features are pushed apart. This
procedure realizes co-embedding of font images and their impressions. In our
experiment, we perform cross-modal retrieval between fonts and impressions
through co-embedding. The results indicate that Impression-CLIP achieves better
retrieval accuracy than the state-of-the-art method. Additionally, our model
shows robustness to noise and missing tags.
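The co-embedding described above can be illustrated with a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) objective over paired font images and impression tags. The encoder architectures, tag vocabulary size, embedding dimension, and temperature below are illustrative assumptions rather than the paper's actual configuration; the loss simply pulls matched font-impression pairs together and pushes mismatched pairs in the batch apart, and the final lines show cosine-similarity retrieval in the shared space.

```python
# Minimal sketch of contrastive shape-impression co-embedding (assumed setup,
# not the paper's exact architecture or hyperparameters).
import torch
import torch.nn as nn
import torch.nn.functional as F


class FontEncoder(nn.Module):
    """Toy CNN mapping a 1x64x64 font glyph image to a d-dim embedding (hypothetical)."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, x):
        return self.net(x)


class ImpressionEncoder(nn.Module):
    """Toy bag-of-tags encoder: averages embeddings of impression-word ids (hypothetical)."""
    def __init__(self, vocab_size=1000, dim=128):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, dim, mode="mean")
        self.proj = nn.Linear(dim, dim)

    def forward(self, tag_ids, offsets):
        return self.proj(self.emb(tag_ids, offsets))


def clip_loss(font_emb, imp_emb, temperature=0.07):
    """Symmetric InfoNCE: matched font/impression pairs are pulled closer,
    mismatched pairs within the batch are pushed apart."""
    f = F.normalize(font_emb, dim=-1)
    t = F.normalize(imp_emb, dim=-1)
    logits = f @ t.t() / temperature          # (B, B) cosine-similarity logits
    labels = torch.arange(f.size(0), device=f.device)
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2


if __name__ == "__main__":
    font_enc, imp_enc = FontEncoder(), ImpressionEncoder()
    images = torch.randn(8, 1, 64, 64)            # batch of 8 font glyph images
    tag_ids = torch.randint(0, 1000, (24,))       # 3 impression-tag ids per font
    offsets = torch.arange(0, 24, 3)              # bag boundaries for EmbeddingBag
    loss = clip_loss(font_enc(images), imp_enc(tag_ids, offsets))
    loss.backward()                               # one contrastive training step

    # Cross-modal retrieval: rank impression embeddings by similarity to a query font.
    with torch.no_grad():
        f = F.normalize(font_enc(images), dim=-1)
        t = F.normalize(imp_enc(tag_ids, offsets), dim=-1)
        ranking = (f[0] @ t.t()).argsort(descending=True)  # best-matching impressions first
```

The same nearest-neighbor ranking works in the opposite direction (impression-to-font retrieval) by querying with an impression embedding against the bank of font embeddings.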
Related papers
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z) - Khattat: Enhancing Readability and Concept Representation of Semantic Typography [0.3994968615706021]
Semantic typography involves selecting an idea, choosing an appropriate font, and balancing creativity with readability.
We introduce an end-to-end system that automates this process.
A key feature is our OCR-based loss function, which enhances readability and enables simultaneous stylization of multiple characters.
arXiv Detail & Related papers (2024-10-01T18:42:48Z) - Finetuning CLIP to Reason about Pairwise Differences [52.028073305958074]
We propose an approach to train vision-language models such as CLIP in a contrastive manner to reason about differences in embedding space.
We first demonstrate that our approach yields significantly improved capabilities in ranking images by a certain attribute.
We also illustrate that the resulting embeddings exhibit stronger geometric structure in the embedding space.
arXiv Detail & Related papers (2024-09-15T13:02:14Z) - GRIF-DM: Generation of Rich Impression Fonts using Diffusion Models [18.15911470339845]
We introduce a diffusion-based method, termed GRIF-DM, to generate fonts that vividly embody specific impressions.
Our experimental results, conducted on the MyFonts dataset, affirm that this method is capable of producing realistic, vibrant, and high-fidelity fonts.
arXiv Detail & Related papers (2024-08-14T02:26:46Z) - Font Impression Estimation in the Wild [7.542892664684078]
We use a font dataset with impression annotations and a convolutional neural network (CNN) framework for this task.
We propose an exemplar-based impression estimation approach, which relies on a strategy of ensembling impressions of exemplar fonts that are similar to the input image.
We conduct a correlation analysis between book genres and font impressions on real book cover images.
arXiv Detail & Related papers (2024-02-23T10:00:25Z) - Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z) - VQ-Font: Few-Shot Font Generation with Structure-Aware Enhancement and Quantization [52.870638830417]
We propose a VQGAN-based framework (i.e., VQ-Font) to enhance glyph fidelity through token prior refinement and structure-aware enhancement.
Specifically, we pre-train a VQGAN to encapsulate the font token prior within a codebook. Subsequently, VQ-Font refines the synthesized glyphs with the codebook to eliminate the domain gap between synthesized and real-world strokes.
arXiv Detail & Related papers (2023-08-27T06:32:20Z) - DGFont++: Robust Deformable Generative Networks for Unsupervised Font Generation [19.473023811252116]
We propose a robust deformable generative network for unsupervised font generation (abbreviated as DGFont++).
To distinguish different styles, we train our model with a multi-task discriminator, which ensures that each style can be discriminated independently.
Experiments demonstrate that our model is able to generate character images of higher quality than state-of-the-art methods.
arXiv Detail & Related papers (2022-12-30T14:35:10Z) - Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
arXiv Detail & Related papers (2021-11-24T19:00:05Z) - Scalable Font Reconstruction with Dual Latent Manifolds [55.29525824849242]
We propose a deep generative model that performs typography analysis and font reconstruction.
Our approach enables us to massively scale up the number of character types we can effectively model.
We evaluate on the task of font reconstruction over various datasets representing character types of many languages.
arXiv Detail & Related papers (2021-09-10T20:37:43Z) - Shared Latent Space of Font Shapes and Impressions [9.205278113241473]
We realize a shared latent space where a font shape image and its impression words are embedded in a cross-modal manner.
This latent space is useful to understand the style-impression correlation and generate font images by specifying several impression words.
arXiv Detail & Related papers (2021-03-23T06:54:45Z)