Texture or Semantics? Vision-Language Models Get Lost in Font Recognition
- URL: http://arxiv.org/abs/2503.23768v1
- Date: Mon, 31 Mar 2025 06:33:21 GMT
- Title: Texture or Semantics? Vision-Language Models Get Lost in Font Recognition
- Authors: Zhecheng Li, Guoxian Song, Yujun Cai, Zhen Xiong, Junsong Yuan, Yiwei Wang
- Abstract summary: We introduce the Font Recognition Benchmark (FRB), a compact and well-structured dataset comprising 15 commonly used fonts. FRB includes two versions: (i) an easy version, where 10 sentences are rendered in different fonts, and (ii) a hard version, where each text sample consists of the names of the 15 fonts themselves. We find that current VLMs exhibit limited font recognition capabilities, with many state-of-the-art models failing to achieve satisfactory performance.
- Score: 48.856390495568114
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern Vision-Language Models (VLMs) exhibit remarkable visual and linguistic capabilities, achieving impressive performance in various tasks such as image recognition and object localization. However, their effectiveness in fine-grained tasks remains an open question. In everyday scenarios, individuals encountering design materials, such as magazines, typography tutorials, research papers, or branding content, may wish to identify aesthetically pleasing fonts used in the text. Given their multimodal capabilities and free accessibility, many VLMs are often considered potential tools for font recognition. This raises a fundamental question: Do VLMs truly possess the capability to recognize fonts? To investigate this, we introduce the Font Recognition Benchmark (FRB), a compact and well-structured dataset comprising 15 commonly used fonts. FRB includes two versions: (i) an easy version, where 10 sentences are rendered in different fonts, and (ii) a hard version, where each text sample consists of the names of the 15 fonts themselves, introducing a Stroop effect that challenges model perception. Through extensive evaluation of various VLMs on font recognition tasks, we arrive at the following key findings: (i) Current VLMs exhibit limited font recognition capabilities, with many state-of-the-art models failing to achieve satisfactory performance. (ii) Few-shot learning and Chain-of-Thought (CoT) prompting provide minimal benefits in improving font recognition accuracy across different VLMs. (iii) Attention analysis sheds light on the inherent limitations of VLMs in capturing semantic features.
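As a rough illustration of how such benchmark samples can be assembled, the sketch below renders an easy-style and a hard-style (Stroop-conflict) image with Pillow; the font names, file paths, and image sizes are placeholders, not the paper's actual configuration.

```python
# Minimal sketch of FRB-style sample rendering (the paper's exact pipeline is not
# reproduced here); the font list and layout parameters below are assumptions.
from PIL import Image, ImageDraw, ImageFont

FONTS = {                      # hypothetical mapping: font name -> local .ttf path
    "Arial": "fonts/arial.ttf",
    "Times New Roman": "fonts/times.ttf",
    "Courier New": "fonts/cour.ttf",
}

def render_sample(text: str, font_path: str, size: int = 36) -> Image.Image:
    """Render `text` in the given font on a white canvas."""
    font = ImageFont.truetype(font_path, size)
    img = Image.new("RGB", (1024, 96), "white")
    ImageDraw.Draw(img).text((16, 24), text, font=font, fill="black")
    return img

# Easy version: a fixed sentence rendered in each font.
for name, path in FONTS.items():
    render_sample("The quick brown fox jumps over the lazy dog.", path).save(f"easy_{name}.png")

# Hard version: the rendered text is itself a (possibly mismatching) font name,
# creating a Stroop-style conflict between what the text says and how it looks.
for name, path in FONTS.items():
    for written_name in FONTS:
        render_sample(written_name, path).save(f"hard_{name}_says_{written_name}.png")
```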
Related papers
- TextInVision: Text and Prompt Complexity Driven Visual Text Generation Benchmark [61.412934963260724]
Existing diffusion-based text-to-image models often struggle to accurately embed text within images.
We introduce TextInVision, a large-scale, text and prompt complexity driven benchmark to evaluate the ability of diffusion models to integrate visual text into images.
arXiv Detail & Related papers (2025-03-17T21:36:31Z) - One-Shot Multilingual Font Generation Via ViT [2.023301270280465]
Font design poses unique challenges for logographic languages like Chinese, Japanese, and Korean.
This paper introduces a novel Vision Transformer (ViT)-based model for multi-language font generation.
arXiv Detail & Related papers (2024-12-15T23:52:35Z) - Visual Perception in Text Strings [24.60102607739684]
In this work, we select ASCII art as a representative artifact, where the lines and brightness used to depict each concept are rendered by characters.
We benchmark model performance on this task by constructing an evaluation dataset and also collect a training set to elicit the models' visual perception ability.
Results reveal that although humans can achieve nearly 100% accuracy, the state-of-the-art LLMs and MLLMs lag far behind.
arXiv Detail & Related papers (2024-10-02T16:46:01Z) - Attention Prompting on Image for Large Vision-Language Models [63.794304207664176]
We propose a new prompting technique named Attention Prompting on Image.
We generate an attention heatmap for the input image, conditioned on the text query, using an auxiliary model such as CLIP.
Experiments on various vision-language benchmarks verify the effectiveness of our technique.
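A minimal sketch of the query-conditioned heatmap step, using Hugging Face's CLIP implementation; the checkpoint, the patch-level use of the visual projection, and the normalization are assumptions rather than the paper's exact recipe.

```python
# Rough text-query-dependent attention heatmap with CLIP; not the paper's exact method.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def attention_heatmap(image: Image.Image, query: str) -> torch.Tensor:
    inputs = processor(text=[query], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
        vision_out = model.vision_model(pixel_values=inputs["pixel_values"])
        patch_tokens = vision_out.last_hidden_state[:, 1:, :]   # drop the CLS token
        # Project patch tokens into the joint space (CLIP trains this projection on the
        # pooled token, so applying it per patch is an approximation).
        patch_emb = model.visual_projection(patch_tokens)
    sim = torch.nn.functional.cosine_similarity(patch_emb, text_emb[:, None, :], dim=-1)
    side = int(sim.shape[1] ** 0.5)                             # 7x7 grid for patch32 at 224px
    heat = sim.reshape(1, 1, side, side)
    heat = torch.nn.functional.interpolate(heat, size=image.size[::-1], mode="bilinear")
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-6)  # normalize to [0, 1]
    return heat[0, 0]

# The heatmap can then be blended onto the original image before passing it to a VLM.
```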
arXiv Detail & Related papers (2024-09-25T17:59:13Z) - StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception, Comprehension, and Beyond [68.0107158115377]
We have crafted an efficient vision-language model, StrucTexTv3, tailored to tackle various intelligent tasks for text-rich images.
We enhance the perception and comprehension abilities of StrucTexTv3 through instruction learning.
Our method achieved SOTA results in text-rich image perception tasks, and significantly improved performance in comprehension tasks.
arXiv Detail & Related papers (2024-05-31T16:55:04Z) - FontCLIP: A Semantic Typography Visual-Language Model for Multilingual Font Applications [27.609008096617057]
FontCLIP is a model that connects the semantic understanding of a large vision-language model with typographical knowledge.
We integrate typography-specific knowledge into the comprehensive vision-language knowledge of a pretrained CLIP model.
FontCLIP's dual-modality and generalization abilities enable multilingual and cross-lingual font retrieval and letter shape optimization.
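For illustration only, the snippet below shows the kind of attribute-to-font retrieval such a joint embedding enables, using a plain CLIP checkpoint as a stand-in for FontCLIP's fine-tuned weights.

```python
# Illustrative font-retrieval interface in a CLIP-style joint space; FontCLIP's own
# weights are not used here, so treat the model name and attribute prompt as placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_fonts(attribute: str, glyph_images: list[Image.Image]) -> torch.Tensor:
    """Score rendered glyph sheets against a typographic attribute, e.g. 'an elegant serif font'."""
    inputs = processor(text=[attribute], images=glyph_images, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).squeeze(-1)   # higher score = better match to the attribute
```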
arXiv Detail & Related papers (2024-03-11T06:08:16Z) - Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models [56.76307866160105]
We propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo).
DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and align them to the visual features generated by the vision encoder of Large Visual-Language Models (LVLMs).
We demonstrate that the proposed DoCo serves as a plug-and-play pre-training method, which can be employed in the pre-training of various LVLMs without inducing any increase in computational complexity during the inference process.
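A generic symmetric InfoNCE loss along these lines is sketched below; the auxiliary encoder that produces the document-object features is abstracted away, and the temperature is an assumed default, not DoCo's exact objective.

```python
# Minimal stand-in for a DoCo-style alignment objective between paired feature sets.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(vision_feats: torch.Tensor,
                               object_feats: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """vision_feats, object_feats: [batch, dim], paired row by row."""
    v = F.normalize(vision_feats, dim=-1)
    o = F.normalize(object_feats, dim=-1)
    logits = v @ o.T / temperature                      # similarity of every pair in the batch
    targets = torch.arange(v.size(0), device=v.device)  # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```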
arXiv Detail & Related papers (2024-02-29T10:17:27Z) - Representing Online Handwriting for Recognition in Large Vision-Language Models [8.344510330567495]
We propose a novel tokenized representation of digital ink (online handwriting) that encodes a time-ordered sequence of strokes both as text and as an image.
We show that this representation yields results comparable to or better than state-of-the-art online handwriting recognizers.
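The toy function below illustrates one plausible stroke-to-text discretization in that spirit; the bin count and token format are assumptions, not the paper's actual tokenizer.

```python
# Toy discretization of online handwriting into text tokens; illustrative only.
def ink_to_tokens(strokes: list[list[tuple[float, float]]], bins: int = 128) -> str:
    """strokes: list of strokes, each a time-ordered list of (x, y) points in [0, 1]."""
    tokens = []
    for stroke in strokes:
        tokens.append("<stroke>")
        for x, y in stroke:
            xi = min(int(x * bins), bins - 1)   # quantize coordinates onto a bins x bins grid
            yi = min(int(y * bins), bins - 1)
            tokens.append(f"{xi},{yi}")
        tokens.append("</stroke>")
    return " ".join(tokens)

# Example: a single two-point stroke
print(ink_to_tokens([[(0.1, 0.2), (0.4, 0.25)]]))  # "<stroke> 12,25 51,32 </stroke>"
```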
arXiv Detail & Related papers (2024-02-23T13:11:10Z) - OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z) - Visually-augmented pretrained language models for NLP tasks without images [77.74849855049523]
Existing solutions often rely on explicit images for visual knowledge augmentation.
We propose a novel Visually-Augmented fine-tuning approach.
Our approach can consistently improve the performance of BERT, RoBERTa, BART, and T5 at different scales.
arXiv Detail & Related papers (2022-12-15T16:13:25Z) - HENet: Forcing a Network to Think More for Font Recognition [10.278412487287882]
This paper proposes a novel font recognizer built around a pluggable module for the font recognition task.
The pluggable module, called the HE Block, hides the most readily accessible discriminative features and forces the network to consider other, more complex features in order to solve hard examples of similar fonts.
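A simplified reading of that idea is sketched below: a module that masks the strongest-responding channels during training; the selection rule and drop ratio are assumptions, not HENet's exact design.

```python
# Simplified "hide the most discriminative features" module; not the paper's HE Block.
import torch
import torch.nn as nn

class HideTopChannels(nn.Module):
    """During training, zero out the channels with the strongest average activation,
    forcing the network to rely on the remaining, subtler features."""
    def __init__(self, drop_ratio: float = 0.2):
        super().__init__()
        self.drop_ratio = drop_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: [B, C, H, W]
        if not self.training:
            return x
        strength = x.abs().mean(dim=(2, 3))                # [B, C] per-channel activation strength
        k = max(1, int(self.drop_ratio * x.size(1)))
        top = strength.topk(k, dim=1).indices              # indices of the k strongest channels
        mask = torch.ones_like(strength)
        mask.scatter_(1, top, 0.0)                         # hide those channels
        return x * mask[:, :, None, None]
```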
arXiv Detail & Related papers (2021-10-21T03:25:47Z)