Parameter-Efficient Fine-Tuning of DINOv2 for Large-Scale Font Classification
- URL: http://arxiv.org/abs/2602.13889v1
- Date: Sat, 14 Feb 2026 21:15:12 GMT
- Title: Parameter-Efficient Fine-Tuning of DINOv2 for Large-Scale Font Classification
- Authors: Daniel Chen, Zaria Zinn, Marcus Lowe,
- Abstract summary: We present a font classification system capable of identifying 394 font families from rendered text images.<n>Our approach fine-tunes a DINOv2 Vision Transformer using Low-Rank Adaptation (LoRA), achieving approximately 86% top-1 accuracy while training fewer than 1% of the model's 87.2M parameters.
- Score: 0.22940141855172033
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a font classification system capable of identifying 394 font families from rendered text images. Our approach fine-tunes a DINOv2 Vision Transformer using Low-Rank Adaptation (LoRA), achieving approximately 86% top-1 accuracy while training fewer than 1% of the model's 87.2M parameters. We introduce a synthetic dataset generation pipeline that renders Google Fonts at scale with diverse augmentations including randomized colors, alignment, line wrapping, and Gaussian noise, producing training images that generalize to real-world typographic samples. The model incorporates built-in preprocessing to ensure consistency between training and inference, and is deployed as a HuggingFace Inference Endpoint. We release the model, dataset, and full training pipeline as open-source resources.
Related papers
- Towards Generalizable AI-Generated Image Detection via Image-Adaptive Prompt Learning [33.269644831847636]
Image-Adaptive Prompt Learning (IAPL) is a novel paradigm that adjusts the prompts according to each input image, rather than fixing them after training.<n>IAPL achieves state-of-the-art performance, with mean accuracies of 95.61% and 96.7% on the widely used UniversalFakeDetect and GenImage datasets.
arXiv Detail & Related papers (2025-08-03T05:41:24Z) - ROVI: A VLM-LLM Re-Captioned Dataset for Open-Vocabulary Instance-Grounded Text-to-Image Generation [23.118080583803266]
We present ROVI, a high-quality synthetic dataset for instance-grounded text-to-image generation.<n>Our key innovation is a strategy called recaptioning, focusing on the pre-detection stage.<n>For demonstrative purposes, a text-to-image model GLIGEN trained on ROVI significantly outperforms state-of-the-art alternatives in instance grounding accuracy, prompt fidelity, and aesthetic quality.
arXiv Detail & Related papers (2025-08-01T18:19:51Z) - FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [49.80911683739506]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.<n>We leverage different and relatively small-sized, open-source foundation models for zero-shot open-vocabulary segmentation.<n>Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z) - Learning from Synthetic Data for Visual Grounding [55.21937116752679]
We show that SynGround can improve the localization capabilities of off-the-shelf vision-and-language models.<n>Data generated with SynGround improves the pointing game accuracy of a pretrained ALBEF and BLIP models by 4.81% and 17.11% absolute percentage points, respectively.
arXiv Detail & Related papers (2024-03-20T17:59:43Z) - Diversified in-domain synthesis with efficient fine-tuning for few-shot
classification [64.86872227580866]
Few-shot image classification aims to learn an image classifier using only a small set of labeled examples per class.
We propose DISEF, a novel approach which addresses the generalization challenge in few-shot learning using synthetic data.
We validate our method in ten different benchmarks, consistently outperforming baselines and establishing a new state-of-the-art for few-shot classification.
arXiv Detail & Related papers (2023-12-05T17:18:09Z) - Image Captions are Natural Prompts for Text-to-Image Models [53.529592120988]
It is challenging for text-to-image generative models to synthesize informative training data with hand-crafted prompts.<n>We propose a simple yet effective method, validated through ImageNet classification.<n>We show that this simple caption significantly boosts the informativeness of synthetic data.
arXiv Detail & Related papers (2023-07-17T14:38:11Z) - DINOv2: Learning Robust Visual Features without Supervision [75.42921276202522]
This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources.
Most of the technical contributions aim at accelerating and stabilizing the training at scale.
In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature.
arXiv Detail & Related papers (2023-04-14T15:12:19Z) - Lafite2: Few-shot Text-to-Image Generation [132.14211027057766]
We propose a novel method for pre-training text-to-image generation model on image-only datasets.
It considers a retrieval-then-optimization procedure to synthesize pseudo text features.
It can be beneficial to a wide range of settings, including the few-shot, semi-supervised and fully-supervised learning.
arXiv Detail & Related papers (2022-10-25T16:22:23Z) - Generating More Pertinent Captions by Leveraging Semantics and Style on
Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
arXiv Detail & Related papers (2021-11-24T19:00:05Z) - Scalable Font Reconstruction with Dual Latent Manifolds [55.29525824849242]
We propose a deep generative model that performs typography analysis and font reconstruction.
Our approach enables us to massively scale up the number of character types we can effectively model.
We evaluate on the task of font reconstruction over various datasets representing character types of many languages.
arXiv Detail & Related papers (2021-09-10T20:37:43Z) - On the Accuracy of CRNNs for Line-Based OCR: A Multi-Parameter
Evaluation [0.0]
We train a high quality optical character recognition (OCR) model for difficult historical typefaces on degraded paper.
We are able to obtain a 0.44% character error rate (CER) model from only 10,000 lines of training data.
We show ablations for all components of our training pipeline, which relies on the open source framework Calamari.
arXiv Detail & Related papers (2020-08-06T17:20:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.