Related papers: Towards Visual Text Design Transfer Across Languages

Towards Visual Text Design Transfer Across Languages

URL: http://arxiv.org/abs/2410.18823v2
Date: Tue, 29 Oct 2024 08:24:22 GMT
Title: Towards Visual Text Design Transfer Across Languages
Authors: Yejin Choi, Jiwan Chung, Sumin Shim, Giyeong Oh, Youngjae Yu,
Abstract summary: We introduce a novel task of Multimodal Style Translation (MuST-Bench) MuST-Bench is a benchmark designed to evaluate the ability of visual text generation models to perform translation across different writing systems. In response, we introduce SIGIL, a framework for multimodal style translation that eliminates the need for style descriptions.
Score: 49.78504488452978
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Visual text design plays a critical role in conveying themes, emotions, and atmospheres in multimodal formats such as film posters and album covers. Translating these visual and textual elements across languages extends the concept of translation beyond mere text, requiring the adaptation of aesthetic and stylistic features. To address this, we introduce a novel task of Multimodal Style Translation (MuST-Bench), a benchmark designed to evaluate the ability of visual text generation models to perform translation across different writing systems while preserving design intent. Our initial experiments on MuST-Bench reveal that existing visual text generation models struggle with the proposed task due to the inadequacy of textual descriptions in conveying visual design. In response, we introduce SIGIL, a framework for multimodal style translation that eliminates the need for style descriptions. SIGIL enhances image generation models through three innovations: glyph latent for multilingual settings, pretrained VAEs for stable style guidance, and an OCR model with reinforcement learning feedback for optimizing readable character generation. SIGIL outperforms existing baselines by achieving superior style consistency and legibility while maintaining visual fidelity, setting itself apart from traditional description-based approaches. We release MuST-Bench publicly for broader use and exploration https://huggingface.co/datasets/yejinc/MuST-Bench.

Related papers

Context-Adaptive Multi-Prompt Embedding with Large Language Models for Vision-Language Alignment [33.152772648399846]
We propose a novel approach to enrich semantic representations in vision-language contrastive learning.<n>We leverage a pretrained LLM as the text encoder within the CLIP framework, processing all prompts jointly in a single forward pass.<n>The resulting prompt embeddings are combined into a unified text representation, enabling semantically richer alignment with visual features.
arXiv Detail & Related papers (2025-08-03T20:48:43Z)
DesignDiffusion: High-Quality Text-to-Design Image Generation with Diffusion Models [115.62816053600085]
We present DesignDiffusion, a framework for synthesizing design images from textual descriptions. The proposed framework directly synthesizes textual and visual design elements from user prompts. It utilizes a distinctive character embedding derived from the visual text to enhance the input prompt.
arXiv Detail & Related papers (2025-03-03T15:22:57Z)
Bringing Characters to New Stories: Training-Free Theme-Specific Image Generation via Dynamic Visual Prompting [71.29100512700064]
We present T-Prompter, a training-free method for theme-specific image generation. T-Prompter integrates reference images into generative models, allowing users to seamlessly specify the target theme. Our approach enables consistent story generation, character design, realistic character generation, and style-guided image generation.
arXiv Detail & Related papers (2025-01-26T19:01:19Z)
One-Shot Multilingual Font Generation Via ViT [2.023301270280465]
Font design poses unique challenges for logographic languages like Chinese, Japanese, and Korean. This paper introduces a novel Vision Transformer (ViT)-based model for multi-language font generation.
arXiv Detail & Related papers (2024-12-15T23:52:35Z)
ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models [52.23899502520261]
We introduce a new framework named ARTIST to focus on the learning of text structures. We finetune a visual diffusion model, enabling it to assimilate textual structure information from the pretrained textual model. Empirical results on the MARIO-Eval benchmark underscore the effectiveness of the proposed method, showing an improvement of up to 15% in various metrics.
arXiv Detail & Related papers (2024-06-17T19:31:24Z)
AnyTrans: Translate AnyText in the Image with Large Scale Models [88.5887934499388]
This paper introduces AnyTrans, an all-encompassing framework for the task-Translate AnyText in the Image (TATI) Our framework incorporates contextual cues from both textual and visual elements during translation. We have meticulously compiled a test dataset called MTIT6, which consists of multilingual text image translation data from six language pairs.
arXiv Detail & Related papers (2024-06-17T11:37:48Z)
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language. The resulting visual tokens encompass high-level semantics worthy of a word and also support dynamic sequence length varying from the image. This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z)
ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation [35.05755930636518]
We propose ViLTA, comprising of two components to further facilitate the model to learn fine-grained representations among image-text pairs. For Masked Language Modeling (MLM), we propose a cross-distillation method to generate soft labels to enhance the robustness of model. For Image-Text Matching (ITM), we leverage the current language encoder to synthesize hard negatives based on the context of language input.
arXiv Detail & Related papers (2023-08-31T12:46:36Z)
Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models [24.456117679941816]
Contrastive Reading Model (Cream) is a novel neural architecture designed to enhance the language-image understanding capability of Large Language Models (LLMs) Our approach bridges the gap between vision and language understanding, paving the way for the development of more sophisticated Document Intelligence Assistants.
arXiv Detail & Related papers (2023-05-24T11:59:13Z)
StoryTrans: Non-Parallel Story Author-Style Transfer with Discourse Representations and Content Enhancing [73.81778485157234]
Long texts usually involve more complicated author linguistic preferences such as discourse structures than sentences. We formulate the task of non-parallel story author-style transfer, which requires transferring an input story into a specified author style. We use an additional training objective to disentangle stylistic features from the learned discourse representation to prevent the model from degenerating to an auto-encoder.
arXiv Detail & Related papers (2022-08-29T08:47:49Z)
Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model. During the training phase, the modality transition network is optimised by the proposed modality loss. Experiments have been conducted on the MS-COCO dataset demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.