Enhancing Robustness of Autoregressive Language Models against Orthographic Attacks via Pixel-based Approach
- URL: http://arxiv.org/abs/2508.21206v1
- Date: Thu, 28 Aug 2025 20:48:38 GMT
- Title: Enhancing Robustness of Autoregressive Language Models against Orthographic Attacks via Pixel-based Approach
- Authors: Han Yang, Jian Lan, Yihong Liu, Hinrich Schütze, Thomas Seidl
- Abstract summary: Autoregressive language models are vulnerable to orthographic attacks. This vulnerability stems from the out-of-vocabulary issue inherent in subword tokenizers and their embeddings. We propose a pixel-based generative language model that replaces the text-based embeddings with pixel-based representations by rendering words as individual images.
- Score: 51.95266411355865
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive language models are vulnerable to orthographic attacks, where input text is perturbed with characters from multilingual alphabets, leading to substantial performance degradation. This vulnerability primarily stems from the out-of-vocabulary issue inherent in subword tokenizers and their embeddings. To address this limitation, we propose a pixel-based generative language model that replaces text-based embeddings with pixel-based representations by rendering words as individual images. This design provides stronger robustness to noisy inputs while extending compatibility to multilingual text across diverse writing systems. We evaluate the proposed method on the multilingual LAMBADA dataset, the WMT24 dataset, and the SST-2 benchmark, demonstrating both its resilience to orthographic noise and its effectiveness in multilingual settings.
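To make the failure mode and the remedy concrete, the following is a minimal sketch (not the authors' implementation) of how a homoglyph perturbation fragments a subword tokenizer while leaving the rendered pixels nearly unchanged. It assumes the GPT-2 tokenizer from Hugging Face transformers, Pillow for rendering, and a Unicode-capable font such as DejaVuSans; the `render_word` helper is a hypothetical stand-in for the paper's word-rendering pipeline.

```python
# Sketch: a homoglyph attack fragments subword tokenization, while the
# rendered pixels stay nearly identical. Illustrative only, not the paper's code.
import numpy as np
from PIL import Image, ImageDraw, ImageFont   # pip install Pillow
from transformers import AutoTokenizer        # pip install transformers

tokenizer = AutoTokenizer.from_pretrained("gpt2")

clean = "language"
# Swap Latin 'a' for Cyrillic 'а' (U+0430): visually identical, different codepoint.
attacked = clean.replace("a", "\u0430")

print(tokenizer.tokenize(clean))     # one (or few) in-vocabulary subword(s)
print(tokenizer.tokenize(attacked))  # many byte-level fragments: the OOV failure mode

def render_word(word: str, size=(64, 16)) -> np.ndarray:
    """Hypothetical helper: render one word as a grayscale image in [0, 1]."""
    try:
        font = ImageFont.truetype("DejaVuSans.ttf", 12)  # assumes DejaVu is installed
    except OSError:
        font = ImageFont.load_default()
    img = Image.new("L", size, color=255)
    ImageDraw.Draw(img).text((2, 2), word, fill=0, font=font)
    return np.asarray(img, dtype=np.float32) / 255.0

# A pixel-based model sees nearly identical inputs for the two strings.
diff = np.abs(render_word(clean) - render_word(attacked)).mean()
print(f"mean pixel difference: {diff:.4f}")  # small for visually similar glyphs
```

The paper's generative model consumes such renderings in place of token embeddings, which is why perturbations that shatter a subword vocabulary barely change what the model sees.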
Related papers
- Does Visual Rendering Bypass Tokenization? Investigating Script-Tokenizer Misalignment in Pixel-Based Language Models [20.181240222544208]
Multimodal variants such as DualGPT reintroduce text tokenizers to improve autoregressive performance. We investigate a fundamental question: does visual rendering truly decouple a model from tokenization constraints? Our results show that, despite visual rendering, reintegrating a text tokenizer into the architecture reintroduces the same issue that pixel-based language modeling aims to resolve.
arXiv Detail & Related papers (2026-01-12T07:37:46Z) - Overcoming Vocabulary Constraints with Pixel-level Fallback [9.753745943931207]
Subword tokenization requires balancing computational efficiency and vocabulary coverage. We propose a vocabulary-free encoder that generates input embeddings from text rendered as pixels (see the sketch after this list).
arXiv Detail & Related papers (2025-04-02T20:50:31Z) - Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing text in a source language into an image containing the translation in a target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves competitive performance compared to cascaded models with only 70.9% of parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z) - Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z) - Synchronizing Vision and Language: Bidirectional Token-Masking AutoEncoder for Referring Image Segmentation [26.262887028563163]
Referring Image Segmentation (RIS) aims to segment target objects expressed in natural language within a scene at the pixel level.
We propose a novel bidirectional token-masking autoencoder (BTMAE) inspired by the masked autoencoder (MAE).
BTMAE learns the context of image-to-language and language-to-image by reconstructing missing features in both image and language features at the token level.
arXiv Detail & Related papers (2023-11-29T07:33:38Z) - Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation [82.5217996570387]
We adapt a pre-trained language model for auto-regressive text-to-image generation.
We find that pre-trained language models offer limited help.
arXiv Detail & Related papers (2023-11-27T07:19:26Z) - Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens [76.40196364163663]
We revisit contrastive vision-language pre-training models such as CLIP, replacing patch and token embeddings with finite discrete tokens.
We show that our method can learn more comprehensive representations and capture meaningful cross-modal correspondence.
arXiv Detail & Related papers (2023-03-27T00:58:39Z) - To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP [0.0]
We investigate three categories of text augmentation methods that modify the syntax of the input.
We compare them on part-of-speech tagging, dependency parsing and semantic role labeling for a diverse set of language families.
Our results suggest that the augmentation techniques can further improve over strong baselines based on mBERT.
arXiv Detail & Related papers (2021-11-18T10:52:48Z) - Robust Open-Vocabulary Translation from Visual Text Representations [15.646399508495133]
Machine translation models have discrete vocabularies and commonly use subword segmentation techniques to achieve an 'open vocabulary.' This approach relies on consistent and correct underlying vocabularies.
Motivated by human language processing, we propose the use of visual text representations.
arXiv Detail & Related papers (2021-04-16T16:37:13Z) - Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
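As referenced in the Pixel-level Fallback entry above, the following is a minimal sketch of the general idea of a vocabulary-free pixel encoder: rendered word images are projected into a language model's embedding space and passed in place of token embeddings. The class name, dimensions, and the single linear projection are illustrative assumptions, not the cited paper's architecture; `inputs_embeds` is the standard Hugging Face hook for bypassing the tokenizer.

```python
# Sketch of a vocabulary-free pixel encoder: rendered word images are projected
# into the LM's embedding space, bypassing the tokenizer entirely. Names and
# dimensions are illustrative assumptions, not the cited paper's architecture.
import torch
import torch.nn as nn

class PixelFallbackEncoder(nn.Module):
    """Hypothetical encoder: one grayscale image per word -> one input embedding."""

    def __init__(self, img_h: int = 16, img_w: int = 64, d_model: int = 768):
        super().__init__()
        # A single linear projection from flattened pixels to the model dimension;
        # real systems would likely use a convolutional or ViT-style patch encoder.
        self.proj = nn.Linear(img_h * img_w, d_model)

    def forward(self, word_images: torch.Tensor) -> torch.Tensor:
        # word_images: (batch, seq_len, img_h, img_w), grayscale values in [0, 1]
        b, s, h, w = word_images.shape
        return self.proj(word_images.reshape(b, s, h * w))  # (batch, seq_len, d_model)

# Usage with a Hugging Face causal LM (GPT-2's hidden size is 768), feeding
# embeddings directly via `inputs_embeds` so no tokenizer is involved:
#   from transformers import GPT2LMHeadModel
#   model = GPT2LMHeadModel.from_pretrained("gpt2")
#   embeds = PixelFallbackEncoder()(rendered_words)   # (1, seq_len, 768)
#   out = model(inputs_embeds=embeds)
```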