The Trojan in the Vocabulary: Stealthy Sabotage of LLM Composition
- URL: http://arxiv.org/abs/2601.00065v1
- Date: Wed, 31 Dec 2025 19:00:03 GMT
- Title: The Trojan in the Vocabulary: Stealthy Sabotage of LLM Composition
- Authors: Xiaoze Liu, Weichen Yu, Matt Fredrikson, Xiaoqian Wang, Jing Gao
- Abstract summary: Tokenizer transplant introduces a supply-chain vulnerability. By exploiting the geometry of coefficient reuse, our attack creates an asymmetric realizability gap. Empirically, the attack is training-free and achieves spectral mimicry to evade outlier detection.
- Score: 31.827344197678126
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The open-weight LLM ecosystem is increasingly defined by model composition techniques (such as weight merging, speculative decoding, and vocabulary expansion) that remix capabilities from diverse sources. A critical prerequisite for applying these methods across different model families is tokenizer transplant, which aligns incompatible vocabularies to a shared embedding space. We demonstrate that this essential interoperability step introduces a supply-chain vulnerability: we engineer a single "breaker token" that is functionally inert in a donor model yet reliably reconstructs into a high-salience malicious feature after transplant into a base model. By exploiting the geometry of coefficient reuse, our attack creates an asymmetric realizability gap that sabotages the base model's generation while leaving the donor's utility statistically indistinguishable from nominal behavior. We formalize this as a dual-objective optimization problem and instantiate the attack using a sparse solver. Empirically, the attack is training-free and achieves spectral mimicry to evade outlier detection, while demonstrating structural persistence against fine-tuning and weight merging, highlighting a hidden risk in the pipeline of modular AI composition. Code is available at https://github.com/xz-liu/tokenforge
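The coefficient-reuse step at the heart of tokenizer transplant can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: it assumes an OMP-style greedy sparse solver, and all names, dimensions, and random data are hypothetical. A donor-only token is expressed as a sparse combination of embeddings of tokens shared by both vocabularies; reusing those coefficients over the base model's embeddings of the same shared tokens yields the transplanted vector — the geometry the abstract says an attacker can exploit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative dimensions): 50 tokens shared by both
# vocabularies, donor embedding dim 16, base embedding dim 24.
n_shared, d_donor, d_base, k = 50, 16, 24, 5
E_donor = rng.normal(size=(n_shared, d_donor))  # donor embeddings of shared tokens
E_base = rng.normal(size=(n_shared, d_base))    # base embeddings of the same tokens
t_donor = rng.normal(size=d_donor)              # donor-only token to transplant


def sparse_coeffs(E, t, k):
    """Greedy (OMP-style) sparse fit: t ~= E.T @ c with at most k nonzeros."""
    support, residual = [], t.copy()
    for _ in range(k):
        j = int(np.argmax(np.abs(E @ residual)))  # shared token most correlated with residual
        if j not in support:
            support.append(j)
        # Re-fit coefficients on the current support by least squares.
        c_s, *_ = np.linalg.lstsq(np.asarray(E[support]).T, t, rcond=None)
        residual = t - np.asarray(E[support]).T @ c_s
    c = np.zeros(len(E))
    c[support] = c_s
    return c


c = sparse_coeffs(E_donor, t_donor, k)  # solved once, in donor space
t_base = E_base.T @ c                   # coefficient reuse: same c, base geometry
```

Because `c` is fit only in the donor's embedding space, nothing constrains what `t_base` reconstructs into on the base side — which is the asymmetry the "breaker token" attack is built on.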
Related papers
- StutterFuse: Mitigating Modality Collapse in Stuttering Detection with Jaccard-Weighted Metric Learning and Gated Fusion [0.40105987447353786]
Stuttering detection breaks down when disfluencies overlap. Existing parametric models struggle to distinguish complex, simultaneous disfluencies. We introduce StutterFuse, the first Retrieval-Augmented Classification (RAC) approach for multi-label detection.
arXiv Detail & Related papers (2025-12-15T18:28:39Z) - Bridging Symmetry and Robustness: On the Role of Equivariance in Enhancing Adversarial Robustness [9.013874391203453]
Adversarial examples reveal critical vulnerabilities in deep neural networks by exploiting their sensitivity to imperceptible input perturbations. In this work, we investigate an architectural approach to adversarial robustness by embedding group-equivariant convolutions. These layers encode symmetry priors that align model behavior with structured transformations in the input space, promoting smoother decision boundaries.
arXiv Detail & Related papers (2025-10-17T19:26:58Z) - Every Step Counts: Decoding Trajectories as Authorship Fingerprints of dLLMs [63.82840470917859]
We show that the decoding mechanism of dLLMs can be used as a powerful tool for model attribution. We propose a novel information extraction scheme called the Directed Decoding Map (DDM), which captures structural relationships between decoding steps and better reveals model-specific behaviors.
arXiv Detail & Related papers (2025-10-02T06:25:10Z) - Exploiting Edge Features for Transferable Adversarial Attacks in Distributed Machine Learning [54.26807397329468]
This work explores a previously overlooked vulnerability in distributed deep learning systems. An adversary who intercepts the intermediate features transmitted between distributed components can still pose a serious threat. We propose an exploitation strategy specifically designed for distributed settings.
arXiv Detail & Related papers (2025-07-09T20:09:00Z) - MISLEADER: Defending against Model Extraction with Ensembles of Distilled Models [56.09354775405601]
Model extraction attacks aim to replicate the functionality of a black-box model through query access. Most existing defenses presume that attacker queries are out-of-distribution (OOD), enabling them to detect and disrupt suspicious inputs. We propose MISLEADER, a novel defense strategy that does not rely on OOD assumptions.
arXiv Detail & Related papers (2025-06-03T01:37:09Z) - CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning [12.293101110323722]
Fine-tuning-as-a-service exposes models to harmful fine-tuning attacks. We propose a paradigm shift: instead of selective removal, we advocate for inducing model collapse. This collapse directly neutralizes the very general capabilities that attackers exploit.
arXiv Detail & Related papers (2025-05-22T11:47:08Z) - EigenShield: Causal Subspace Filtering via Random Matrix Theory for Adversarially Robust Vision-Language Models [3.958317527488534]
Vision-Language Models (VLMs) inherit adversarial vulnerabilities of Large Language Models (LLMs). Existing defenses, including adversarial training, input transformations, and detection, are computationally expensive, architecture-dependent, and fragile against adaptive attacks. We introduce EigenShield, an inference-time defense leveraging Random Matrix Theory to quantify adversarial disruptions in high-dimensional VLM representations.
arXiv Detail & Related papers (2025-02-20T19:10:51Z) - Disentanglement via Latent Quantization [60.37109712033694]
In this work, we construct an inductive bias towards encoding to and decoding from an organized latent space.
We demonstrate the broad applicability of this approach by adding it to both basic data-reconstructing (vanilla autoencoder) and latent-reconstructing (InfoGAN) generative models.
arXiv Detail & Related papers (2023-05-28T06:30:29Z) - Discriminator-Free Generative Adversarial Attack [87.71852388383242]
Generative-based adversarial attacks can get rid of this limitation.
A Symmetric Saliency-based Auto-Encoder (SSAE) generates the perturbations.
The adversarial examples generated by SSAE not only make the widely-used models collapse, but also achieve good visual quality.
arXiv Detail & Related papers (2021-07-20T01:55:21Z) - Preventing Posterior Collapse with Levenshtein Variational Autoencoder [61.30283661804425]
We propose to replace the evidence lower bound (ELBO) with a new objective which is simple to optimize and prevents posterior collapse.
We show that Levenshtein VAE produces more informative latent representations than alternative approaches to preventing posterior collapse.
arXiv Detail & Related papers (2020-04-30T13:27:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.