Fugu-MT Paper Translation (Abstract): OTRO: Oblivious Tokenization Path with Square-Root ORAM

Paper abstract: OTRO: Oblivious Tokenization Path with Square-Root ORAM

arxiv url: http://arxiv.org/abs/2606.17358v2
Date: Tue, 23 Jun 2026 01:32:01 GMT
Status: Translation complete
Last updated in system: 2026-06-24 22:16:48.218643
Title: OTRO: Oblivious Tokenization Path with Square-Root ORAM
Authors: Jonghyun Lee, Yongqin Wang, Rachit Rajat, Daniel Wong, Mengyuan Li, Murali Annavaram
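Abstract summary: This paper presents OTRO, an efficient, oblivious tokenization path tailored to latency-critical LLM serving. OTRO relies on square-root ORAM for fast single-access lookups but avoids the prohibitive $O(N\log^2 N)$ rebuild cost that square-root ORAM otherwise incurs every $\sqrt{N}$ accesses. OTRO limits TTFT overhead to at most 4.5%, keeps tokenizer-induced latency under 10% of total TTFT, and adds less than 0.5 GB of memory overhead.
Reference score (independently computed attention score): 16.989159913127818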
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The CPU-side large language model (LLM) tokenizer is a critical security gap in LLM serving through a confidential computing stack with CPU and GPU trusted execution environments (TEEs). Tokenizers convert prompts through table-driven lookups, and the resulting memory access patterns are a powerful source of side-channel leakage. Recent work demonstrates end-to-end recovery of user prompts from tokenizer access patterns on production Intel TDX. However, a drop-in use of the popular tree-based Oblivious RAMs (e.g., PathORAM) to prevent access-pattern leakage introduces a $\sim$13$\times$ tokenizer slowdown, resulting in 10-58% higher time-to-first-token (TTFT). In this paper, we present OTRO, an efficient, oblivious tokenization path tailored to latency-critical LLM serving. OTRO relies on square-root ORAM for fast single-access lookups, but avoids its prohibitive $O(N\log^2 N)$ rebuild cost every $\sqrt{N}$ accesses through three key innovations. First, OTRO provides a pool of replicated square-root ORAM instances that exploit the read-only nature of the tokenizer table. Second, an epoch-based rotation policy decouples accesses from rebuilds and pads each epoch with dummy accesses up to its boundary, minimizing observable information. Lastly, chunked KV-cache-aware tokenization further overlaps rebuilds with GPU prefill and minimizes the instance count. Implemented as modules in HuggingFace Tokenizers and nano-vLLM, running within a TDX-enabled CVM with an NVIDIA H100 GPU, OTRO limits TTFT overhead to at most 4.5%, keeps tokenizer-induced latency under 10% of total TTFT, and adds less than 0.5 GB of memory overhead while reducing the tokenizer's observable leakage across various model families and sizes.
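To make the square-root ORAM idea in the abstract more concrete, the following is a minimal toy sketch of a read-only square-root ORAM with epoch padding. It is not the authors' implementation: the class name `SqrtORAM`, the methods `read`, `dummy_read`, and `pad_epoch`, and the `trace` attribute are all hypothetical names chosen for illustration. It also makes several simplifying assumptions: the rebuild uses an ordinary (non-oblivious) shuffle, plain Python is not constant-time, and OTRO's pool of replicated instances and rebuild/prefill overlap are not modeled at all.

```python
import math
import random


class SqrtORAM:
    """Toy read-only square-root ORAM (illustrative sketch, not constant-time)."""

    def __init__(self, table, rng=None):
        self.table = list(table)
        self.n = len(self.table)
        # Number of accesses served between rebuilds (about sqrt(N)).
        self.period = max(1, math.isqrt(self.n))
        self.rng = rng or random.Random()
        self.trace = []  # physical slots touched: what an observer would see
        self.rebuild()

    def rebuild(self):
        # Fresh permutation over n real slots plus `period` dummy slots.
        # Assumption: a real system would use an oblivious shuffle here.
        slots = self.n + self.period
        self.perm = list(range(slots))
        self.rng.shuffle(self.perm)
        self.store = [None] * slots
        for logical, physical in enumerate(self.perm):
            if logical < self.n:
                self.store[physical] = self.table[logical]
        self.stash = []  # (logical index, value) pairs seen in this epoch
        self.count = 0   # accesses performed in this epoch

    def _touch(self, logical):
        physical = self.perm[logical]
        self.trace.append(physical)
        return self.store[physical]

    def read(self, index):
        if self.count == self.period:
            self.rebuild()
        # Always scan the whole stash, whether or not there is a hit.
        found, value = False, None
        for logical, cached in self.stash:
            if logical == index:
                found, value = True, cached
        if found:
            self._touch(self.n + self.count)  # hit: touch an unused dummy slot
        else:
            value = self._touch(index)        # miss: touch the real slot once
        self.stash.append((index, value))
        self.count += 1
        return value

    def dummy_read(self):
        if self.count == self.period:
            self.rebuild()
        self._touch(self.n + self.count)
        self.stash.append((-1, None))  # -1 never matches a real index
        self.count += 1

    def pad_epoch(self):
        # Fill the rest of the current epoch with dummy accesses.
        while self.count < self.period:
            self.dummy_read()


# Usage: a 16-entry table gives epochs of 4 accesses each.
oram = SqrtORAM(range(16), rng=random.Random(0))
values = [oram.read(i) for i in [3, 3, 7, 3, 3]]
oram.pad_epoch()
```

Within one epoch every physical slot is touched at most once, so repeated lookups of the same token and lookups of distinct tokens produce traces of the same shape. Padding to the epoch boundary is meant to hide how many real lookups the epoch contained. Under the rebuild cost quoted in the abstract, a single instance would pay an amortized $O(N\log^2 N)/\sqrt{N} = O(\sqrt{N}\log^2 N)$ per access, which is the cost the abstract says OTRO's replication and rotation are designed to keep off the critical path.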
