Fugu-MT Paper Translation (Summary): SemanticZip: A Pilot Framework for Lossy Text Compression with LLMs as Semantic Decompressors

Paper summary: SemanticZip: A Pilot Framework for Lossy Text Compression with LLMs as Semantic Decompressors

arxiv url: http://arxiv.org/abs/2605.24541v1
Date: Sat, 23 May 2026 12:14:04 GMT
Status: Translation complete
System update date: 2026-05-26 19:50:18.174891
Title: SemanticZip: A Pilot Framework for Lossy Text Compression with LLMs as Semantic Decompressors
Title (reference translation): SemanticZip: A Pilot Framework for Lossy Text Compression Using LLMs as Semantic Decompressors
Authors: Natalia Trukhina, Vadim Vashkelis
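Abstract summary: SemanticZip compresses text into compact codes that an LLM can expand into task-relevant meaning. Unlike lossless compression, it does not require byte-identical reconstruction. The paper is a pilot framework, not a benchmark claim.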
Reference score (independently calculated attention score): 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text compression for large language model (LLM) systems is usually framed as token deletion, retrieval, summarization, or exact reconstruction. We study a more aggressive but explicitly lossy setting: compress text into compact codes that an LLM can expand into task-relevant meaning. We call this setting SemanticZip. Unlike lossless compression, SemanticZip does not require byte-identical reconstruction; unlike ordinary summarization, it treats model-based decompression as part of the codec and evaluates whether task-relevant semantic commitments are recovered. This paper is a pilot framework, not a benchmark claim. We formalize LLM-mediated decompression, define a protected/lossy packet architecture, and evaluate six representation regimes over five author-constructed diagnostic cases: structured prose, JSON, CCL-Core, CCL-Min, SemanticZip ASCII, and SemanticZip emoji. An independent decoder LLM reconstructs typed semantic atoms from each compressed representation, and we score Critical Atom Recall, Weighted Atom Recall, precision, and tokenizer gain. In this pilot, structured prose has the highest recoverability, with WAR = 0.956 and 19.1% o200k_base token gain. CCL-Min is the strongest balanced point, with 39.4% token gain and WAR = 0.874. SemanticZip ASCII provides the largest useful compression, with 46.5% token gain and WAR = 0.802, while emoji-heavy SemanticZip performs worse on both compression and recovery. The main contribution is not the claim that these numbers establish a universal frontier. Rather, we introduce a reproducible experimental interface for studying lossy, LLM-decompressible text codes and a design principle: safety-critical and exact commitments should remain protected, while predictable low-risk context may be semantically zipped.
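Abstract (reference translation): Text compression for large language model (LLM) systems is usually framed as token deletion, retrieval, summarization, or exact reconstruction. This work compresses text into compact codes that an LLM can expand into task-relevant meaning, a setting the authors call SemanticZip. Unlike ordinary summarization, it treats model-based decompression as part of the codec and evaluates whether task-relevant semantic commitments are recovered. The paper is a pilot framework, not a benchmark claim. It formalizes LLM-mediated decompression, defines a protected/lossy packet architecture, and evaluates six representation regimes over five diagnostic cases: structured prose, JSON, CCL-Core, CCL-Min, SemanticZip ASCII, and SemanticZip emoji. An independent decoder LLM reconstructs typed semantic atoms from each compressed representation, scored by Critical Atom Recall, Weighted Atom Recall, precision, and tokenizer gain. In this pilot, structured prose has the highest recoverability, with WAR = 0.956 and 19.1% o200k_base token gain. CCL-Min is the strongest balanced point, with 39.4% token gain and WAR = 0.874. SemanticZip ASCII provides the largest useful compression, with 46.5% token gain and WAR = 0.802, while emoji-heavy SemanticZip performs worse on both compression and recovery. The main contribution is not the claim that these numbers establish a universal frontier. Rather, the paper introduces a reproducible experimental interface for studying lossy, LLM-decompressible text codes and a design principle: safety-critical and exact commitments should remain protected, while predictable low-risk context may be semantically zipped.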
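The abstract names four scores (Critical Atom Recall, Weighted Atom Recall, precision, and tokenizer gain) without giving their formulas. The Python sketch below shows one plausible reading of them, not the authors' definitions. The `Atom` type, its `key`, `weight`, and `critical` fields, the exact-match comparison on keys, and the formula `1 - compressed / original` for token gain are all assumptions made for illustration. Token counts are passed in as plain integers so that the sketch does not depend on any particular tokenizer library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    """Hypothetical typed semantic atom; field names are illustrative only."""
    key: str        # identifier used to match gold and recovered atoms
    weight: float   # assumed importance weight used by weighted recall
    critical: bool  # assumed flag for safety-critical or exact commitments


def weighted_atom_recall(gold: list[Atom], recovered: set[str]) -> float:
    """Share of total gold weight whose atoms appear in the decoder output."""
    total = sum(a.weight for a in gold)
    if total == 0:
        return 0.0
    return sum(a.weight for a in gold if a.key in recovered) / total


def critical_atom_recall(gold: list[Atom], recovered: set[str]) -> float:
    """Share of critical gold atoms that appear in the decoder output."""
    critical = [a for a in gold if a.critical]
    if not critical:
        return 1.0  # nothing critical to lose (an assumed convention)
    return sum(a.key in recovered for a in critical) / len(critical)


def atom_precision(gold: list[Atom], recovered: set[str]) -> float:
    """Share of recovered atoms that match some gold atom."""
    if not recovered:
        return 0.0
    gold_keys = {a.key for a in gold}
    return len(gold_keys & recovered) / len(recovered)


def token_gain(original_tokens: int, compressed_tokens: int) -> float:
    """Assumed definition: fraction of tokens saved relative to the original."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return 1.0 - compressed_tokens / original_tokens


# Made-up example data, not taken from the paper.
gold_atoms = [
    Atom("dose", weight=3.0, critical=True),
    Atom("deadline", weight=2.0, critical=True),
    Atom("tone", weight=1.0, critical=False),
]
decoded_keys = {"dose", "tone", "extra"}  # "extra" is an unsupported atom

war = weighted_atom_recall(gold_atoms, decoded_keys)
car = critical_atom_recall(gold_atoms, decoded_keys)
prec = atom_precision(gold_atoms, decoded_keys)
gain = token_gain(original_tokens=200, compressed_tokens=107)
```

Under this reading, a code can score well on weighted recall while still dropping a critical atom, which is why the two recalls are kept separate here. That separation mirrors the abstract's design principle that safety-critical and exact commitments should stay protected.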


Related papers