Fugu-MT Paper Translation (Abstract): Compression as Adaptation: Implicit Visual Representation with Diffusion Foundation Models

Paper overview: Compression as Adaptation: Implicit Visual Representation with Diffusion Foundation Models

arxiv url: http://arxiv.org/abs/2603.07615v1
Date: Sun, 08 Mar 2026 12:55:40 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:14.912048
Title: Compression as Adaptation: Implicit Visual Representation with Diffusion Foundation Models
Authors: Jiajun He, Zongyu Guo, Zhaoyang Jia, Xiaoyi Zhang, Jiahao Li, Xiao Li, Bin Li, José Miguel Hernández-Lobato, Yan Lu
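Abstract summary: We propose a new visual representation framework that encodes a signal as a function. Such implicit representations of visual signals can be hashed into a single compact vector.
Reference score (independently calculated attention score): 44.87688838240146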
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Modern visual generative models acquire rich visual knowledge through large-scale training, yet existing visual representations (such as pixels, latents, or tokens) remain external to the model and cannot directly exploit this knowledge for compact storage or reuse. In this work, we introduce a new visual representation framework that encodes a signal as a function, which is parametrized by low-rank adaptations attached to a frozen visual generative model. Such implicit representations of visual signals, e.g., an 81-frame video, can further be hashed into a single compact vector, achieving strong perceptual video compression at extremely low bitrates. Beyond basic compression, the functional nature of this representation enables inference-time scaling and control, allowing additional refinement on the compression performance. More broadly, as the implicit representations directly act as a function of the generation process, this suggests a unified framework bridging visual compression and generation.
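The phrase "low-rank adaptations attached to a frozen visual generative model" can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`adapted_forward`, `adapter_size`), the toy dimensions, and the zero initialization of `lora_b` are assumptions made for illustration, and the diffusion model, the fitting procedure, and the hashing step mentioned in the abstract are not shown. The sketch only demonstrates that a frozen weight matrix W can be adapted as W + BA, so that the small matrices A and B are the only per-signal parameters.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def adapted_forward(w_frozen, lora_b, lora_a, x):
    """Compute y = W x + B (A x); the frozen weight W is never modified."""
    base = matvec(w_frozen, x)
    delta = matvec(lora_b, matvec(lora_a, x))
    return [p + q for p, q in zip(base, delta)]


def adapter_size(d_out, d_in, rank):
    """Number of adapter parameters: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


# Toy dimensions (assumed for illustration only).
rng = random.Random(0)
d_out, d_in, rank = 8, 6, 2

w_frozen = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
lora_a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
lora_b = [[0.0] * rank for _ in range(d_out)]  # zero init: adapter starts as a no-op
x = [rng.gauss(0, 1) for _ in range(d_in)]

# With B = 0 the adapted layer reproduces the frozen layer exactly.
y = adapted_forward(w_frozen, lora_b, lora_a, x)

# The adapter holds far fewer numbers than the full weight matrix.
full_params = d_out * d_in
lora_params = adapter_size(d_out, d_in, rank)
```

In this toy setting the adapter stores `rank * (d_out + d_in)` values instead of `d_out * d_in`; the gap widens as the layer grows while the rank stays small, which is the general reason low-rank adapters are compact.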

Related papers