Fugu-MT Paper Translation (Abstract): LensVLM: Selective Context Expansion for Compressed Visual Representation of Text

Paper overview: LensVLM: Selective Context Expansion for Compressed Visual Representation of Text

arxiv url: http://arxiv.org/abs/2605.07019v1
Date: Thu, 07 May 2026 23:03:21 GMT
Status: Translation complete
Last updated in system: 2026-05-11 19:43:38.662632
Title: LensVLM: Selective Context Expansion for Compressed Visual Representation of Text
Title (reference translation): LensVLM: Selective Context Expansion for Compressed Visual Representations of Text
Authors: Roy Xie, Dan Friedman, Donghan Yu, Bowen Pan, Christopher Fifty, Jang-Hyun Kim, Xianzhi Du, Zhe Gan, Vivek Rathod, Bhuwan Dhingra
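Abstract summary: Vision language models (VLMs) can process text as rendered images, avoiding the need to tokenize the text. The proposed LensVLM is an inference framework and post-training recipe that lets a VLM scan compressed images and selectively expand only the relevant ones. Built on Qwen3.5-9B-Base, LensVLM maintains accuracy comparable to the full-text upper bound at 4.3x effective compression. LensVLM also generalizes to multimodal document and code understanding tasks, with its accuracy gain over baselines growing as compression increases.
Reference score (attention score computed independently by the site): 43.63292785228071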
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Vision Language Models (VLMs) offer the exciting possibility of processing text as rendered images, bypassing the need for tokenizing the text into long token sequences. Since VLM image encoders map fixed-size images to a fixed number of visual tokens, varying rendering resolution provides a fine-grained compression knob. However, accuracy deteriorates quickly as compression increases: characters shrink below the vision encoder's effective resolution, making them indistinguishable. To address this, we propose LensVLM, an inference framework and post-training recipe that enables VLMs to scan compressed images, then selectively expand only the relevant images to their uncompressed form via learned tools. Building on Qwen3.5-9B-Base, LensVLM maintains accuracy comparable to the full-text upper bound at 4.3x effective compression and outperforms retrieval-based, text- and visual-compression baselines up to 10.1x effective compression across seven text QA benchmarks. LensVLM also generalizes to multimodal document and code understanding tasks, with the accuracy gain over baselines growing as compression increases. Our analysis validates this approach: training makes visual compression robust to rendering choices, and as compression grows the model increasingly relies on expanded content rather than unreliable visual reading. The analysis also yields practical tool-choice guidance: text expansion is preferable for rendered text, while high-resolution image expansion suits native documents whose layout cues carry task-relevant information.
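Abstract (reference translation): Vision language models (VLMs) offer the exciting possibility of processing text as rendered images, avoiding the need to tokenize it into long token sequences. Because VLM image encoders map fixed-size images to a fixed number of visual tokens, varying the rendering resolution provides a fine-grained compression knob. However, accuracy drops rapidly as compression increases: characters fall below the vision encoder's effective resolution and become indistinguishable. To address this, we propose LensVLM, an inference framework and post-training recipe that lets a VLM scan compressed images and selectively expand only the relevant images to their uncompressed form via learned tools. Built on Qwen3.5-9B-Base, LensVLM maintains accuracy comparable to the full-text upper bound at 4.3x effective compression and outperforms retrieval-based, text-compression, and visual-compression baselines at up to 10.1x effective compression across seven text QA benchmarks. LensVLM also generalizes to multimodal document and code understanding tasks, with its accuracy gain over baselines growing as compression increases. Training makes visual compression robust to rendering choices, and as compression grows the model relies increasingly on expanded content rather than unreliable visual reading. Text expansion is preferable for rendered text, while high-resolution image expansion suits native documents whose layout cues carry task-relevant information.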
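
The token accounting behind the "compression knob" and selective expansion can be illustrated with a minimal sketch. This is not the authors' code: the `Page` structure, both function names, and in particular the definition of effective compression used here (full-text tokens divided by the visual tokens spent scanning plus the text tokens spent on expanded pages) are assumptions made for illustration. The abstract does not give the exact formula.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass(frozen=True)
class Page:
    """One rendered image of text (hypothetical structure for illustration)."""
    text_tokens: int    # cost of the page if it were fed to the model as plain text
    visual_tokens: int  # fixed number of visual tokens the image encoder emits


def raw_compression(pages: Sequence[Page]) -> float:
    """Ratio of plain-text tokens to visual tokens when nothing is expanded."""
    total_text = sum(p.text_tokens for p in pages)
    total_visual = sum(p.visual_tokens for p in pages)
    return total_text / total_visual


def effective_compression(pages: Sequence[Page], expanded: Iterable[int]) -> float:
    """Assumed definition: full-text cost over (scan cost + expansion cost).

    Every page is scanned in compressed form; pages whose indices appear in
    `expanded` are additionally paid for at their full text-token cost.
    """
    total_text = sum(p.text_tokens for p in pages)
    scan_cost = sum(p.visual_tokens for p in pages)
    expand_cost = sum(pages[i].text_tokens for i in set(expanded))
    return total_text / (scan_cost + expand_cost)


# Ten pages, each worth 1000 text tokens but rendered into 100 visual tokens.
pages = [Page(text_tokens=1000, visual_tokens=100) for _ in range(10)]
no_expansion = raw_compression(pages)
one_expanded = effective_compression(pages, expanded=[0])
```

Under these assumed numbers, expanding a single page out of ten halves the ratio, which shows why expanding only the relevant images matters: every unnecessary expansion is charged at the full text cost.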


Related papers