Fugu-MT Paper Translation (Abstract): Can Vision-Language Models Handle Long-Context Code? An Empirical Study on Visual Compression

Paper overview: Can Vision-Language Models Handle Long-Context Code? An Empirical Study on Visual Compression

arXiv URL: http://arxiv.org/abs/2602.00746v1
Date: Sat, 31 Jan 2026 14:23:51 GMT
Status: Translation complete
System-internal update date: 2026-02-03 19:28:33.370628
Title: Can Vision-Language Models Handle Long-Context Code? An Empirical Study on Visual Compression
Authors: Jianping Zhong, Guochang Li, Chen Zhi, Junxiao Han, Zhen Qin, Xinkui Zhao, Nan Wang, Shuiguang Deng, Jianwei Yin
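Abstract summary: LongCodeOCR is a visual compression framework for Vision-Language Models (VLMs). By preserving a global view, this approach avoids the dependency breakage inherent in filtering. The results suggest that visual code compression is a viable alternative for tasks requiring global understanding.
Reference score (independently computed attention level): 36.83667074155589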
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) struggle with long-context code due to window limitations. Existing textual code compression methods mitigate this via selective filtering but often disrupt dependency closure, causing semantic fragmentation. To address this, we introduce LongCodeOCR, a visual compression framework that renders code into compressed two-dimensional image sequences for Vision-Language Models (VLMs). By preserving a global view, this approach avoids the dependency breakage inherent in filtering. We systematically evaluate LongCodeOCR against the state-of-the-art LongCodeZip across four benchmarks spanning code summarization, code question answering, and code completion. Our results demonstrate that visual code compression serves as a viable alternative for tasks requiring global understanding. At comparable compression ratios ($\sim$1.7$\times$), LongCodeOCR improves CompScore on Long Module Summarization by 36.85 points over LongCodeZip. At a 1M-token context length with Glyph (a specialized 9B VLM), LongCodeOCR maintains higher accuracy than LongCodeZip while operating at about 4$\times$ higher compression. Moreover, compared with LongCodeZip, LongCodeOCR drastically reduces compression-stage overhead (reducing latency from $\sim$4.3 hours to $\sim$1 minute at 1M tokens). Finally, our results characterize a fundamental coverage--fidelity trade-off: visual code compression retains broader context coverage to support global dependencies, yet faces fidelity bottlenecks on exactness-critical tasks; by contrast, textual code compression preserves symbol-level precision while sacrificing structural coverage.
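To make the compression ratios quoted in the abstract concrete, the following is a minimal sketch of the underlying arithmetic. It is not the LongCodeOCR implementation, which this page does not describe. The helper names `paginate` and `compression_ratio`, the page geometry (`cols`, `rows`), the four-characters-per-token estimate, and the figure of 256 visual tokens per page are all hypothetical values chosen for illustration. Pages of wrapped text lines stand in for rendered images.

```python
import math


def paginate(code: str, cols: int = 120, rows: int = 80) -> list[list[str]]:
    """Split source text into fixed-size pages of wrapped lines.

    Each page is a stand-in for one rendered image; no actual rendering happens.
    """
    wrapped: list[str] = []
    for line in code.splitlines():
        if not line:
            # Keep blank lines so the page layout mirrors the source.
            wrapped.append("")
            continue
        # Hard-wrap long lines so every row fits the assumed page width.
        for start in range(0, len(line), cols):
            wrapped.append(line[start:start + cols])
    return [wrapped[i:i + rows] for i in range(0, len(wrapped), rows)]


def compression_ratio(text_tokens: int, pages: int, visual_tokens_per_page: int) -> float:
    """Text-token cost divided by visual-token cost; above 1 means the images are cheaper."""
    if pages <= 0 or visual_tokens_per_page <= 0:
        raise ValueError("pages and visual_tokens_per_page must be positive")
    return text_tokens / (pages * visual_tokens_per_page)


if __name__ == "__main__":
    source = "\n".join(f"def f{i}(x): return x + {i}" for i in range(200))
    pages = paginate(source)
    # Assumption: roughly 4 characters per text token, 256 visual tokens per page.
    text_tokens = math.ceil(len(source) / 4)
    print(len(pages), round(compression_ratio(text_tokens, len(pages), 256), 2))
```

Under these assumptions, denser pages (more rows and columns per image at a fixed visual-token budget) raise the ratio, which is one way to read the coverage versus fidelity trade-off the abstract describes: more code per image keeps more context in view but leaves fewer pixels per symbol.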
