Fugu-MT Paper Translation (Abstract): BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation

Paper overview: BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation

arxiv url: http://arxiv.org/abs/2604.16514v3
Date: Wed, 22 Apr 2026 04:56:34 GMT
Status: Translation complete
Last updated in system: 2026-04-23 15:36:10.262481
Title: BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation
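Title (reference translation): BARD: Bridging Autoregressive and Diffusion Vision-Language Models via Highly Efficient Progressive Block Merging and Stage-Wise Distillation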
Authors: Baoyou Chen, Hanchen Xia, Peng Tu, Haojun Shi, Shan Mu, Weihao Yuan, Siyu Zhu
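Abstract summary: This paper presents BARD, a simple and effective bridging framework that converts a pretrained autoregressive VLM into a large-block diffusion VLM. With $\leq$ 4.4M data, BARD-VL transfers strong multimodal capability from Qwen3-VL to a large-block dVLM.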
Reference score (independently computed attention score): 9.248424980709453
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autoregressive vision-language models (VLMs) deliver strong multimodal capability, but their token-by-token decoding imposes a fundamental inference bottleneck. Diffusion VLMs offer a more parallel decoding paradigm, yet directly converting a pretrained autoregressive VLM into a large-block diffusion VLM (dVLM) often leads to substantial quality degradation. In this work, we present BARD, a simple and effective bridging framework that converts a pretrained autoregressive VLM into a same-architecture, decoding-efficient dVLM. Our approach combines progressive supervised block merging, which gradually enlarges the decoding block size, with stage-wise intra-dVLM distillation from a fixed small-block diffusion anchor to recover performance lost at larger blocks. We further incorporate a mixed noise scheduler to improve robustness and token revision during denoising, and memory-friendly training to enable efficient training on long multimodal sequences. A key empirical finding is that direct autoregressive-to-diffusion distillation is poorly aligned and can even hurt performance, whereas distillation within the diffusion regime is consistently effective. Experimental results show that, with $\leq$ 4.4M data, BARD-VL transfers strong multimodal capability from Qwen3-VL to a large-block dVLM. Remarkably, BARD-VL establishes a new SOTA among comparable-scale open dVLMs on our evaluation suite at both 4B and 8B scales. At the same time, BARD-VL achieves up to 3$\times$ decoding throughput speedup compared to the source model. Code is available at: https://github.com/fudan-generative-vision/Bard-VL
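Abstract (reference translation): Autoregressive vision-language models (VLMs) offer strong multimodal capability, but token-by-token decoding imposes a fundamental inference bottleneck. Diffusion VLMs provide a more parallel decoding paradigm, yet directly converting a pretrained autoregressive VLM into a large-block diffusion VLM (dVLM) often causes substantial quality degradation. This paper introduces BARD, a simple and effective bridging framework that converts a pretrained autoregressive VLM into a decoding-efficient dVLM with the same architecture. The proposed method combines progressive supervised block merging, which gradually enlarges the decoding block size, with stage-wise intra-dVLM distillation from a fixed small-block diffusion anchor to recover the performance lost at larger blocks. It also incorporates a mixed noise scheduler to improve robustness and token revision during denoising, and memory-friendly training to enable efficient training on long multimodal sequences. A key empirical finding is that direct autoregressive-to-diffusion distillation is poorly aligned and can even hurt performance, whereas distillation within the diffusion regime is consistently effective. Experiments show that, with $\leq$ 4.4M data, BARD-VL transfers strong multimodal capability from Qwen3-VL to a large-block dVLM. Notably, BARD-VL establishes a new SOTA among comparable-scale open dVLMs on the evaluation suite at both the 4B and 8B scales. At the same time, BARD-VL achieves up to 3$\times$ higher decoding throughput than the source model. Code is available at: https://github.com/fudan-generative-vision/Bard-VL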
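
Illustrative note: the abstract describes gradually enlarging the decoding block size, but it does not give the attention pattern or the block-size schedule. The sketch below is an assumption-laden illustration, not the authors' implementation. It assumes a block-causal attention mask of the kind commonly used in block diffusion models (full attention inside a block, causal attention across blocks) and a hypothetical doubling schedule; the function names `block_causal_mask` and `progressive_block_sizes` and the example sizes are invented for illustration.

```python
def block_causal_mask(seq_len, block_size):
    """Return mask[i][j] == True when position i may attend to position j.

    Assumption (not from the paper): positions attend to everything in
    their own block and in all earlier blocks.
    """
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def progressive_block_sizes(start, target):
    """Hypothetical doubling schedule from `start` up to `target`."""
    sizes = [start]
    while sizes[-1] < target:
        sizes.append(min(sizes[-1] * 2, target))
    return sizes


if __name__ == "__main__":
    # Block size 1 reduces to the ordinary causal (autoregressive) mask.
    for row in block_causal_mask(4, 1):
        print("".join("x" if allowed else "." for allowed in row))
    print()
    # Block size 2 merges adjacent positions into jointly decoded blocks.
    for row in block_causal_mask(4, 2):
        print("".join("x" if allowed else "." for allowed in row))
    print(progressive_block_sizes(4, 32))
```

With block size 1 the mask is lower triangular, which is the autoregressive case; larger block sizes widen the groups of positions that are visible to each other, which is one way to picture what merging blocks changes.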

Related papers