Fugu-MT Paper Translation (Abstract): BcQLM: Efficient Vision-Language Understanding with Distilled Q-Gated Cross-Modal Fusion

Paper overview: BcQLM: Efficient Vision-Language Understanding with Distilled Q-Gated Cross-Modal Fusion

arxiv url: http://arxiv.org/abs/2509.08715v1
Date: Wed, 10 Sep 2025 16:09:49 GMT
Status: Translation complete
Last updated in system: 2025-09-11 15:16:52.487881
Title: BcQLM: Efficient Vision-Language Understanding with Distilled Q-Gated Cross-Modal Fusion
Authors: Sike Xiang, Shuang Chen, Amir Atapour-Abarghouei
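Abstract summary: Multimodal large language models are challenging to deploy in resource-constrained environments. This paper proposes a lightweight MLLM framework for end-to-end visual question answering. The proposed approach centres on BreezeCLIP, a compact yet powerful vision-language encoder optimised for efficient multimodal understanding.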
Reference score (independently computed attention metric): 6.8723394189831035
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As multimodal large language models (MLLMs) advance, their large-scale architectures pose challenges for deployment in resource-constrained environments. In the age of large models, where energy efficiency, computational scalability and environmental sustainability are paramount, the development of lightweight and high-performance models is critical for real-world applications. As such, we propose a lightweight MLLM framework for end-to-end visual question answering. Our proposed approach centres on BreezeCLIP, a compact yet powerful vision-language encoder optimised for efficient multimodal understanding. With only 1.2 billion parameters overall, our model significantly reduces computational cost while achieving performance comparable to standard-size MLLMs. Experiments conducted on multiple datasets further validate its effectiveness in balancing accuracy and efficiency. The modular and extensible design enables generalisation to broader multimodal tasks. The proposed lightweight vision-language framework is denoted as BcQLM (BreezeCLIP-enhanced Q-Gated Multimodal Language Model). It offers a promising path toward deployable MLLMs under practical hardware constraints. The source code is available at https://github.com/thico0224/BcQLM.
