BcQLM: Efficient Vision-Language Understanding with Distilled Q-Gated Cross-Modal Fusion
- URL: http://arxiv.org/abs/2509.08715v1
- Date: Wed, 10 Sep 2025 16:09:49 GMT
- Title: BcQLM: Efficient Vision-Language Understanding with Distilled Q-Gated Cross-Modal Fusion
- Authors: Sike Xiang, Shuang Chen, Amir Atapour-Abarghouei
- Abstract summary: The large-scale architectures of multimodal large language models (MLLMs) pose challenges for deployment in resource-constrained environments. We propose a lightweight MLLM framework for end-to-end visual question answering. Our proposed approach centres on BreezeCLIP, a compact yet powerful vision-language encoder optimised for efficient multimodal understanding.
- Score: 6.8723394189831035
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As multimodal large language models (MLLMs) advance, their large-scale architectures pose challenges for deployment in resource-constrained environments. In the age of large models, where energy efficiency, computational scalability and environmental sustainability are paramount, the development of lightweight and high-performance models is critical for real-world applications. As such, we propose a lightweight MLLM framework for end-to-end visual question answering. Our proposed approach centres on BreezeCLIP, a compact yet powerful vision-language encoder optimised for efficient multimodal understanding. With only 1.2 billion parameters overall, our model significantly reduces computational cost while achieving performance comparable to standard-size MLLMs. Experiments conducted on multiple datasets further validate its effectiveness in balancing accuracy and efficiency. The modular and extensible design enables generalisation to broader multimodal tasks. The proposed lightweight vision-language framework is denoted as BcQLM (BreezeCLIP-enhanced Q-Gated Multimodal Language Model). It offers a promising path toward deployable MLLMs under practical hardware constraints. The source code is available at https://github.com/thico0224/BcQLM.
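To make the fusion idea concrete, here is a minimal PyTorch sketch of a query-gated cross-modal fusion block in the spirit of the paper's Q-gated design. The module name `QGatedFusion`, the dimensions, and the sigmoid gating are illustrative assumptions, not the authors' implementation (see the linked source code for that).

```python
import torch
import torch.nn as nn

class QGatedFusion(nn.Module):
    """Hypothetical sketch of a query-gated cross-modal fusion block:
    learnable queries cross-attend to visual tokens, and a sigmoid gate
    decides how much of the fused signal enters the text stream."""

    def __init__(self, dim: int = 768, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, n_patches, dim) from the vision encoder
        b = visual_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        fused, _ = self.cross_attn(q, visual_tokens, visual_tokens)
        # Gate each fused token before it is handed to the LLM.
        return self.gate(fused) * fused


# Usage: compress 196 ViT patches into 32 gated tokens.
vis = torch.randn(2, 196, 768)
print(QGatedFusion()(vis).shape)  # torch.Size([2, 32, 768])
```

Compressing a full patch grid into a small set of gated query tokens is one common way such designs keep the LLM's input sequence, and hence compute, short.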
Related papers
- Sparse Shortcuts: Facilitating Efficient Fusion in Multimodal Large Language Models [26.305881774348844]
Multimodal large language models (MLLMs) have rapidly advanced in their ability to process data across multiple modalities. In vision-language models, aligning modalities using only high-level visual features often discards the rich semantic information present in mid- and low-level features. We propose SparseCut, a general cross-modal fusion architecture for MLLMs.
arXiv Detail & Related papers (2026-01-31T04:15:42Z) - From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion [91.35078719566472]
Vision-Language Models (VLMs) create a severe visual feature bottleneck by using a crude, asymmetric connection. We introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities.
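A hedged sketch of what a many-to-many cross-layer bridge could look like: each LLM layer reads a learned softmax-weighted mixture of several vision-encoder layers rather than only the final one. `CrossLayerBridge` and its parameterisation are assumptions for illustration, not CLI's actual mechanism.

```python
import torch
import torch.nn as nn

class CrossLayerBridge(nn.Module):
    """Assumed many-to-many bridge: one mixing weight per
    (vision layer, LLM layer) pair, plus a dimension projection."""

    def __init__(self, n_vis_layers: int, n_llm_layers: int,
                 vis_dim: int, llm_dim: int):
        super().__init__()
        self.mix = nn.Parameter(torch.zeros(n_llm_layers, n_vis_layers))
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, vis_feats: list[torch.Tensor], llm_layer: int) -> torch.Tensor:
        # vis_feats: one (batch, n_patches, vis_dim) tensor per vision layer
        w = self.mix[llm_layer].softmax(dim=-1)             # (n_vis_layers,)
        stacked = torch.stack(vis_feats, dim=0)             # (L, B, N, D)
        mixed = (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # (B, N, D)
        return self.proj(mixed)                             # features to inject


bridge = CrossLayerBridge(n_vis_layers=4, n_llm_layers=2, vis_dim=768, llm_dim=1024)
feats = [torch.randn(2, 196, 768) for _ in range(4)]
print(bridge(feats, llm_layer=0).shape)  # torch.Size([2, 196, 1024])
```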
arXiv Detail & Related papers (2026-01-15T18:59:10Z) - VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models [82.05514464090172]
Multimodal large language models (MLLMs) have significantly advanced the integration of visual and textual understanding. However, their ability to generate code from multimodal inputs remains limited. We introduce VisCodex, a unified framework that seamlessly merges vision and coding language models.
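Model merging is often implemented with task vectors; the toy sketch below merges a coding model into a vision-language model under the assumption that both were fine-tuned from the same base. The function `merge_models`, the shared-base assumption, and the single `alpha` coefficient are illustrative, and VisCodex's actual recipe may differ.

```python
import torch

def merge_models(base_sd: dict, vision_sd: dict, coding_sd: dict,
                 alpha: float = 0.5) -> dict:
    """Toy task-vector merge: add a scaled coding 'task vector'
    (coding weights minus shared base weights) onto the vision-language
    model's weights. All three state dicts must share the same keys."""
    merged = {}
    for key, base_w in base_sd.items():
        task_vector = coding_sd[key] - base_w   # what coding fine-tuning changed
        merged[key] = vision_sd[key] + alpha * task_vector
    return merged


# Usage with dummy single-tensor "models".
base = {"w": torch.zeros(4)}
vision = {"w": torch.ones(4)}
coding = {"w": torch.full((4,), 2.0)}
print(merge_models(base, vision, coding)["w"])  # tensor([2., 2., 2., 2.])
```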
arXiv Detail & Related papers (2025-08-13T17:00:44Z) - Efficient Multi-modal Long Context Learning for Training-free Adaptation [96.21248144937627]
This paper introduces Efficient Multi-Modal Long Context Learning (EMLoC), which embeds demonstration examples directly into the model input and condenses long-context multimodal inputs into compact, task-specific memory representations.
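One simple way to condense a long demonstration context into a fixed number of memory tokens is pooling; the sketch below uses adaptive average pooling as a stand-in for EMLoC's actual condensation mechanism, which the abstract does not specify.

```python
import torch

def condense_demos(demo_embeds: torch.Tensor, n_memory: int = 16) -> torch.Tensor:
    """Toy condensation: average-pool a (seq_len, dim) demonstration
    sequence into n_memory slots. Stand-in for the paper's mechanism."""
    return torch.nn.functional.adaptive_avg_pool1d(
        demo_embeds.t().unsqueeze(0),  # (1, dim, seq_len)
        n_memory,
    ).squeeze(0).t()                   # (n_memory, dim)


mem = condense_demos(torch.randn(4096, 768))
print(mem.shape)  # torch.Size([16, 768])
```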
arXiv Detail & Related papers (2025-05-26T10:49:44Z) - LoCoML: A Framework for Real-World ML Inference Pipelines [0.0]
LoCoML is a low-code framework designed to simplify the integration of diverse machine learning models. We show that LoCoML adds only a small amount of computational load, making it efficient and effective for large-scale ML integration.
arXiv Detail & Related papers (2025-01-24T01:35:08Z) - Selective State Space Memory for Large Vision-Language Models [0.0]
State Space Memory Integration (SSMI) is a novel approach for efficient fine-tuning of LVLMs. SSMI captures long-range dependencies and injects task-specific visual and sequential patterns effectively. Experiments on benchmark datasets, including COCO Captioning, VQA, and Flickr30k, demonstrate that SSMI achieves state-of-the-art performance.
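For intuition, here is a minimal gated linear recurrence in the general spirit of selective state-space models; `SelectiveStateMemory` and its input-dependent decay are assumptions for illustration, not SSMI's parameterisation.

```python
import torch
import torch.nn as nn

class SelectiveStateMemory(nn.Module):
    """Assumed sketch: an input-dependent gate interpolates a running
    hidden state, so the memory is 'selective' about what it keeps."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, dim)
        self.inp = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); returns the memory state at every step.
        b, t, d = x.shape
        h = x.new_zeros(b, d)
        outs = []
        for step in range(t):
            a = torch.sigmoid(self.gate(x[:, step]))    # input-dependent decay
            h = a * h + (1 - a) * self.inp(x[:, step])  # selective update
            outs.append(h)
        return torch.stack(outs, dim=1)


print(SelectiveStateMemory(64)(torch.randn(2, 10, 64)).shape)  # (2, 10, 64)
```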
arXiv Detail & Related papers (2024-12-13T05:40:50Z) - AmoebaLLM: Constructing Any-Shape Large Language Models for Efficient and Instant Deployment [13.977849745488339]
AmoebaLLM is a novel framework designed to enable the instant derivation of large language models of arbitrary shapes.
AmoebaLLM significantly facilitates rapid deployment tailored to various platforms and applications.
arXiv Detail & Related papers (2024-11-15T22:02:28Z) - LLaVA-KD: A Framework of Distilling Multimodal Large Language Models [72.68665884790002]
We propose a novel framework to transfer knowledge from large MLLMs (l-MLLMs) to small MLLMs (s-MLLMs). We introduce Multimodal Distillation (MDist) to transfer the teacher model's robust representations across both visual and linguistic modalities. We also propose a three-stage training scheme to fully exploit the potential of the proposed distillation strategy.
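Logit-level distillation is typically a temperature-scaled KL term; MDist extends this kind of matching across visual and linguistic representations. The sketch below shows only the standard logit-level term, not the paper's full objective.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits: torch.Tensor,
                 teacher_logits: torch.Tensor,
                 temperature: float = 2.0) -> torch.Tensor:
    """Standard temperature-scaled KL distillation loss."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2


loss = distill_loss(torch.randn(4, 32000), torch.randn(4, 32000))
print(loss.item())
```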
arXiv Detail & Related papers (2024-10-21T17:41:28Z) - EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings. EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
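A rough sketch of one lightweight alignment pattern: visual tokens are projected into the text embedding space and gated by a pooled instruction summary. The module and its gating are assumptions for illustration, not EMMA's published design.

```python
import torch
import torch.nn as nn

class LightweightVisualAlignment(nn.Module):
    """Assumed sketch: project visual tokens into the text space and
    modulate them with a sigmoid gate derived from the instruction."""

    def __init__(self, vis_dim: int, txt_dim: int):
        super().__init__()
        self.proj = nn.Linear(vis_dim, txt_dim)
        self.modulate = nn.Linear(txt_dim, txt_dim)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # vis: (B, Nv, vis_dim); txt: (B, Nt, txt_dim)
        instr = txt.mean(dim=1, keepdim=True)       # pooled instruction summary
        gate = torch.sigmoid(self.modulate(instr))  # (B, 1, txt_dim)
        return gate * self.proj(vis)                # gated, aligned visual tokens


out = LightweightVisualAlignment(768, 1024)(
    torch.randn(2, 196, 768), torch.randn(2, 12, 1024))
print(out.shape)  # torch.Size([2, 196, 1024])
```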
arXiv Detail & Related papers (2024-10-02T23:00:31Z) - Designing Large Foundation Models for Efficient Training and Inference: A Survey [35.40505841618305]
This paper surveys modern efficient training and inference technologies for foundation models. Model and system design optimise LLM training and inference from different aspects to save computational resources.
arXiv Detail & Related papers (2024-09-03T15:35:01Z) - CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning. We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and a modality-sequential training strategy. We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
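Modality-sequential training can be sketched as a freeze/unfreeze schedule over per-modality fusion adapters; the helper below is a toy illustration (the `train_one_stage` call is hypothetical), not CREMA's actual procedure.

```python
import torch.nn as nn

def modality_sequential_training(model: nn.Module,
                                 adapters: dict[str, nn.Module]) -> None:
    """Freeze the whole model, then unfreeze and train one modality's
    fusion adapter at a time. The adapters are assumed to be submodules
    of `model`, so the global freeze covers them too."""
    for modality, adapter in adapters.items():
        for p in model.parameters():    # freeze everything
            p.requires_grad = False
        for p in adapter.parameters():  # unfreeze the current adapter only
            p.requires_grad = True
        print(f"training stage for modality: {modality}")
        # train_one_stage(model)  # hypothetical per-stage training loop
```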
arXiv Detail & Related papers (2024-02-08T18:27:22Z)