GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices
- URL: http://arxiv.org/abs/2503.06019v1
- Date: Sat, 08 Mar 2025 02:40:29 GMT
- Title: GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices
- Authors: Xudong Lu, Yinghao Chen, Renshou Wu, Haohao Gao, Xi Chen, Xue Yang, Xiangyu Zhao, Aojun Zhou, Fangyuan Li, Yafei Wen, Xiaoxin Chen, Shuai Ren, Hongsheng Li
- Abstract summary: We propose GenieBlue, an efficient MLLM structural design that integrates both linguistic and multimodal capabilities for mobile devices. It acquires multimodal capabilities by duplicating specific transformer blocks for full fine-tuning and integrating lightweight LoRA modules. Deployed on smartphone NPUs, GenieBlue demonstrates efficiency and practicality for applications on mobile devices.
- Score: 46.15092311190904
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have enabled their deployment on mobile devices. However, challenges persist in maintaining strong language capabilities and ensuring hardware compatibility, both of which are crucial for user experience and practical deployment efficiency. In our deployment process, we observe that existing MLLMs often face performance degradation on pure language tasks, and the current NPU platforms on smartphones do not support the MoE architecture, which is commonly used to preserve pure language capabilities during multimodal training. To address these issues, we systematically analyze methods to maintain pure language capabilities during the training of MLLMs, focusing on both training data and model architecture aspects. Based on these analyses, we propose GenieBlue, an efficient MLLM structural design that integrates both linguistic and multimodal capabilities for LLMs on mobile devices. GenieBlue freezes the original LLM parameters during MLLM training to maintain pure language capabilities. It acquires multimodal capabilities by duplicating specific transformer blocks for full fine-tuning and integrating lightweight LoRA modules. This approach preserves language capabilities while achieving comparable multimodal performance through extensive training. Deployed on smartphone NPUs, GenieBlue demonstrates efficiency and practicality for applications on mobile devices.
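As a rough illustration of the structural design the abstract describes (frozen original blocks, duplicated blocks fully fine-tuned for multimodal inputs, and lightweight LoRA adapters), here is a minimal PyTorch-style sketch. The class names, routing flag, and hyperparameters below are hypothetical and do not come from the paper.

```python
import copy
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer augmented with a low-rank trainable update (sketch)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # original weights stay intact
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # LoRA starts as a zero update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

class DuplicatedBlock(nn.Module):
    """Keeps the original transformer block frozen and routes multimodal
    tokens through a duplicated, fully trainable copy (illustrative only)."""
    def __init__(self, original_block: nn.Module):
        super().__init__()
        self.mm_block = copy.deepcopy(original_block)   # duplicated copy, trainable
        self.frozen_block = original_block
        for p in self.frozen_block.parameters():
            p.requires_grad = False                     # preserve pure language behaviour

    def forward(self, hidden_states: torch.Tensor, is_multimodal: bool) -> torch.Tensor:
        block = self.mm_block if is_multimodal else self.frozen_block
        return block(hidden_states)
```

In this reading, only the duplicated blocks and the LoRA parameters receive gradients during multimodal training, which is what keeps the original LLM's pure language behaviour unchanged.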
Related papers
- Liquid: Language Models are Scalable and Unified Multi-modal Generators [112.71734051183726]
Liquid is an auto-regressive generation paradigm that seamlessly integrates visual comprehension and generation. Unlike previous multimodal large language models (MLLMs), Liquid achieves this integration using a single large language model. For the first time, Liquid uncovers a scaling law for the performance drop unavoidably brought by the unified training of visual and language tasks.
arXiv Detail & Related papers (2024-12-05T16:48:16Z) - Improving Multimodal Large Language Models Using Continual Learning [16.28968930137886]
Generative large language models (LLMs) exhibit impressive capabilities, which can be further augmented by integrating a pre-trained vision model into the original LLM.
This study investigates the forgetting that such integration can cause, using the LLaVA MLLM and treating the integration as a continual learning problem.
We evaluate five continual learning methods to mitigate forgetting and identify a technique that enhances visual understanding while minimizing linguistic performance loss.
arXiv Detail & Related papers (2024-10-25T18:50:40Z) - LLaVA-KD: A Framework of Distilling Multimodal Large Language Models [70.19607283302712]
We propose a novel framework to transfer knowledge from a large MLLM (l-MLLM) to a small MLLM (s-MLLM).
Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of l-MLLM and s-MLLM.
We also propose a three-stage training scheme to fully exploit the potential of s-MLLM.
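The summary only states that MDist minimizes the divergence between the output distributions of l-MLLM and s-MLLM; a common way to realize such an objective is a temperature-scaled KL divergence, sketched below. The function name, temperature, and reduction are assumptions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mdist_loss(student_logits: torch.Tensor,
               teacher_logits: torch.Tensor,
               temperature: float = 2.0) -> torch.Tensor:
    """KL(teacher || student) over the vocabulary with temperature scaling."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # The t*t rescaling keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)
```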
arXiv Detail & Related papers (2024-10-21T17:41:28Z) - A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks [74.52259252807191]
Multimodal Large Language Models (MLLMs) address the complexities of real-world applications far beyond the capabilities of single-modality systems.
This paper systematically surveys the applications of MLLMs in multimodal tasks spanning natural language, vision, and audio.
arXiv Detail & Related papers (2024-08-02T15:14:53Z) - MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models [57.091523832149655]
We propose a mixture of multimodal experts (MoME) to mitigate task interference and obtain a generalist MLLM.
Our MoME is composed of two key components: a mixture of vision experts (MoVE) and a mixture of language experts (MoLE).
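As a loose illustration of the mixture-of-experts idea behind MoVE and MoLE, the sketch below shows a token-wise soft router over a few feed-forward experts; the expert types, routing strategy, and class names here are assumptions rather than the paper's design.

```python
import torch
import torch.nn as nn

class SimpleExpertMixture(nn.Module):
    """Soft mixture of feed-forward experts with a learned per-token router."""
    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); compute soft routing weights per token
        weights = torch.softmax(self.router(x), dim=-1)                  # (B, S, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)   # (B, S, D, E)
        return torch.einsum("bsde,bse->bsd", expert_out, weights)
```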
arXiv Detail & Related papers (2024-07-17T16:31:38Z) - Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models [25.724995114710165]
We investigate the design aspects of Multimodal Small Language Models (MSLMs) and propose an efficient multimodal assistant named Mipha.
Our Mipha-3B outperforms the state-of-the-art large MLLMs, especially LLaVA-1.5-13B, on multiple benchmarks.
arXiv Detail & Related papers (2024-03-10T12:43:27Z) - ModaVerse: Efficiently Transforming Modalities with LLMs [25.49713745405194]
We introduce ModaVerse, a Multi-modal Large Language Model capable of comprehending and transforming content across various modalities.
We propose a novel Input/Output (I/O) alignment mechanism that operates directly at the level of natural language.
arXiv Detail & Related papers (2024-01-12T06:28:54Z) - If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code).
Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z) - Confidant: Customizing Transformer-based LLMs via Collaborative Edge
Training [18.526329975259483]
Transformer-based large language models (LLMs) have demonstrated impressive capabilities in a variety of natural language processing (NLP) tasks.
It is challenging to deploy and fine-tune LLMs on mobile edge devices with limited computing, memory, and energy budgets.
We propose Confidant, a multi-backend collaborative training framework for customizing state-of-the-art LLMs on commodity mobile devices.
arXiv Detail & Related papers (2023-11-22T13:20:59Z) - Simultaneous Machine Translation with Large Language Models [51.470478122113356]
We investigate the possibility of applying Large Language Models to SimulMT tasks.
We conducted experiments using the Llama2-7b-chat model on nine different languages from the MUST-C dataset.
The results show that the LLM outperforms dedicated MT models in terms of BLEU and LAAL metrics.
arXiv Detail & Related papers (2023-09-13T04:06:47Z)