SmolVLM: Redefining small and efficient multimodal models
- URL: http://arxiv.org/abs/2504.05299v1
- Date: Mon, 07 Apr 2025 17:58:57 GMT
- Title: SmolVLM: Redefining small and efficient multimodal models
- Authors: Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Vaibhav Srivastav, Joshua Lochner, Hugo Larcher, Mathieu Morlon, Lewis Tunstall, Leandro von Werra, Thomas Wolf
- Abstract summary: We introduce SmolVLM, a series of compact multimodal models specifically engineered for resource-efficient inference. We identify key design choices that yield substantial performance gains on image and video tasks with minimal memory footprints. Our results emphasize that strategic architectural optimizations, aggressive yet efficient tokenization, and carefully curated training data significantly enhance multimodal performance.
- Score: 8.849350918179752
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Vision-Language Models (VLMs) deliver exceptional performance but require significant computational resources, limiting their deployment on mobile and edge devices. Smaller VLMs typically mirror design choices of larger models, such as extensive image tokenization, leading to inefficient GPU memory usage and constrained practicality for on-device applications. We introduce SmolVLM, a series of compact multimodal models specifically engineered for resource-efficient inference. We systematically explore architectural configurations, tokenization strategies, and data curation optimized for low computational overhead. Through this, we identify key design choices that yield substantial performance gains on image and video tasks with minimal memory footprints. Our smallest model, SmolVLM-256M, uses less than 1GB GPU memory during inference and outperforms the 300-times larger Idefics-80B model, despite an 18-month development gap. Our largest model, at 2.2B parameters, rivals state-of-the-art VLMs consuming twice the GPU memory. SmolVLM models extend beyond static images, demonstrating robust video comprehension capabilities. Our results emphasize that strategic architectural optimizations, aggressive yet efficient tokenization, and carefully curated training data significantly enhance multimodal performance, facilitating practical, energy-efficient deployments at significantly smaller scales.
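For readers who want to try the model described in the abstract above, the sketch below shows one common way to run a SmolVLM checkpoint for single-image inference with the Hugging Face `transformers` library. This is a minimal sketch, not code from the paper: the checkpoint ID `HuggingFaceTB/SmolVLM-Instruct`, the chat-template usage, the example image path, and the generation settings are assumptions based on the public release.
```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

MODEL_ID = "HuggingFaceTB/SmolVLM-Instruct"  # assumed released checkpoint ID

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16
).to("cuda")

# Build a single-image chat prompt via the processor's chat template.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

image = Image.open("example.jpg")  # any local image
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```
According to the abstract, the smallest 256M variant runs in under 1 GB of GPU memory during inference; switching variants should only require changing the model ID.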
Related papers
- DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs [124.52164183968145]
We present DyMU, an efficient, training-free framework that reduces the computational burden of vision-language models (VLMs).
Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity.
Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence. (A toy sketch of the similarity-based token-merging idea appears after this list.)
arXiv Detail & Related papers (2025-04-23T18:38:18Z) - CalibQuant: 1-Bit KV Cache Quantization for Multimodal LLMs [45.77132019859689]
CalibQuant is a visual quantization strategy that drastically reduces both memory and computational overhead. We achieve a 10x throughput increase on InternVL models.
arXiv Detail & Related papers (2025-02-15T05:08:01Z) - Selective State Space Memory for Large Vision-Language Models [0.0]
State Space Memory Integration (SSMI) is a novel approach for efficient fine-tuning of LVLMs. SSMI captures long-range dependencies and injects task-specific visual and sequential patterns effectively. Experiments on benchmark datasets, including COCO Captioning, VQA, and Flickr30k, demonstrate that SSMI achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-12-13T05:40:50Z) - Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance [78.48606021719206]
Mini-InternVL is a series of MLLMs with parameters ranging from 1B to 4B, which achieves 90% of the performance with only 5% of the parameters.
We develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer and outperform specialized models in downstream tasks.
arXiv Detail & Related papers (2024-10-21T17:58:20Z) - LV-UNet: A Lightweight and Vanilla Model for Medical Image Segmentation [16.604140484767377]
This paper introduces LV-UNet, a lightweight and vanilla model that leverages pre-trained MobileNetV3-Large backbones and incorporates additional modules. Experimental results on the ISIC 2016, BUSI, CVC-ClinicDB, CVC-ColonDB, and Kvasir-SEG datasets demonstrate a better trade-off between performance and computational load.
arXiv Detail & Related papers (2024-08-29T20:19:10Z) - Fine-Tuning and Deploying Large Language Models Over Edges: Issues and Approaches [64.42735183056062]
Large language models (LLMs) have transitioned from specialized models to versatile foundation models.
LLMs exhibit impressive zero-shot ability; however, they require fine-tuning on local datasets and significant resources for deployment.
arXiv Detail & Related papers (2024-08-20T09:42:17Z) - DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception [66.88792390480343]
We propose DEEM, a simple but effective approach that utilizes the generative feedback of diffusion models to align the semantic distributions of the image encoder.
DEEM exhibits enhanced robustness and a superior capacity to alleviate model hallucinations while utilizing fewer trainable parameters, less pre-training data, and a smaller base model size.
arXiv Detail & Related papers (2024-05-24T05:46:04Z) - FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only
Quantization for LLMs [9.072821427818557]
Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment.
We propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs. (A toy sketch of generic group-wise weight quantization appears after this list.)
We evaluate our approach on large-scale open source models such as OPT-175B and internal MoE models, showcasing minimal accuracy loss while achieving up to 3.65 times higher throughput.
arXiv Detail & Related papers (2023-08-16T23:57:41Z) - Joint Adaptive Representations for Image-Language Learning [59.40890927221377]
We propose a recipe for image-language learning, which produces effective models, outperforming bigger and more expensive ones, often trained on orders of magnitude larger datasets.
Our key finding is the joint learning of a compact vision and language representation, which adaptively and iteratively fuses the multi-modal features.
With only 40M training examples and 39 GFLOPs, our lightweight model outperforms state-of-the-art models that are many times larger, use 2-20x more FLOPs, and are trained on bigger datasets, some with close to 1B training examples.
arXiv Detail & Related papers (2023-05-31T15:02:02Z) - MiniVLM: A Smaller and Faster Vision-Language Model [76.35880443015493]
MiniVLM consists of two modules, a vision feature extractor and a vision-language fusion module.
MiniVLM reduces the model size by 73% and the inference time cost by 94%.
arXiv Detail & Related papers (2020-12-13T03:02:06Z)
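The DyMU entry in the list above describes Dynamic Token Merging, which shrinks the visual token sequence by merging similar token embeddings before they reach the language model. The toy sketch below illustrates only that general idea, greedily averaging the most similar pair of tokens (by cosine similarity) until a target count is reached; it is not the authors' DToMe algorithm, and every name in it is made up for illustration.
```python
import torch
import torch.nn.functional as F

def merge_similar_tokens(tokens: torch.Tensor, keep: int) -> torch.Tensor:
    """Greedily average the most similar pair of token embeddings until
    only `keep` tokens remain. tokens: (num_tokens, dim)."""
    tokens = tokens.clone()
    while tokens.shape[0] > keep:
        normed = F.normalize(tokens, dim=-1)
        sim = normed @ normed.T                      # pairwise cosine similarity
        sim.fill_diagonal_(-float("inf"))            # ignore self-similarity
        i, j = divmod(int(sim.argmax()), sim.shape[1])
        merged = (tokens[i] + tokens[j]) / 2         # average the closest pair
        rest = [k for k in range(tokens.shape[0]) if k not in (i, j)]
        tokens = torch.cat([tokens[rest], merged.unsqueeze(0)], dim=0)
    return tokens

visual_tokens = torch.randn(64, 768)                 # e.g. 64 patch embeddings
print(merge_similar_tokens(visual_tokens, keep=16).shape)  # torch.Size([16, 768])
```
In the papers above, the number of kept tokens is chosen adaptively (e.g. from image complexity) rather than being a fixed `keep` value as in this toy version.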
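The FineQuant entry above describes fine-grained weight-only quantization for LLMs. The snippet below sketches the generic group-wise idea, one scale per small group of weights quantized symmetrically to int8; the group size, bit width, and function names are arbitrary assumptions for illustration and do not reproduce FineQuant's actual method.
```python
import torch

def quantize_groupwise(weight: torch.Tensor, group_size: int = 128):
    """Symmetric int8 weight-only quantization with one scale per group
    of `group_size` consecutive values along the input dimension."""
    out_features, in_features = weight.shape
    assert in_features % group_size == 0, "illustrative constraint only"
    groups = weight.reshape(out_features, in_features // group_size, group_size)
    scales = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(groups / scales), -127, 127).to(torch.int8)
    return q, scales

def dequantize_groupwise(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float weight matrix from int8 groups and scales."""
    return (q.to(torch.float32) * scales).reshape(q.shape[0], -1)

w = torch.randn(512, 1024)                     # a toy weight matrix
q, s = quantize_groupwise(w)
err = (w - dequantize_groupwise(q, s)).abs().max()
print(q.dtype, s.shape, float(err))            # int8, per-group scales, small error
```
Storing `q` plus the per-group scales roughly halves memory relative to 16-bit weights; the weight-only methods cited above push further by using even lower bit widths with carefully chosen group sizes.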