Jina-VLM: Small Multilingual Vision Language Model
- URL: http://arxiv.org/abs/2512.04032v2
- Date: Thu, 04 Dec 2025 12:45:29 GMT
- Title: Jina-VLM: Small Multilingual Vision Language Model
- Authors: Andreas Koukounas, Georgios Mastrapas, Florian Hönicke, Sedigheh Eslami, Guillaume Roncari, Scott Martens, Han Xiao
- Abstract summary: We present Jina-VLM, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present Jina-VLM, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. The model achieves leading results on standard VQA benchmarks and multilingual evaluations while preserving competitive text-only performance. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm .
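The abstract describes the connector only at a high level. The following is a minimal PyTorch sketch, assuming a common attention-pooling design in which a fixed set of learned query tokens cross-attends to the vision encoder's patch embeddings, so an image of any resolution is compressed to the same number of visual tokens. The dimensions (1152 for SigLIP2-style patches, 2048 for the LLM width, 64 queries) and the module structure are illustrative assumptions, not the released implementation.

```python
# Hedged sketch of an attention-pooling connector; the actual Jina-VLM
# connector may differ. Dimension values are illustrative assumptions.
import torch
import torch.nn as nn

class AttentionPoolingConnector(nn.Module):
    def __init__(self, vision_dim=1152, llm_dim=2048, num_queries=64, num_heads=8):
        super().__init__()
        # Learned query tokens: the image is compressed into `num_queries`
        # visual tokens regardless of its resolution / patch count.
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim) * 0.02)
        self.attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(vision_dim)
        self.proj = nn.Linear(vision_dim, llm_dim)  # map into the LLM embedding space

    def forward(self, patch_embeds):  # (batch, num_patches, vision_dim)
        b = patch_embeds.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        pooled, _ = self.attn(q, patch_embeds, patch_embeds)  # cross-attention pooling
        return self.proj(self.norm(pooled))  # (batch, num_queries, llm_dim)

# A 1024-patch image is reduced to 64 visual tokens for the language model.
connector = AttentionPoolingConnector()
print(connector(torch.randn(2, 1024, 1152)).shape)  # torch.Size([2, 64, 2048])
```

Because the number of output tokens is fixed by the learned queries rather than by the input resolution, this kind of connector is what makes arbitrary-resolution inputs token-efficient for the language backbone.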
Related papers
- MASSV: Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models [0.09895793818721334]
We introduce Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models (MASSV). MASSV transforms existing small language models into effective multimodal drafters through a two-phase approach. Experiments across the Qwen2.5-VL and Gemma3 model families demonstrate that MASSV increases accepted length by up to 30% and delivers end-to-end inference speedups of up to 1.46x on visually-grounded tasks; the draft-and-verify loop it accelerates is sketched below.
arXiv Detail & Related papers (2025-05-15T17:37:00Z)
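MASSV's two-phase drafter adaptation is specific to the paper, but the draft-and-verify loop it speeds up is generic. Below is a minimal sketch under stated assumptions: both models are plain callables returning per-position logits, and acceptance greedily checks argmax agreement (the stochastic acceptance rule from the speculative decoding literature is omitted for brevity). The function names are hypothetical.

```python
# Hedged sketch of the generic draft-and-verify loop that speculative
# decoding builds on, with greedy acceptance for brevity.
import torch

def speculative_step(target_logits_fn, draft_logits_fn, prefix, k=4):
    """Draft k tokens with the small model, then verify with the target.

    Both *_logits_fn map a 1-D token tensor to (len, vocab) logits.
    """
    draft = prefix.clone()
    for _ in range(k):  # cheap autoregressive drafting
        next_tok = draft_logits_fn(draft)[-1].argmax()
        draft = torch.cat([draft, next_tok.view(1)])
    # One target forward pass scores every drafted position at once.
    target_preds = target_logits_fn(draft)[len(prefix) - 1 :].argmax(-1)  # k + 1 preds
    drafted = draft[len(prefix):]
    matches = (target_preds[:-1] == drafted).long()
    accepted = int(matches.cumprod(0).sum())  # length of the longest agreeing run
    # Keep the agreeing run, then append the target's own next token.
    return torch.cat([prefix, drafted[:accepted], target_preds[accepted].view(1)])

# Toy demo: the same random-embedding scorer plays both roles, so every
# drafted token agrees with the target and all k drafts are accepted.
table = torch.randn(100, 16)
fn = lambda toks: table[toks] @ table.T
print(speculative_step(fn, fn, torch.tensor([1, 2, 3])))
```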
- APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain of up to 6.03% over MaPLe (SOTA) on novel classes across 10 diverse image recognition datasets; a minimal adapter block of the kind such methods insert is sketched below.
arXiv Detail & Related papers (2023-12-04T01:42:09Z)
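The abstract gives no architectural detail, so the following shows only a standard bottleneck adapter of the kind adapter-based tuning methods insert between frozen encoder layers; APoLLo's learnable prompts and cross-modal coupling are not shown.

```python
# Hedged sketch of a standard bottleneck adapter; not APoLLo's exact design.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Residual down-project / up-project block; only these weights train."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping so the
        nn.init.zeros_(self.up.bias)    # frozen backbone's behavior is preserved

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))
```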
- CogVLM: Visual Expert for Pretrained Language Models [56.69978233342978]
We introduce CogVLM, a powerful open-source visual language foundation model.
CogVLM bridges the gap between the frozen pretrained language model and the image encoder with a trainable visual expert module in the attention and FFN layers; the routing idea is sketched below.
CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC.
arXiv Detail & Related papers (2023-11-06T13:04:39Z)
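A minimal sketch of the routing idea described in the summary: image-token positions go through a trainable parallel ("expert") set of weights while text positions keep the frozen language model's weights. Only the FFN half is shown; per the summary, CogVLM applies the same idea inside the attention projections as well. Shapes and the exact wiring are assumptions.

```python
# Hedged sketch of modality-routed FFN weights; not CogVLM's exact code.
import torch
import torch.nn as nn

class VisualExpertFFN(nn.Module):
    def __init__(self, dim=1024, hidden=4096):
        super().__init__()
        ffn = lambda: nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.text_ffn = ffn()   # stands in for the frozen pretrained LM FFN
        self.image_ffn = ffn()  # trainable visual expert
        for p in self.text_ffn.parameters():
            p.requires_grad = False  # language weights stay frozen

    def forward(self, x, image_mask):
        # x: (batch, seq, dim); image_mask: (batch, seq) bool, True at image tokens
        return torch.where(image_mask.unsqueeze(-1), self.image_ffn(x), self.text_ffn(x))
```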
- Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task Instruction Tuning [27.544311403607786]
We introduce the Ziya-Visual series, a set of bilingual large-scale vision-language models (LVLMs).
Our models adopt the Querying Transformer from BLIP-2 and further explore the assistance of optimization schemes.
In addition, we leverage the multi-modal understanding ability of GPT-4 to translate our gathered English image-text datasets into Chinese.
arXiv Detail & Related papers (2023-10-12T09:39:17Z)
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond [72.41822115096741]
We introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs).
We endow it with visual capacity through a meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus.
The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar model scales.
arXiv Detail & Related papers (2023-08-24T17:59:17Z)
- Generalizing Multimodal Pre-training into Multilingual via Language Acquisition [54.69707237195554]
English-based Vision-Language Pre-training has achieved great success in various downstream tasks.
Some efforts have been taken to generalize this success to non-English languages through Multilingual Vision-Language Pre-training.
We propose a MultiLingual Acquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model into a multilingual one.
arXiv Detail & Related papers (2022-05-29T08:53:22Z)
- Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks; a generic distillation loss of the kind such alignment builds on is sketched below.
arXiv Detail & Related papers (2022-03-12T09:33:37Z)
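VLKD's actual objectives (aligning a PLM with a vision-language model) are richer than plain logit matching; the sketch below shows only the generic temperature-scaled KL distillation loss that knowledge-distillation setups typically build on.

```python
# Hedged, generic sketch of logit-level knowledge distillation; not VLKD's
# exact objective.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    # The t**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * t * t

loss = distillation_loss(torch.randn(8, 512), torch.randn(8, 512))
```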
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM); the masked-prediction pattern behind such objectives is sketched below.
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
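Neither MRTM nor VTLM is fully specified by the summary; the sketch below shows only the shared masked-prediction pattern: hide a random subset of tokens and train the model to recover them from the remaining multimodal context. `model`, `MASK_ID`, and the feature shapes are hypothetical stand-ins.

```python
# Hedged sketch of the masked-prediction pattern behind objectives like
# UC2's VTLM. `model` is a stub for any encoder fusing image and text.
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical [MASK] token id

def masked_lm_loss(model, token_ids, image_feats, mask_prob=0.15):
    """Cross-entropy on randomly masked caption positions only."""
    mask = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
    inputs = token_ids.masked_fill(mask, MASK_ID)
    logits = model(inputs, image_feats)          # (batch, seq, vocab)
    labels = token_ids.masked_fill(~mask, -100)  # -100 = ignored by cross_entropy
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
```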