Viper-F1: Fast and Fine-Grained Multimodal Understanding with Cross-Modal State-Space Modulation
- URL: http://arxiv.org/abs/2511.11177v3
- Date: Tue, 18 Nov 2025 07:12:36 GMT
- Title: Viper-F1: Fast and Fine-Grained Multimodal Understanding with Cross-Modal State-Space Modulation
- Authors: Quoc-Huy Trinh, Mustapha Abdullahi, Do Duy Hung Trinh, Bo Zhao, Debesh Jha,
- Abstract summary: We introduce Viper-F1, a Hybrid State-Space Vision-Language Model that replaces attention with efficient Liquid State-Space Dynamics. We show that Viper-F1 achieves accurate, fine-grained understanding with significantly improved efficiency.
- Score: 7.171333807979583
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrained scenarios such as robotic manipulation, personal assistants, and smart cameras. Most existing methods rely on Transformer-based cross-attention, whose quadratic complexity hinders efficiency. Moreover, small vision-language models often struggle to precisely capture fine-grained, task-relevant visual regions, leading to degraded performance on fine-grained reasoning tasks and limiting their effectiveness in real-world settings. To address these issues, we introduce Viper-F1, a Hybrid State-Space Vision-Language Model that replaces attention with efficient Liquid State-Space Dynamics. To further enhance visual grounding, we propose a Token-Grid Correlation Module, which computes lightweight correlations between text tokens and image patches and modulates the state-space dynamics via FiLM conditioning. This enables the model to selectively emphasize visual regions relevant to the textual prompt while maintaining linear-time inference. Experimental results across multiple benchmarks demonstrate that Viper-F1 achieves accurate, fine-grained understanding with significantly improved efficiency.
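The abstract names two mechanisms: a lightweight correlation between text tokens and image patches, and FiLM conditioning of the state-space dynamics. The sketch below is a minimal, hypothetical PyTorch illustration of that combination only; it is not the authors' implementation, and all class names, projection layers, and the toy diagonal recurrence are assumptions made for clarity.

```python
# Hypothetical sketch of text-token/image-patch correlation feeding FiLM parameters
# that modulate a linear-time state-space recurrence. Not the Viper-F1 code.
import torch
import torch.nn as nn

class TokenGridCorrelationFiLM(nn.Module):
    """Pools image patches under a text-patch correlation map, then emits FiLM (gamma, beta)."""
    def __init__(self, d_text: int, d_patch: int, d_state: int):
        super().__init__()
        self.q = nn.Linear(d_text, d_patch, bias=False)  # project text tokens into patch space
        self.to_gamma = nn.Linear(d_patch, d_state)      # FiLM scale
        self.to_beta = nn.Linear(d_patch, d_state)       # FiLM shift

    def forward(self, text_tokens, patches):
        # text_tokens: (B, T, d_text); patches: (B, P, d_patch)
        q = self.q(text_tokens)                                              # (B, T, d_patch)
        corr = torch.softmax(q @ patches.transpose(1, 2)
                             / patches.shape[-1] ** 0.5, dim=-1)             # (B, T, P) correlation map
        grounded = corr @ patches                                            # text-conditioned patch summary
        pooled = grounded.mean(dim=1)                                        # (B, d_patch)
        return self.to_gamma(pooled), self.to_beta(pooled)                   # (B, d_state) each

class FiLMedSSMBlock(nn.Module):
    """Toy diagonal state-space recurrence whose hidden state is FiLM-modulated each step."""
    def __init__(self, d_model: int, d_state: int):
        super().__init__()
        self.A = nn.Parameter(torch.rand(d_state) * -0.5)  # negative log-decays -> stable dynamics
        self.B = nn.Linear(d_model, d_state, bias=False)
        self.C = nn.Linear(d_state, d_model, bias=False)

    def forward(self, x, gamma, beta):
        # x: (B, L, d_model); gamma, beta: (B, d_state)
        bsz, seq_len, _ = x.shape
        h = x.new_zeros(bsz, self.A.shape[0])
        decay = torch.exp(self.A)                          # per-channel decay in (0, 1)
        outs = []
        for t in range(seq_len):                           # linear in sequence length
            h = decay * h + self.B(x[:, t])
            h_mod = gamma * h + beta                       # FiLM conditioning on the state
            outs.append(self.C(h_mod))
        return torch.stack(outs, dim=1)                    # (B, L, d_model)
```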
Related papers
- From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion [91.35078719566472]
Vision-Language Models (VLMs) create a severe visual feature bottleneck by using a crude, asymmetric connection. We introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities.
arXiv Detail & Related papers (2026-01-15T18:59:10Z) - Vision-LLMs for Spatiotemporal Traffic Forecasting [14.700408329373998]
Large Language Models (LLMs) inherently struggle to model the complex spatial dependencies of grid-based traffic data. We propose ST-Vision-LLM, a novel framework that reframes spatiotemporal forecasting as a vision-language fusion problem. We show that ST-Vision-LLM outperforms existing methods by 15.6% in long-term prediction accuracy and exceeds the second-best baseline by over 30.04% in cross-domain scenarios.
arXiv Detail & Related papers (2025-10-13T11:15:56Z) - CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling [84.51372201195132]
CronusVLA is a unified framework that extends single-frame VLA models to the multi-frame paradigm. CronusVLA achieves leading performance and superior robustness, with a 70.9% success rate. These results highlight the potential of efficient multi-frame adaptation in VLA models for more powerful and robust real-world deployment.
arXiv Detail & Related papers (2025-06-24T17:30:27Z) - LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation [17.318287255400175]
We present LaVi, a novel LVLM that enables seamless and efficient vision-language fusion. Unlike dominant LVLMs that rely on visual token concatenation, LaVi bypasses long-context expansion. Compared to LLaVA-OV-7B, LaVi reduces FLOPs by 94.0%, improves inference speed by 3.1 times, and cuts memory usage in half.
arXiv Detail & Related papers (2025-06-20T02:25:33Z) - DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs [124.52164183968145]
We present DyMU, an efficient, training-free framework that reduces the computational burden of vision-language models (VLMs). Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity. Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence.
arXiv Detail & Related papers (2025-04-23T18:38:18Z) - CoF: Coarse to Fine-Grained Image Understanding for Multi-modal Large Language Models [16.91226496250909]
We break down multi-modal understanding into two stages, from Coarse to Fine. In the first stage, we prompt the MLLM to locate the approximate area of the answer. In the second stage, we further enhance the model's focus on relevant areas through visual prompt engineering.
arXiv Detail & Related papers (2024-12-22T05:42:40Z) - ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning [38.26304604660713]
ADEM-VL is an efficient vision-language method that tunes models based on pretrained large language models.
Our framework surpasses existing methods by an average accuracy of 0.77% on ScienceQA dataset.
arXiv Detail & Related papers (2024-10-23T11:31:06Z) - EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings. EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z) - Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
$\text{EVL}_\text{Gen}$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z) - mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)