CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
Overview
CoME-VL is a modular multi-encoder vision-language framework that integrates a contrastively trained SigLIP2 encoder with a self-supervised DINOv3 encoder to improve both semantic understanding and spatial grounding. The method employs entropy-guided layer selection, orthogonality-regularized multi-layer aggregation, and RoPE-enhanced cross-attention to fuse heterogeneous visual features while keeping the visual token count compact for a decoder-only LLM. Built on the Molmo architecture with a Qwen2-7B language backbone, the framework is evaluated on PixMo benchmarks and RefCOCO. Preliminary analysis and experimental results show that the two encoders contribute complementary strengths: SigLIP2 supports semantic understanding while DINOv3 provides stronger spatial and localization cues.
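The summary above does not include implementation details, so the following is only an illustrative sketch of the token-compaction idea: a small set of learned latent queries cross-attends over the concatenated SigLIP2 and DINOv3 token grids, so the decoder-only LLM receives a fixed, compact number of visual tokens regardless of how many encoder tokens are fused. All shapes, names, and the single-head formulation here are assumptions for illustration (the RoPE position rotation the paper describes is omitted for brevity), not the paper's actual code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compact_cross_attention(queries, features, Wq, Wk, Wv):
    """Cross-attend a small set of learned queries over fused encoder tokens.

    queries:  (n_query, d)   learned latent tokens (n_query << n_tokens)
    features: (n_tokens, d)  concatenated SigLIP2 + DINOv3 features
    Returns:  (n_query, d)   compact visual tokens handed to the LLM.
    """
    q = queries @ Wq
    k = features @ Wk
    v = features @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_query, n_tokens)
    return attn @ v

rng = np.random.default_rng(0)
d = 64
sig_tokens = rng.standard_normal((196, d))   # e.g. a 14x14 SigLIP2 grid
dino_tokens = rng.standard_normal((256, d))  # e.g. a 16x16 DINOv3 grid
fused_in = np.concatenate([sig_tokens, dino_tokens], axis=0)  # 452 tokens in
latents = rng.standard_normal((64, d))       # only 64 compact tokens out
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
out = compact_cross_attention(latents, fused_in, Wq, Wk, Wv)
print(out.shape)
```

The key property is that the output size is set by the number of latent queries, not by the combined encoder token count, which is how a fusion module can avoid the cost of naive token concatenation.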
Novelty
The paper introduces a principled multi-encoder fusion strategy for VLMs that combines entropy-guided layer selection to identify informative features across encoder depths, orthogonality-constrained projections to reduce inter-layer redundancy, and RoPE-based cross-attention to align heterogeneous token grids without increasing the LLM's visual token burden. It is also distinctive in targeting both general visual understanding and fine-grained grounding (pointing, counting, bounding-box detection) within a single decoder-only VLM pipeline.
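The summary does not specify the exact entropy criterion, so the sketch below is one plausible reading (the scoring function and top-k rule are assumptions): score each encoder layer by the mean Shannon entropy of its per-token feature distribution and keep the k most informative layers for aggregation.

```python
import numpy as np

def layer_entropy(feats):
    """Mean Shannon entropy of each token's softmax over feature channels."""
    p = np.exp(feats - feats.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return float((-p * np.log(p + 1e-12)).sum(axis=-1).mean())

def select_layers(hidden_states, k=3):
    """Pick the k layers whose features carry the highest entropy.

    hidden_states: list of (n_tokens, d) arrays, one per encoder layer.
    Returns the indices of the k selected layers, shallowest first.
    """
    scores = [layer_entropy(h) for h in hidden_states]
    return sorted(np.argsort(scores)[-k:].tolist())

rng = np.random.default_rng(1)
# Simulate 12 encoder layers; larger activation scale gives a peakier
# (lower-entropy) channel distribution in deeper layers.
layers = [rng.standard_normal((196, 64)) * (1 + 0.5 * i) for i in range(12)]
print(select_layers(layers, k=3))
```

Whether the actual method scores entropy over channels, over spatial attention maps, or per encoder is not stated in the summary; the point of the sketch is only the select-by-information-score mechanism.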
Results
On PixMo benchmarks, CoME-VL improves over the Molmo single-encoder baseline with reported average gains of 4.9% on visual understanding tasks and 5.4% on grounding tasks. On RefCOCO, it achieves 92.57% on val, 95.36% on testA, and 90.51% on testB, outperforming the CLIP-to-DINO and Qwen-VL baselines by margins of up to +1.66%. Inference time increases modestly over Molmo, from 1.26s to 1.52s per sample, which still remains more efficient than the concatenation-based COMM approach (~2.2s/sample).
Key Points
- CoME-VL fuses SigLIP2 and DINOv3 features through entropy-guided layer selection, orthogonality-regularized aggregation, and RoPE-enhanced cross-attention, avoiding naive token concatenation and its associated computational overhead.
- Experiments indicate complementary roles for the two encoders: SigLIP2 contributes stronger semantic understanding, while DINOv3 improves grounding and localization-sensitive tasks such as pointing and counting.
- The model outperforms the Molmo baseline and feature-merging baselines (CLIP-to-DINO, Qwen-VL) on both PixMo and RefCOCO, with ablations confirming that multi-scale layer aggregation and each fusion component (RoPE alignment, orthogonal regularization) contribute to the gains.
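The summary names orthogonal regularization as one ablated component but not its exact form. A common choice, shown here as an assumed sketch rather than the paper's loss, is a penalty on the squared cosine similarity between different layers' projected features, which pushes each aggregated layer toward contributing non-redundant directions.

```python
import numpy as np

def orthogonality_penalty(projections):
    """Penalize overlap between per-layer projected feature maps.

    projections: list of (n_tokens, d) projected features, one per
    selected layer. Rows are L2-normalized so the penalty measures
    directional overlap only, then the mean squared cross-similarity
    between every pair of distinct layers is summed.
    """
    normed = [p / np.linalg.norm(p, axis=-1, keepdims=True) for p in projections]
    loss = 0.0
    for i in range(len(normed)):
        for j in range(i + 1, len(normed)):
            loss += float(np.square(normed[i] @ normed[j].T).mean())
    return loss

rng = np.random.default_rng(2)
x = rng.standard_normal((32, 16))
q, _ = np.linalg.qr(rng.standard_normal((16, 16)))  # random rotation
identical = orthogonality_penalty([x, x])       # fully redundant layers
rotated = orthogonality_penalty([x, x @ q])     # partly decorrelated layers
print(identical, rotated)
```

Identical layer features incur a larger penalty than decorrelated ones, which is the behavior an inter-layer redundancy regularizer needs during training.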