FuguReport

CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

Authors: Ankan Deria, Komal Kumar, Xilin He, Imran Razzak, Hisham Cholakkal, Fahad Shahbaz Khan, Salman Khan
Affiliations: Mohamed bin Zayed University of Artificial Intelligence
Categories: Method / Vision-Language Learning / Multi-encoder modular fusion; Application / Object Detection / RefCOCO detection task; Evaluation / Model Evaluation / Accuracy improvement over baseline
License: CC BY 4.0

Abstract Overview

CoME-VL is a modular multi-encoder vision-language framework that integrates a contrastively trained SigLIP2 encoder with a self-supervised DINOv3 encoder to improve both semantic understanding and spatial grounding. The method employs entropy-guided layer selection, orthogonality-regularized multi-layer aggregation, and RoPE-enhanced cross-attention to fuse heterogeneous visual features while keeping the visual token count compact for a decoder-only LLM. Built on the Molmo architecture with a Qwen2-7B language backbone, the framework is evaluated on PixMo benchmarks and RefCOCO. Preliminary analysis and experimental results show that the two encoders contribute complementary strengths: SigLIP2 supports semantic understanding while DINOv3 provides stronger spatial and localization cues.
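
To make the entropy-guided layer selection concrete, the sketch below scores each encoder layer by the entropy of its patch-token feature distribution and keeps the top-k layers. This is a minimal illustration, not the paper's implementation: the scoring rule (channel-wise softmax entropy averaged over tokens), the default k, and the function name are assumptions.

```python
import torch
import torch.nn.functional as F

def entropy_guided_layer_selection(hidden_states, k=4):
    """Select the k encoder layers whose features look most informative.

    hidden_states: list of [batch, num_tokens, dim] tensors, one per encoder layer.
    Assumed scoring rule: Shannon entropy of the softmax over feature channels,
    averaged over tokens and the batch; the paper's exact criterion may differ.
    """
    scores = []
    for h in hidden_states:
        p = F.softmax(h, dim=-1)                        # per-token channel distribution
        entropy = -(p * (p + 1e-9).log()).sum(dim=-1)   # [batch, num_tokens]
        scores.append(entropy.mean())                   # one scalar score per layer
    scores = torch.stack(scores)                        # [num_layers]
    top = torch.topk(scores, k).indices                 # indices of the selected layers
    return [hidden_states[i] for i in top.tolist()], top
```

Selecting layers up front keeps the subsequent aggregation over a small, fixed number of feature maps rather than every encoder depth.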

Novelty

The paper introduces a principled multi-encoder fusion strategy for VLMs that combines entropy-guided layer selection to identify informative features across encoder depths, orthogonality-constrained projections to reduce inter-layer redundancy, and RoPE-based cross-attention to align heterogeneous token grids without increasing the LLM's visual token burden. It is also distinctive in targeting both general visual understanding and fine-grained grounding (pointing, counting, bounding-box detection) within a single decoder-only VLM pipeline.
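
As a rough illustration of the orthogonality-constrained aggregation, the module below gives each selected layer its own linear projection, averages the projections into one fused token map, and returns a penalty on pairwise similarity between the pooled, L2-normalized layer representations. The penalty form, the mean-pooling, and the averaging fusion are assumptions standing in for the paper's exact regularizer and aggregation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrthogonalLayerAggregator(nn.Module):
    """Aggregate features from several selected encoder layers while
    discouraging the per-layer projections from carrying redundant content.
    Shapes, fusion rule, and penalty are illustrative assumptions."""

    def __init__(self, num_layers, in_dim, out_dim):
        super().__init__()
        self.projs = nn.ModuleList(nn.Linear(in_dim, out_dim) for _ in range(num_layers))

    def forward(self, layer_feats):
        # layer_feats: list of [batch, tokens, in_dim], one per selected layer
        projected = [proj(h) for proj, h in zip(self.projs, layer_feats)]
        fused = torch.stack(projected, dim=0).mean(dim=0)                # [batch, tokens, out_dim]

        # Pool each projected layer to one vector per sample, L2-normalize,
        # and penalize similarity between different layers (off-diagonal Gram terms).
        pooled = torch.stack([p.mean(dim=1) for p in projected], dim=1)  # [batch, L, out_dim]
        pooled = F.normalize(pooled, dim=-1)
        gram = pooled @ pooled.transpose(1, 2)                           # [batch, L, L]
        eye = torch.eye(gram.size(-1), device=gram.device, dtype=gram.dtype)
        ortho_loss = ((gram - eye) ** 2).sum(dim=(1, 2)).mean()
        return fused, ortho_loss
```

In such a setup the ortho_loss term would be added to the training objective with a small weight, while the fused tokens feed the cross-attention fusion stage.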

Results

On the PixMo benchmarks, CoME-VL improves over the single-encoder Molmo baseline, with reported average gains of 4.9% on visual understanding tasks and 5.4% on grounding tasks. On RefCOCO, it reaches 92.57% on val, 95.36% on testA, and 90.51% on testB, outperforming the CLIP-to-DINO and Qwen-VL baselines by margins of up to +1.66%. Inference time rises modestly from 1.26 s per sample (Molmo) to 1.52 s, which is still well below the concatenation-based COMM approach (~2.2 s/sample).

Key Points

  1. CoME-VL fuses SigLIP2 and DINOv3 features through entropy-guided layer selection, orthogonality-regularized aggregation, and RoPE-enhanced cross-attention (see the fusion sketch after this list), avoiding naive token concatenation and its associated computational overhead.
  2. Experiments indicate complementary roles for the two encoders: SigLIP2 contributes stronger semantic understanding, while DINOv3 improves grounding and localization-sensitive tasks such as pointing and counting.
  3. The model outperforms the Molmo baseline and feature-merging baselines (CLIP-to-DINO, Qwen-VL) on both PixMo and RefCOCO, with ablations confirming that multi-scale layer aggregation and each fusion component (RoPE alignment, orthogonality regularization) contribute to the gains.
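
The sketch below illustrates the RoPE-enhanced cross-attention fusion referenced in point 1: a compact set of query tokens (e.g. the aggregated SigLIP2 features handed to the LLM) attends to a finer grid of key/value tokens (e.g. DINOv3 features), with rotary position embeddings applied to queries and keys so the two token grids are positionally aligned. The 1D rotary scheme over flattened patch indices, the module name, and the head configuration are assumptions; the paper may use a 2D variant and other details.

```python
import torch
import torch.nn as nn

def apply_rope(x, positions, base=10000.0):
    """Apply 1D rotary position embeddings.
    x: [batch, heads, tokens, head_dim] with an even head_dim.
    positions: [tokens] flattened patch indices (a 2D scheme is also plausible)."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, device=x.device, dtype=x.dtype) / half)
    angles = positions.to(x.dtype)[:, None] * freqs[None, :]            # [tokens, half]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class RoPECrossAttention(nn.Module):
    """Compact query tokens cross-attend to a denser key/value token grid."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.q_proj, self.k_proj = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.v_proj, self.out_proj = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def _split(self, x):
        b, n, _ = x.shape
        return x.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, queries, context, q_pos, k_pos):
        # queries: [batch, Nq, dim] compact tokens kept for the LLM
        # context: [batch, Nk, dim] finer-grid features to fuse in
        q = apply_rope(self._split(self.q_proj(queries)), q_pos)
        k = apply_rope(self._split(self.k_proj(context)), k_pos)
        v = self._split(self.v_proj(context))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(queries.shape)
        return self.out_proj(out)
```

Because the output has exactly as many tokens as the compact query set, the LLM's visual token count stays fixed even though the second encoder's grid is denser.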
