CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
Overview
CoME-VL is a modular multi-encoder vision-language framework that integrates a contrastively trained SigLIP2 encoder with a self-supervised DINOv3 encoder to improve both semantic understanding and spatial grounding. The method employs entropy-guided layer selection, orthogonality-regularized multi-layer aggregation, and RoPE-enhanced cross-attention to fuse heterogeneous visual features while keeping the visual token count compact for a decoder-only LLM. Built on the Molmo architecture with a Qwen2-7B language backbone, the framework is evaluated on PixMo benchmarks and RefCOCO. Preliminary analysis and experimental results show that the two encoders contribute complementary strengths: SigLIP2 supports semantic understanding while DINOv3 provides stronger spatial and localization cues.
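The summary above does not include implementation details, so the following is only an illustrative sketch of the token-compaction idea: a small set of learned latent queries cross-attends over the concatenated SigLIP2 and DINOv3 token grids, so the decoder-only LLM receives a fixed, compact number of visual tokens regardless of how many encoder tokens are fused. All shapes, names, and the single-head formulation here are assumptions for illustration (the RoPE position rotation the paper describes is omitted for brevity), not the paper's actual code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compact_cross_attention(queries, features, Wq, Wk, Wv):
    """Cross-attend a small set of learned queries over fused encoder tokens.

    queries:  (n_query, d)   learned latent tokens (n_query << n_tokens)
    features: (n_tokens, d)  concatenated SigLIP2 + DINOv3 features
    Returns:  (n_query, d)   compact visual tokens handed to the LLM.
    """
    q = queries @ Wq
    k = features @ Wk
    v = features @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_query, n_tokens)
    return attn @ v

rng = np.random.default_rng(0)
d = 64
sig_tokens = rng.standard_normal((196, d))   # e.g. a 14x14 SigLIP2 grid
dino_tokens = rng.standard_normal((256, d))  # e.g. a 16x16 DINOv3 grid
fused_in = np.concatenate([sig_tokens, dino_tokens], axis=0)  # 452 tokens in
latents = rng.standard_normal((64, d))       # only 64 compact tokens out
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
out = compact_cross_attention(latents, fused_in, Wq, Wk, Wv)
print(out.shape)
```

The key property is that the output size is set by the number of latent queries, not by the combined encoder token count, which is how a fusion module can avoid the cost of naive token concatenation.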
Novelty
The paper introduces a principled multi-encoder fusion strategy for VLMs that combines entropy-guided layer selection to identify informative features across encoder depths, orthogonality-constrained projections to reduce inter-layer redundancy, and RoPE-based cross-attention to align heterogeneous token grids without increasing the LLM's visual token burden. It is also distinctive in targeting both general visual understanding and fine-grained grounding (pointing, counting, bounding-box detection) within a single decoder-only VLM pipeline.
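The summary does not specify the exact entropy criterion, so the sketch below is one plausible reading (the scoring function and top-k rule are assumptions): score each encoder layer by the mean Shannon entropy of its per-token feature distribution and keep the k most informative layers for aggregation.

```python
import numpy as np

def layer_entropy(feats):
    """Mean Shannon entropy of each token's softmax over feature channels."""
    p = np.exp(feats - feats.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return float((-p * np.log(p + 1e-12)).sum(axis=-1).mean())

def select_layers(hidden_states, k=3):
    """Pick the k layers whose features carry the highest entropy.

    hidden_states: list of (n_tokens, d) arrays, one per encoder layer.
    Returns the indices of the k selected layers, shallowest first.
    """
    scores = [layer_entropy(h) for h in hidden_states]
    return sorted(np.argsort(scores)[-k:].tolist())

rng = np.random.default_rng(1)
# Simulate 12 encoder layers; larger activation scale gives a peakier
# (lower-entropy) channel distribution in deeper layers.
layers = [rng.standard_normal((196, 64)) * (1 + 0.5 * i) for i in range(12)]
print(select_layers(layers, k=3))
```

Whether the actual method scores entropy over channels, over spatial attention maps, or per encoder is not stated in the summary; the point of the sketch is only the select-by-information-score mechanism.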
Results
On PixMo benchmarks, CoME-VL improves over the Molmo single-encoder baseline with reported average gains of 4.9% on visual understanding tasks and 5.4% on grounding tasks. On RefCOCO, it achieves 92.57% on val, 95.36% on testA, and 90.51% on testB, outperforming the CLIP-to-DINO and Qwen-VL baselines by margins of up to +1.66%. Inference time increases modestly over Molmo, from 1.26s to 1.52s per sample, which still remains more efficient than the concatenation-based COMM approach (~2.2s/sample).
Key Points
- CoME-VL fuses SigLIP2 and DINOv3 features through entropy-guided layer selection, orthogonality-regularized aggregation, and RoPE-enhanced cross-attention, avoiding naive token concatenation and its associated computational overhead.
- Experiments indicate complementary roles for the two encoders: SigLIP2 contributes stronger semantic understanding, while DINOv3 improves grounding and localization-sensitive tasks such as pointing and counting.
- The model outperforms the Molmo baseline and feature-merging baselines (CLIP-to-DINO, Qwen-VL) on both PixMo and RefCOCO, with ablations confirming that multi-scale layer aggregation and each fusion component (RoPE alignment, orthogonal regularization) contribute to the gains.
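The summary names orthogonal regularization as one ablated component but not its exact form. A common choice, shown here as an assumed sketch rather than the paper's loss, is a penalty on the squared cosine similarity between different layers' projected features, which pushes each aggregated layer toward contributing non-redundant directions.

```python
import numpy as np

def orthogonality_penalty(projections):
    """Penalize overlap between per-layer projected feature maps.

    projections: list of (n_tokens, d) projected features, one per
    selected layer. Rows are L2-normalized so the penalty measures
    directional overlap only, then the mean squared cross-similarity
    between every pair of distinct layers is summed.
    """
    normed = [p / np.linalg.norm(p, axis=-1, keepdims=True) for p in projections]
    loss = 0.0
    for i in range(len(normed)):
        for j in range(i + 1, len(normed)):
            loss += float(np.square(normed[i] @ normed[j].T).mean())
    return loss

rng = np.random.default_rng(2)
x = rng.standard_normal((32, 16))
q, _ = np.linalg.qr(rng.standard_normal((16, 16)))  # random rotation
identical = orthogonality_penalty([x, x])       # fully redundant layers
rotated = orthogonality_penalty([x, x @ q])     # partly decorrelated layers
print(identical, rotated)
```

Identical layer features incur a larger penalty than decorrelated ones, which is the behavior an inter-layer redundancy regularizer needs during training.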