Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models
- URL: http://arxiv.org/abs/2505.20612v1
- Date: Tue, 27 May 2025 01:24:29 GMT
- Title: Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models
- Authors: Peter Robicheaux, Matvei Popov, Anish Madan, Isaac Robinson, Joseph Nelson, Deva Ramanan, Neehar Peri
- Abstract summary: We introduce Roboflow100-VL, a large-scale collection of 100 multi-modal object detection datasets. We evaluate state-of-the-art models on our benchmark in zero-shot, few-shot, semi-supervised, and fully-supervised settings.
- Score: 35.79522480146796
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models (VLMs) trained on internet-scale data achieve remarkable zero-shot detection performance on common objects like car, truck, and pedestrian. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. Rather than simply re-training VLMs on more visual data, we argue that one should align VLMs to new concepts with annotation instructions containing a few visual examples and rich textual descriptions. To this end, we introduce Roboflow100-VL, a large-scale collection of 100 multi-modal object detection datasets with diverse concepts not commonly found in VLM pre-training. We evaluate state-of-the-art models on our benchmark in zero-shot, few-shot, semi-supervised, and fully-supervised settings, allowing for comparison across data regimes. Notably, we find that VLMs like GroundingDINO and Qwen2.5-VL achieve less than 2% zero-shot accuracy on challenging medical imaging datasets within Roboflow100-VL, demonstrating the need for few-shot concept alignment. Our code and dataset are available at https://github.com/roboflow/rf100-vl/ and https://universe.roboflow.com/rf100-vl/
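As a rough illustration of the zero-shot setting described above (a sketch, not the benchmark's official evaluation code), the snippet below runs an open-vocabulary detector such as GroundingDINO through the Hugging Face transformers interface, using a dataset's class names as the text prompt. The checkpoint name, image path, class names, and score thresholds are illustrative assumptions, and post-processing argument names may differ across transformers versions.
```python
# Minimal zero-shot detection sketch (assumed setup, not the authors' pipeline).
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"   # any GroundingDINO checkpoint on the Hub
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open("sample.jpg")                 # hypothetical image from an RF100-VL dataset
classes = ["tumor", "lesion"]                    # hypothetical class names from its annotations
prompt = ". ".join(classes) + "."                # GroundingDINO expects lowercase, period-separated phrases

inputs = processor(images=image, text=prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Map raw outputs back to boxes/scores/labels; thresholds are placeholders,
# and argument names can vary slightly between transformers versions.
results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.35,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)[0]
for box, score, label in zip(results["boxes"], results["scores"], results["labels"]):
    print(label, round(score.item(), 3), [round(v) for v in box.tolist()])
```
Predictions from a loop like this can then be scored with standard COCO-style mAP against each dataset's annotations, which is how zero-shot numbers such as those quoted above are typically computed.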
Related papers
- Evaluating Cell Type Inference in Vision Language Models Under Varying Visual Context [0.16385815610837165]
Vision-Language Models (VLMs) have rapidly advanced alongside Large Language Models (LLMs). This study evaluates the capabilities of prominent generative VLMs, such as GPT-4.1 and Gemini 2.5 Pro, for histopathology image classification tasks.
arXiv Detail & Related papers (2025-06-15T01:50:16Z) - Membership Inference Attacks against Large Vision-Language Models [40.996912464828696]
Large vision-language models (VLLMs) exhibit promising capabilities for processing multi-modal tasks across various application scenarios.
Their emergence also raises significant data security concerns, given the potential inclusion of sensitive information, such as private photos and medical records.
Detecting inappropriately used data in VLLMs remains a critical and unresolved issue.
arXiv Detail & Related papers (2024-11-05T08:35:08Z) - VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks [60.5257456681402]
We study the potential for building universal embeddings capable of handling a wide range of downstream tasks. We build a series of VLM2Vec models on SoTA VLMs like Phi-3.5-V and LLaVA-1.6, and evaluate them on MMEB's evaluation split. Our results show that VLM2Vec achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models.
arXiv Detail & Related papers (2024-10-07T16:14:05Z) - The Neglected Tails in Vision-Language Models [51.79913798808725]
We show that vision-language models (VLMs) excel in zero-shot recognition but their performance varies greatly across different visual concepts.
We propose REtrieval-Augmented Learning (REAL) to mitigate the imbalanced performance of zero-shot VLMs.
arXiv Detail & Related papers (2024-01-23T01:25:00Z) - Revisiting Few-Shot Object Detection with Vision-Language Models [49.79495118650838]
We revisit the task of few-shot object detection (FSOD) in the context of recent foundational vision-language models (VLMs).
We propose Foundational FSOD, a new benchmark protocol that evaluates detectors pre-trained on any external data.
We discuss our recent CVPR 2024 Foundational FSOD competition and share insights from the community.
arXiv Detail & Related papers (2023-12-22T07:42:00Z) - Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models [31.69213233651326]
We introduce the novel task of Visual Data-Type Identification.
An extensive zero-shot evaluation of 39 vision-language models (VLMs) shows a nuanced performance landscape.
arXiv Detail & Related papers (2023-10-12T17:59:30Z) - ConStruct-VL: Data-Free Continual Structured VL Concepts Learning [57.86651057895222]
We introduce the first Continual Data-Free Structured VL Concepts Learning (ConStruct-VL) benchmark.
We propose a data-free method built around a new approach, Adversarial Pseudo-Replay (APR), which generates adversarial reminders of past tasks from past task models.
We show this approach outperforms all data-free methods by as much as 7% while even matching some levels of experience-replay.
arXiv Detail & Related papers (2022-11-17T18:57:03Z) - Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks [118.49566068398642]
Cross-modal encoders for vision-language (VL) tasks are often pretrained with carefully curated vision-language datasets.
Unimodal encoders are pretrained with simpler annotations that are less cost-prohibitive, achieving scales of hundreds of millions to billions.
We propose Multimodal Adaptive Distillation (MAD), which adaptively distills useful knowledge from pretrained encoders to cross-modal VL encoders.
arXiv Detail & Related papers (2022-04-22T04:41:04Z) - TraVLR: Now You See It, Now You Don't! A Bimodal Dataset for Evaluating Visio-Linguistic Reasoning [25.520406167426135]
We present TraVLR, a synthetic dataset comprising four visio-linguistic (V+L) reasoning tasks.
Each example in TraVLR redundantly encodes the scene in two modalities, allowing either to be dropped or added during training or testing without losing relevant information.
We compare the performance of four state-of-the-art V+L models, finding that while they perform well on test examples from the same modality, they all fail at cross-modal transfer.
arXiv Detail & Related papers (2021-11-21T07:22:44Z)