How to Evaluate the Generalization of Detection? A Benchmark for
Comprehensive Open-Vocabulary Detection
- URL: http://arxiv.org/abs/2308.13177v2
- Date: Mon, 18 Dec 2023 07:29:55 GMT
- Title: How to Evaluate the Generalization of Detection? A Benchmark for
Comprehensive Open-Vocabulary Detection
- Authors: Yiyang Yao, Peng Liu, Tiancheng Zhao, Qianqian Zhang, Jiajia Liao,
Chunxin Fang, Kyusong Lee, Qing Wang
- Abstract summary: We propose a new benchmark named OVDEval, which includes 9 sub-tasks and introduces evaluations on commonsense knowledge.
The dataset is meticulously created to provide hard negatives that challenge models' true understanding of visual and linguistic input.
- Score: 25.506346503624894
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Object detection (OD) in computer vision has made significant progress in
recent years, transitioning from closed-set labels to open-vocabulary detection
(OVD) based on large-scale vision-language pre-training (VLP). However, current
evaluation methods and datasets are limited to testing generalization over
object types and referral expressions, which do not provide a systematic,
fine-grained, and accurate benchmark of OVD models' abilities. In this paper,
we propose a new benchmark named OVDEval, which includes 9 sub-tasks and
introduces evaluations on commonsense knowledge, attribute understanding,
position understanding, object relation comprehension, and more. The dataset is
meticulously created to provide hard negatives that challenge models' true
understanding of visual and linguistic input. Additionally, we identify a
problem with the popular Average Precision (AP) metric when benchmarking models
on these fine-grained label datasets and propose a new metric called
Non-Maximum Suppression Average Precision (NMS-AP) to address this issue.
Extensive experimental results show that existing top OVD models all fail on
the new tasks except for simple object types, demonstrating the value of the
proposed dataset in pinpointing the weakness of current OVD models and guiding
future research. Furthermore, the proposed NMS-AP metric is verified by
experiments to provide a much more truthful evaluation of OVD models, whereas
traditional AP metrics yield deceptive results. Data is available at
https://github.com/om-ai-lab/OVDEval
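The abstract only names the failure mode that motivates NMS-AP: on fine-grained labels with hard negatives, a detector that emits the same box under every candidate label can still score well under plain AP. Below is a minimal, hedged sketch of that idea in Python. It is not the authors' reference implementation (the linked repository has that); the class-agnostic NMS step, the 0.5 IoU threshold, the function names, and the simplified single-image AP are illustrative assumptions.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def class_agnostic_nms(preds, iou_thr=0.5):
    """Keep only the highest-scoring prediction among overlapping boxes,
    regardless of their labels. preds: list of (box, label, score)."""
    preds = sorted(preds, key=lambda p: p[2], reverse=True)
    kept = []
    for p in preds:
        if all(iou(p[0], k[0]) < iou_thr for k in kept):
            kept.append(p)
    return kept

def average_precision(preds, gts, iou_thr=0.5):
    """Simplified single-image AP, for illustration only: a prediction is a
    true positive if its label matches an unmatched ground truth it overlaps.
    gts: list of (box, label)."""
    preds = sorted(preds, key=lambda p: p[2], reverse=True)
    if not preds or not gts:
        return 0.0
    matched, tps = set(), []
    for box, label, _ in preds:
        hit = next((i for i, (gbox, glabel) in enumerate(gts)
                    if i not in matched and glabel == label
                    and iou(box, gbox) >= iou_thr), None)
        tps.append(hit is not None)
        if hit is not None:
            matched.add(hit)
    tps = np.array(tps, dtype=float)
    precision = np.cumsum(tps) / (np.arange(len(tps)) + 1)
    # Mean precision at each true-positive rank, averaged over all ground truths.
    return float(np.sum(precision * tps) / len(gts))

def nms_ap(preds, gts, iou_thr=0.5):
    """Hedged sketch of NMS-AP: suppress duplicate boxes across labels first,
    then compute AP, so predicting one box under every label is penalized."""
    return average_precision(class_agnostic_nms(preds, iou_thr), gts, iou_thr)
```

Under this reading, a model that predicts one box under the ground-truth label and the same box under every hard-negative label keeps only its single highest-scoring duplicate after class-agnostic suppression; if that survivor carries the wrong label, the error counts against the model instead of being hidden among redundant detections.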
Related papers
- Open-set object detection: towards unified problem formulation and benchmarking [2.4374097382908477]
We introduce two benchmarks: a unified VOC-COCO evaluation and the new OpenImagesRoad benchmark, which provides a clear hierarchical object definition along with new evaluation metrics.
State-of-the-art methods are extensively evaluated on the proposed benchmarks.
This study provides a clear problem definition, ensures consistent evaluations, and draws new conclusions about the effectiveness of OSOD strategies.
arXiv Detail & Related papers (2024-11-08T13:40:01Z) - Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is preferred by human annotators over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z) - Open-Set Recognition in the Age of Vision-Language Models [9.306738687897889]
We investigate whether vision-language models (VLMs) for open-vocabulary perception are inherently open-set models because they are trained on internet-scale datasets.
We find they introduce closed-set assumptions via their finite query set, making them vulnerable to open-set conditions.
We show that naively increasing the size of the query set to contain more and more classes does not mitigate this problem, but instead degrades both task performance and open-set performance.
arXiv Detail & Related papers (2024-03-25T08:14:22Z) - Weakly Supervised Open-Vocabulary Object Detection [31.605276665964787]
We propose a novel weakly supervised open-vocabulary object detection framework, namely WSOVOD, to extend traditional WSOD.
To achieve this, we explore three vital strategies, including dataset-level feature adaptation, image-level salient object localization, and region-level vision-language alignment.
arXiv Detail & Related papers (2023-12-19T18:59:53Z) - Interpretable Out-Of-Distribution Detection Using Pattern Identification [0.0]
Out-of-distribution (OoD) detection for data-based programs is a goal of paramount importance.
Common approaches in the literature tend to train detectors that require in-distribution (IoD) and OoD validation samples.
We propose to use existing work from the field of explainable AI, namely the PARTICUL pattern identification algorithm, in order to build more interpretable and robust OoD detectors.
arXiv Detail & Related papers (2023-01-24T15:35:54Z) - Discover, Explanation, Improvement: An Automatic Slice Detection
Framework for Natural Language Processing [72.14557106085284]
Slice detection models (SDMs) automatically identify underperforming groups of datapoints.
This paper proposes a benchmark named "Discover, Explain, Improve (DEIM)" for NLP classification tasks.
Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
arXiv Detail & Related papers (2022-11-08T19:00:00Z) - Open Vocabulary Object Detection with Proposal Mining and Prediction
Equalization [73.14053674836838]
Open-vocabulary object detection (OVD) aims to scale up vocabulary size to detect objects of novel categories beyond the training vocabulary.
Recent work resorts to the rich knowledge in pre-trained vision-language models.
We present MEDet, a novel OVD framework with proposal mining and prediction equalization.
arXiv Detail & Related papers (2022-06-22T14:30:41Z) - A Closer Look at Debiased Temporal Sentence Grounding in Videos:
Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric, "dR@n,IoU@m", which discounts the basic recall scores to counteract the score inflation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z) - A Revised Generative Evaluation of Visual Dialogue [80.17353102854405]
We propose a revised evaluation scheme for the VisDial dataset.
We measure consensus between answers generated by the model and a set of relevant answers.
We release these sets and code for the revised evaluation scheme as DenseVisDial.
arXiv Detail & Related papers (2020-04-20T13:26:45Z) - Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence of each query sample so as to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
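For the transductive prototype update summarized in the last entry above, the following sketch shows the generic confidence-weighted refinement it builds on. The softmax-over-distances confidence used here is a fixed stand-in, and the function name and tensor shapes are illustrative assumptions; in the paper the confidence itself is meta-learned.

```python
import numpy as np

def refine_prototypes(prototypes, query_feats, temperature=1.0):
    """Transductive prototype refinement: weight each unlabeled query by a
    per-class confidence and fold it back into the class prototype.
    prototypes: (C, D) initial prototypes from the labeled support set.
    query_feats: (Q, D) embeddings of unlabeled query examples."""
    # Negative squared Euclidean distance as the similarity logit.
    dists = ((query_feats[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    logits = -dists / temperature
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    conf = np.exp(logits)
    conf /= conf.sum(axis=1, keepdims=True)          # (Q, C) soft assignments

    # New prototype: original prototype plus the confidence-weighted query mean.
    weighted_sum = conf.T @ query_feats               # (C, D)
    weights = conf.sum(axis=0)[:, None]               # (C, 1)
    return (prototypes + weighted_sum) / (1.0 + weights)
```

The meta-learned part of the method replaces exactly this fixed softmax confidence with a learned confidence function, which the sketch deliberately omits.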
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.