ECCV Caption: Correcting False Negatives by Collecting
Machine-and-Human-verified Image-Caption Associations for MS-COCO
- URL: http://arxiv.org/abs/2204.03359v5
- Date: Wed, 3 Jan 2024 05:30:54 GMT
- Title: ECCV Caption: Correcting False Negatives by Collecting
Machine-and-Human-verified Image-Caption Associations for MS-COCO
- Authors: Sanghyuk Chun, Wonjae Kim, Song Park, Minsuk Chang, Seong Joon Oh
- Abstract summary: We construct the Extended COCO Validation (ECCV) Caption dataset by supplying the missing associations with machine and human annotators.
Our dataset provides 3.6× more positive image-to-caption associations and 8.5× more caption-to-image associations than the original MS-COCO.
Our findings are that the existing benchmarks, such as COCO 1K R@K, COCO 5K R@K, and CxC R@1, are highly correlated with each other, while the rankings change when we shift to the ECCV mAP@R.
- Score: 47.61229316655264
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-Text matching (ITM) is a common task for evaluating the quality of
Vision and Language (VL) models. However, existing ITM benchmarks have a
significant limitation. They have many missing correspondences, originating
from the data construction process itself. For example, a caption is only
matched with one image although the caption can be matched with other similar
images and vice versa. To correct the massive false negatives, we construct the
Extended COCO Validation (ECCV) Caption dataset by supplying the missing
associations with machine and human annotators. We employ five state-of-the-art
ITM models with diverse properties for our annotation process. Our dataset
provides 3.6× more positive image-to-caption associations and 8.5× more
caption-to-image associations than the original MS-COCO. We also propose to use an
informative ranking-based metric mAP@R, rather than the popular Recall@K (R@K).
We re-evaluate 25 existing VL models on the existing and proposed benchmarks.
Our findings are that the existing benchmarks, such as COCO 1K R@K, COCO 5K
R@K, and CxC R@1, are highly correlated with each other, while the rankings
change when we shift to the ECCV mAP@R. Lastly, we delve into the effect of the bias
introduced by the choice of machine annotator. Source code and dataset are
available at https://github.com/naver-ai/eccv-caption
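To make the metric discussion concrete, below is a minimal NumPy sketch (not the authors' released code) of the two metrics the paper contrasts: Recall@K, which only checks whether any positive appears in the top K, and mAP@R, which scores how a query's R positives are ranked. The boolean match-matrix layout and the function names are our own illustrative assumptions.

```python
import numpy as np

def recall_at_k(ranked_hits: np.ndarray, k: int) -> float:
    """Recall@K: fraction of queries with at least one positive
    among the top-k retrieved items.

    ranked_hits[q, i] is True iff the i-th ranked gallery item
    for query q is a true positive."""
    return float(ranked_hits[:, :k].any(axis=1).mean())

def map_at_r(ranked_hits: np.ndarray, num_positives: np.ndarray) -> float:
    """mAP@R: for a query with R positives, average the precision at
    every rank i <= R where the i-th item is a positive (counting 0
    for ranks that are misses), then average over queries."""
    scores = []
    for hits, r in zip(ranked_hits, num_positives):
        top_r = hits[:r].astype(float)
        precision = np.cumsum(top_r) / (np.arange(r) + 1.0)
        scores.append((precision * top_r).sum() / r)
    return float(np.mean(scores))

# Toy example: 2 queries, ranked match indicators over a 5-item gallery.
ranked = np.array([[1, 0, 1, 0, 0],
                   [0, 0, 0, 1, 1]], dtype=bool)
n_pos = np.array([2, 2])            # each query has R = 2 positives
print(recall_at_k(ranked, 5))       # 1.0 -- both queries "succeed" at K=5
print(map_at_r(ranked, n_pos))      # 0.25 -- the second ranking is penalized
```

Note how Recall@5 reports a perfect score for both queries even though the second ranking places its positives last; mAP@R exposes exactly the kind of ranking differences that, per the paper, R@K hides.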
Related papers
- BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues [47.213906345208315]
We propose BRIDGE, a new learnable and reference-free image captioning metric.
BRIDGE achieves state-of-the-art results compared with existing reference-free evaluation scores.
arXiv Detail & Related papers (2024-07-29T18:00:17Z)
- Assessing Brittleness of Image-Text Retrieval Benchmarks from Vision-Language Models Perspective [44.045767657945895]
We examine the brittleness of the image-text retrieval (ITR) evaluation pipeline with a focus on concept granularity.
We evaluate four diverse state-of-the-art Vision-Language models on both the standard and fine-grained datasets under zero-shot conditions.
The results demonstrate that although perturbations generally degrade model performance, the fine-grained datasets exhibit a smaller performance drop than their standard counterparts.
arXiv Detail & Related papers (2024-07-21T18:08:44Z)
- A Recipe for CAC: Mosaic-based Generalized Loss for Improved Class-Agnostic Counting [27.439965991083177]
Class-agnostic counting (CAC) is a computer vision task that counts the total number of occurrences of an arbitrary reference object in a query image.
In a multi-class setting, models tend to ignore the reference images and instead blindly match all dominant objects in the query image.
We introduce a new evaluation protocol and metrics that resolve the problems behind the existing CAC evaluation scheme.
arXiv Detail & Related papers (2024-04-15T14:23:39Z)
- Image Similarity using An Ensemble of Context-Sensitive Models [2.9490616593440317]
We present a more intuitive approach to building and comparing image similarity models based on labelled data.
We address the challenges of sparse sampling in the image space (R, A, B) and biases in the models trained with context-based data.
Our testing results show that the constructed ensemble model performs 5% better than the best individual context-sensitive model.
arXiv Detail & Related papers (2024-01-15T20:23:05Z)
- GeneCIS: A Benchmark for General Conditional Image Similarity [21.96493413291777]
We argue that there are many notions of 'similarity' and that models, like humans, should be able to adapt to these dynamically.
We propose the GeneCIS benchmark, which measures models' ability to adapt to a range of similarity conditions.
arXiv Detail & Related papers (2023-06-13T17:59:58Z)
- Zero-shot Composed Text-Image Retrieval [72.43790281036584]
We consider the problem of composed image retrieval (CIR).
It aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, extending the user's expression ability.
arXiv Detail & Related papers (2023-06-12T17:56:01Z)
- Improving Zero-shot Generalization and Robustness of Multi-modal Models [70.14692320804178]
Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks, yet a sizeable gap remains between their top-5 and top-1 zero-shot accuracy.
We investigate the reasons for this performance gap and find that many of the failure cases are caused by ambiguity in the text prompts.
We propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy.
arXiv Detail & Related papers (2022-12-04T07:26:24Z)
- Exploring Discrete Diffusion Models for Image Captioning [104.69608826164216]
We present a diffusion-based captioning model, dubbed DDCap, to allow more decoding flexibility.
We propose several key techniques including best-first inference, concentrated attention mask, text length prediction, and image-free training.
With 4M vision-language pre-training images and the base-sized model, we reach a CIDEr score of 125.1 on COCO.
arXiv Detail & Related papers (2022-11-21T18:12:53Z)
- SSCR: Iterative Language-Based Image Editing via Self-Supervised Counterfactual Reasoning [79.30956389694184]
Iterative Language-Based Image Editing (ILBIE) tasks follow iterative instructions to edit images step by step.
Data scarcity is a significant issue for ILBIE, as it is challenging to collect large-scale examples of images before and after instruction-based changes.
We introduce a Self-Supervised Counterfactual Reasoning framework that incorporates counterfactual thinking to overcome data scarcity.
arXiv Detail & Related papers (2020-09-21T01:45:58Z)
- High-Order Information Matters: Learning Relation and Topology for Occluded Person Re-Identification [84.43394420267794]
We propose a novel framework by learning high-order relation and topology information for discriminative features and robust alignment.
Our framework significantly outperforms the state-of-the-art by 6.5% mAP on the Occluded-Duke dataset.
arXiv Detail & Related papers (2020-03-18T12:18:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.