Polos: Multimodal Metric Learning from Human Feedback for Image
Captioning
- URL: http://arxiv.org/abs/2402.18091v1
- Date: Wed, 28 Feb 2024 06:24:39 GMT
- Title: Polos: Multimodal Metric Learning from Human Feedback for Image
Captioning
- Authors: Yuiga Wada, Kanta Kaneda, Daichi Saito, Komei Sugiura
- Abstract summary: Polos is a supervised automatic evaluation metric for image captioning models.
We constructed the Polaris dataset, which comprises 131K human judgments from 550 evaluators.
Our approach achieved state-of-the-art performance on Composite, Flickr8K-Expert, Flickr8K-CF, PASCAL-50S, FOIL, and the Polaris dataset.
- Score: 1.3654846342364308
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Establishing an automatic evaluation metric that closely aligns with human
judgments is essential for effectively developing image captioning models.
Recent data-driven metrics have demonstrated a stronger correlation with human
judgments than classic metrics such as CIDEr; however, they lack sufficient
capability to handle hallucinations and to generalize across diverse images and
texts, partly because they compute scalar similarities using only embeddings
learned from tasks unrelated to image captioning evaluation. In this
study, we propose Polos, a supervised automatic evaluation metric for image
captioning models. Polos computes scores from multimodal inputs, using a
parallel feature extraction mechanism that leverages embeddings trained through
large-scale contrastive learning. To train Polos, we introduce Multimodal
Metric Learning from Human Feedback (M$^2$LHF), a framework for developing
metrics based on human feedback. We constructed the Polaris dataset, which
comprises 131K human judgments from 550 evaluators and is approximately ten
times larger than standard datasets. Our approach achieved state-of-the-art
performance on Composite, Flickr8K-Expert, Flickr8K-CF, PASCAL-50S, FOIL, and
the Polaris dataset, thereby demonstrating its effectiveness and robustness.
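The abstract includes no implementation, so the following is only a minimal sketch of how a Polos-style supervised metric could be structured, assuming CLIP-style embeddings as input; the feature-combination scheme and layer sizes are illustrative assumptions, not the authors' exact architecture.

```python
# Hedged sketch of a Polos-style supervised caption metric (illustrative,
# not the authors' code). Inputs are assumed to be CLIP-style embeddings
# for the image, the candidate caption, and a reference caption.
import torch
import torch.nn as nn

class CaptionMetric(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        # Regression head mapping combined multimodal features to a scalar.
        self.head = nn.Sequential(
            nn.Linear(dim * 4, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid()
        )

    def forward(self, img, cand, ref):
        # Parallel feature extraction: element-wise products and absolute
        # differences capture candidate-image and candidate-reference agreement.
        feats = torch.cat(
            [cand * img, (cand - img).abs(), cand * ref, (cand - ref).abs()],
            dim=-1,
        )
        return self.head(feats).squeeze(-1)  # predicted human judgment in [0, 1]
```

Under M$^2$LHF, such a head would be regressed against the 131K Polaris human judgments (for example with an MSE loss) so that predicted scores track human ratings directly.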
Related papers
- VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models [121.03333569013148]
We introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories.
These questions assess the visual reasoning capabilities of MLLMs from multiple perspectives.
Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans.
arXiv Detail & Related papers (2025-04-21T17:59:53Z)
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
- Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy, Trends and Metrics Analysis [35.71703501731081]
We present the first survey and taxonomy of over 70 different image captioning metrics and their usage in hundreds of papers.
Despite the diversity of proposed metrics, the vast majority of studies rely on only five popular metrics, which we show to be weakly correlated with human ratings.
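Correlation with human ratings is typically quantified with a rank-correlation coefficient such as Kendall's tau; a minimal sketch follows (the scores are placeholder data, not results from the survey).

```python
# Minimal sketch: how agreement between an automatic metric and human
# ratings is typically quantified (Kendall's tau is common on Flickr8K).
# The scores below are placeholder data, not the survey's results.
from scipy.stats import kendalltau

metric_scores = [0.71, 0.42, 0.88, 0.15, 0.64]   # e.g., CIDEr per caption
human_ratings = [4, 2, 5, 1, 4]                  # e.g., 1-5 expert ratings

tau, p_value = kendalltau(metric_scores, human_ratings)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```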
arXiv Detail & Related papers (2024-08-09T07:31:06Z)
- EVALALIGN: Supervised Fine-Tuning Multimodal LLMs with Human-Aligned Data for Evaluating Text-to-Image Models [16.18275805302776]
We propose EvalAlign, a metric characterized by its accuracy, stability, and fine granularity.
We develop evaluation protocols that focus on two key dimensions: image faithfulness and text-image alignment.
EvalAlign aligns more closely with human preferences than existing metrics, confirming its effectiveness and utility in model assessment.
arXiv Detail & Related papers (2024-06-24T11:56:15Z)
- Filter & Align: Leveraging Human Knowledge to Curate Image-Text Data [31.507451966555383]
We present a novel algorithm that incorporates human knowledge of image-text alignment to guide the filtering of vast corpora of web-crawled image-text data.
We collect a diverse image-text dataset where each image is associated with multiple captions from various sources.
We train a reward model on these human-preference annotations to internalize the nuanced human understanding of image-text alignment.
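A rough sketch of the filtering step this describes, under the assumption of a reward model that maps an image-text pair to a scalar preference score; the function names and threshold are illustrative.

```python
# Hedged sketch of reward-model-based data filtering (names are
# illustrative assumptions): score each web-crawled image-text pair with a
# preference-trained reward model and keep only pairs above a threshold.
def filter_pairs(pairs, reward_model, threshold=0.5):
    """pairs: iterable of (image, caption); reward_model returns a float."""
    kept = []
    for image, caption in pairs:
        score = reward_model(image, caption)  # human-preference-trained scorer
        if score >= threshold:
            kept.append((image, caption))
    return kept
```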
arXiv Detail & Related papers (2023-12-11T05:57:09Z)
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that fine-grained evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)
- Learning Transferable Pedestrian Representation from Multimodal Information Supervision [174.5150760804929]
VAL-PAT is a novel framework that learns transferable representations to enhance various pedestrian analysis tasks with multimodal information.
We first perform pre-training on the LUPerson-TA dataset, where each image contains text and attribute annotations.
We then transfer the learned representations to various downstream tasks, including person reID, person attribute recognition and text-based person search.
arXiv Detail & Related papers (2023-04-12T01:20:58Z)
- Borrowing Human Senses: Comment-Aware Self-Training for Social Media Multimodal Classification [5.960550152906609]
We capture hinting features from user comments, which are retrieved by jointly leveraging visual and linguistic similarity.
The classification tasks are addressed via self-training in a teacher-student framework, motivated by the typically limited scale of labeled data.
The results show that our method further advances the performance of previous state-of-the-art models.
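A minimal sketch of the generic teacher-student self-training recipe the summary refers to, assuming scikit-learn-style estimators (fit / predict_proba); the confidence threshold is an assumption.

```python
# Hedged sketch of generic teacher-student self-training: the teacher
# pseudo-labels unlabeled data, and the student trains on the union of
# labeled and confidently pseudo-labeled examples.
def self_train(teacher, student, x_labeled, y_labeled, x_unlabeled,
               confidence=0.9):
    teacher.fit(x_labeled, y_labeled)            # 1. train teacher on labels
    pseudo_x, pseudo_y = [], []
    for x in x_unlabeled:                        # 2. pseudo-label unlabeled data
        probs = teacher.predict_proba([x])[0]
        if probs.max() >= confidence:            # keep confident predictions only
            pseudo_x.append(x)
            pseudo_y.append(int(probs.argmax()))
    # 3. train the student on labeled plus confident pseudo-labeled examples
    student.fit(list(x_labeled) + pseudo_x, list(y_labeled) + pseudo_y)
    return student
```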
arXiv Detail & Related papers (2023-03-27T08:59:55Z)
- Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation [47.40949434032489]
We propose a new contrastive-based evaluation metric for image captioning, namely the Positive-Augmented Contrastive learning Score (PAC-S).
PAC-S unifies the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data.
Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos.
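At its core, this family of metrics scores a caption by embedding similarity in a CLIP-style visual-semantic space; below is a minimal sketch using the Hugging Face transformers CLIP API. PAC-S additionally fine-tunes the space with generated positive pairs, which is omitted here.

```python
# Minimal sketch of a CLIP-style reference-free caption score, the backbone
# that PAC-S builds on (PAC-S's fine-tuning with generated positives is
# omitted; this is not the paper's exact metric).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    # Cosine similarity between image and caption embeddings.
    return torch.nn.functional.cosine_similarity(img, txt).item()
```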
arXiv Detail & Related papers (2023-03-21T18:03:14Z)
- Pixel-level Correspondence for Self-Supervised Learning from Video [56.24439897867531]
Pixel-level Correspondence (PiCo) is a method for dense contrastive learning from video.
We validate PiCo on standard benchmarks, outperforming self-supervised baselines on multiple dense prediction tasks.
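A minimal sketch of the dense InfoNCE objective underlying pixel-level contrastive methods of this kind; the shapes and temperature are assumptions, not PiCo's exact formulation.

```python
# Hedged sketch of a dense (pixel-level) InfoNCE loss: corresponding pixels
# across two views or frames are positives, all other pixels are negatives.
import torch
import torch.nn.functional as F

def dense_info_nce(feat_a: torch.Tensor, feat_b: torch.Tensor,
                   tau: float = 0.1) -> torch.Tensor:
    """feat_a, feat_b: (N, D) features at N corresponding pixel locations."""
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)
    logits = a @ b.t() / tau          # (N, N) similarities; diagonal = positives
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)
```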
arXiv Detail & Related papers (2022-07-08T12:50:13Z)
- Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
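A rough sketch of the conditioning idea: control tokens for style and retrieved keywords are prepended to the captioner's decoder input. The token names are illustrative, not the paper's vocabulary.

```python
# Illustrative sketch: prepend a style token and retrieved keywords to the
# decoder input so the captioner can separate semantics from style.
def build_decoder_prefix(style: str, keywords: list[str]) -> str:
    # e.g. "<style:descriptive> <kw> beach sunset surfer </kw>"
    return f"<style:{style}> <kw> {' '.join(keywords)} </kw>"

print(build_decoder_prefix("descriptive", ["beach", "sunset", "surfer"]))
```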
arXiv Detail & Related papers (2021-11-24T19:00:05Z)
- Deep Relational Metric Learning [84.95793654872399]
This paper presents a deep relational metric learning framework for image clustering and retrieval.
We learn an ensemble of features that characterizes an image from different aspects to model both interclass and intraclass distributions.
Experiments on the widely-used CUB-200-2011, Cars196, and Stanford Online Products datasets demonstrate that our framework improves existing deep metric learning methods and achieves very competitive results.
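A hedged sketch of the ensemble-of-features idea: several projection heads embed different aspects of an image, and per-head distances are aggregated. The head count, dimensions, and aggregation are assumptions for illustration.

```python
# Hedged sketch in the spirit of ensemble-based deep metric learning:
# multiple projection heads embed different aspects of an image.
import torch
import torch.nn as nn

class EnsembleEmbedder(nn.Module):
    def __init__(self, in_dim: int = 2048, emb_dim: int = 128, heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(in_dim, emb_dim)
                                   for _ in range(heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Stack per-head embeddings: (batch, heads, emb_dim)
        return torch.stack([h(x) for h in self.heads], dim=1)

def ensemble_distance(e1: torch.Tensor, e2: torch.Tensor) -> torch.Tensor:
    # Mean Euclidean distance across heads; each head captures one "aspect".
    return (e1 - e2).norm(dim=-1).mean(dim=-1)
```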
arXiv Detail & Related papers (2021-08-23T09:31:18Z)