Related papers: Unblocking Fine-Grained Evaluation of Detailed Captions: An Explaining AutoRater and Critic-and-Revise Pipeline

Unblocking Fine-Grained Evaluation of Detailed Captions: An Explaining AutoRater and Critic-and-Revise Pipeline

URL: http://arxiv.org/abs/2506.07631v1
Date: Mon, 09 Jun 2025 10:57:26 GMT
Title: Unblocking Fine-Grained Evaluation of Detailed Captions: An Explaining AutoRater and Critic-and-Revise Pipeline
Authors: Brian Gordon, Yonatan Bitton, Andreea Marzoca, Yasumasa Onoe, Xiao Wang, Daniel Cohen-Or, Idan Szpektor,
Abstract summary: We develop VNLI-Critique, a model for automated sentence-level factuality classification and critique generation.<n>We highlight three key applications: (1) VNLI-Critique demonstrates robust generalization, validated by state-of-the-art performance on the M-HalDetect benchmark; (2) The VNLI-Critique driven AutoRater for DOCCI-Critique provides reliable VLM rankings, showing excellent alignment with human factuality judgments; and (3) An innovative Critic-and-Revise pipeline, achieves substantial improvements in caption factuality.
Score: 58.832237984587664
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Large Vision-Language Models (VLMs) now generate highly detailed, paragraphlength image captions, yet evaluating their factual accuracy remains challenging. Current methods often miss fine-grained errors, being designed for shorter texts or lacking datasets with verified inaccuracies. We introduce DOCCI-Critique, a benchmark with 1,400 VLM-generated paragraph captions (100 images, 14 VLMs) featuring over 10,216 sentence-level human annotations of factual correctness and explanatory rationales for errors, all within paragraph context. Building on this, we develop VNLI-Critique, a model for automated sentence-level factuality classification and critique generation. We highlight three key applications: (1) VNLI-Critique demonstrates robust generalization, validated by state-of-the-art performance on the M-HalDetect benchmark and strong results in CHOCOLATE claim verification. (2) The VNLI-Critique driven AutoRater for DOCCI-Critique provides reliable VLM rankings, showing excellent alignment with human factuality judgments (e.g., 0.98 Spearman). (3) An innovative Critic-and-Revise pipeline, where critiques from VNLI-Critique guide LLM-based corrections, achieves substantial improvements in caption factuality (e.g., a 46% gain on DetailCaps-4870). Our work offers a crucial benchmark alongside practical tools, designed to significantly elevate the standards for fine-grained evaluation and foster the improvement of VLM image understanding. Project page: https://google.github.io/unblocking-detail-caption

Related papers

RATE: Reviewer Profiling and Annotation-free Training for Expertise Ranking in Peer Review Systems [6.083097040417168]
We introduce LR-bench, a benchmark curated from 2024-2025 AI/NLP manuscripts with five-level self-assessed familiarity ratings.<n>We also propose RATE, a reviewer-centric ranking framework that distills each reviewer's recent publications into compact keyword-based profiles.<n>Our approach consistently achieves state-of-the-art performance, outperforming strong embedding baselines by a clear margin.
arXiv Detail & Related papers (2026-01-27T14:13:46Z)
Self-reflection in Automated Qualitative Coding: Improving Text Annotation through Secondary LLM Critique [1.5749416770494706]
Large language models (LLMs) allow for sophisticated qualitative coding of large datasets.<n>We present a simple, generalizable two-stage workflow: an LLM applies a human-designed codebook; a secondary LLM critic performs self-reflection on each positive label.<n>We evaluate this approach on six qualitative codes over 3,000 high-content emails from Apache Software Foundation project evaluation discussions.
arXiv Detail & Related papers (2026-01-14T22:27:13Z)
Don't Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models [11.379764847748378]
Large language models (LLMs) often uncritically accept flawed or contradictory premises, leading to inefficient reasoning and unreliable outputs.<n>This emphasizes the significance of possessing the textbfPremise Critique Ability for LLMs, defined as the capacity to proactively identify and articulate errors in input premises.<n>We introduce the textbfPremise Critique Bench (PCBench), designed by incorporating four error types across three difficulty levels, paired with multi-faceted evaluation metrics.
arXiv Detail & Related papers (2025-05-29T17:49:44Z)
DeepCritic: Deliberate Critique with Large Language Models [77.5516314477878]
We focus on studying and enhancing the math critique ability of Large Language Models (LLMs)<n>Our developed critique model built on Qwen2.5-7B-Instruct significantly outperforms existing LLM critics on various error identification benchmarks.
arXiv Detail & Related papers (2025-05-01T17:03:17Z)
Everything You Wanted to Know About LLM-based Vulnerability Detection But Were Afraid to Ask [30.819697001992154]
Large Language Models are a promising tool for automated vulnerability detection.<n>Despite widespread adoption, a critical question remains: Are LLMs truly effective at detecting real-world vulnerabilities?<n>This paper challenges three widely held community beliefs: that LLMs are (i) unreliable, (ii) insensitive to code patches, and (iii) performance-plateaued across model scales.
arXiv Detail & Related papers (2025-04-18T05:32:47Z)
SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models [74.40683913645731]
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications.<n>Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth.<n>Analysis of these prompt scores reveals VLM biases and AND''/OR' signal ambiguities, notably that maximum scores are surprisingly suboptimal compared to second-highest scores.
arXiv Detail & Related papers (2025-02-24T07:15:05Z)
Self-Evolving Critique Abilities in Large Language Models [59.861013614500024]
This paper explores enhancing critique abilities of Large Language Models (LLMs)<n>We introduce SCRIT, a framework that trains LLMs with self-generated data to evolve their critique abilities.<n>Our analysis reveals that SCRIT's performance scales positively with data and model size.
arXiv Detail & Related papers (2025-01-10T05:51:52Z)
VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning [112.35483894933904]
We propose VISCO, the first benchmark to extensively analyze the fine-grained critique and correction capabilities of LVLMs.<n>VISCO features dense and fine-grained critique, requiring LVLMs to evaluate the correctness of each step in the chain-of-thought.<n>LookBack significantly improves critique and correction performance by up to 13.5%.
arXiv Detail & Related papers (2024-12-03T05:04:49Z)
Learning to Refine with Fine-Grained Natural Language Feedback [81.70313509881315]
We propose looking at refinement with feedback as a composition of three distinct LLM competencies. A key property of the proposed Detect, Critique, Refine ("DCR") method is that the step 2 critique model can give fine-grained feedback about errors. We show that models of different capabilities benefit from refining with DCR on the task of improving factual consistency of document grounded summaries.
arXiv Detail & Related papers (2024-07-02T16:15:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.