LOTUS: A Leaderboard for Detailed Image Captioning from Quality to Societal Bias and User Preferences
- URL: http://arxiv.org/abs/2507.19362v1
- Date: Fri, 25 Jul 2025 15:12:42 GMT
- Title: LOTUS: A Leaderboard for Detailed Image Captioning from Quality to Societal Bias and User Preferences
- Authors: Yusuke Hirota, Boyi Li, Ryo Hachiuma, Yueh-Hua Wu, Boris Ivanovic, Yuta Nakashima, Marco Pavone, Yejin Choi, Yu-Chiang Frank Wang, Chao-Han Huck Yang
- Abstract summary: LOTUS is a leaderboard for evaluating detailed captions. It comprehensively evaluates various aspects, including caption quality. It enables preference-oriented evaluations by tailoring criteria to diverse user preferences.
- Score: 91.13704541413551
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Vision-Language Models (LVLMs) have transformed image captioning, shifting from concise captions to detailed descriptions. We introduce LOTUS, a leaderboard for evaluating detailed captions, addressing three main gaps in existing evaluations: lack of standardized criteria, bias-aware assessments, and user preference considerations. LOTUS comprehensively evaluates various aspects, including caption quality (e.g., alignment, descriptiveness), risks (e.g., hallucination), and societal biases (e.g., gender bias) while enabling preference-oriented evaluations by tailoring criteria to diverse user preferences. Our analysis of recent LVLMs reveals no single model excels across all criteria, while correlations emerge between caption detail and bias risks. Preference-oriented evaluations demonstrate that optimal model selection depends on user priorities.
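As an illustration of the preference-oriented evaluation the abstract describes, the sketch below re-ranks models under different user-supplied criterion weights. The criterion names, scores, and weights are invented for illustration and are not LOTUS's actual schema.

```python
# Hypothetical sketch of preference-oriented model ranking in the spirit of
# LOTUS; criteria and numbers below are illustrative, not the leaderboard's.

# Per-model scores on each criterion, normalized to [0, 1].
scores = {
    "model_a": {"alignment": 0.82, "descriptiveness": 0.91, "hallucination_risk": 0.30},
    "model_b": {"alignment": 0.88, "descriptiveness": 0.74, "hallucination_risk": 0.12},
}

def rank(scores, weights):
    """Rank models by a weighted sum; risk criteria enter with negative weight."""
    totals = {
        m: sum(weights[c] * v for c, v in crit.items())
        for m, crit in scores.items()
    }
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# A user who prizes rich detail versus one who prioritizes low risk.
detail_lover = {"alignment": 0.3, "descriptiveness": 0.6, "hallucination_risk": -0.1}
risk_averse = {"alignment": 0.3, "descriptiveness": 0.1, "hallucination_risk": -0.6}

print(rank(scores, detail_lover))  # model_a comes out ahead
print(rank(scores, risk_averse))   # model_b comes out ahead
```

This mirrors the paper's finding that no single model wins across all criteria: which model is "best" flips with the user's weighting.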
Related papers
- Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning [56.31096024472269]
We introduce DeCapBench along with a novel metric, DCScore, specifically designed for detailed captioning tasks. DCScore evaluates hallucinations and fine-grained comprehensiveness by deconstructing responses into the smallest self-sufficient units. DeCapBench exhibits a high correlation with VLM arena results on descriptive tasks, surpassing existing benchmarks for vision-language models.
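A rough sketch of the decomposition idea follows. In the paper an LLM extracts the units and a model verifies them against the image; here a naive sentence splitter and a keyword check against a reference description stand in, purely for illustration.

```python
# Illustrative DCScore-style scoring: split a caption into the smallest
# self-sufficient units, verify each, then report precision (penalizes
# hallucinated units) and recall (rewards comprehensiveness).

def extract_units(caption: str) -> list[str]:
    # Stand-in for the paper's LLM-based decomposition: split on sentences.
    return [s.strip() for s in caption.split(".") if s.strip()]

def verify_unit(unit: str, reference: str) -> bool:
    # Naive stand-in for a model's visual verification: a unit counts as
    # supported if its content words appear in a reference description.
    words = {w for w in unit.lower().split() if len(w) > 3}
    return bool(words) and all(w in reference.lower() for w in words)

def dcscore_like(caption: str, reference: str, reference_units: list[str]):
    units = extract_units(caption)
    supported = [u for u in units if verify_unit(u, reference)]
    precision = len(supported) / max(len(units), 1)
    recall = len(supported) / max(len(reference_units), 1)
    return precision, recall
```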
arXiv Detail & Related papers (2025-03-10T22:53:56Z)
- Evaluating Image Caption via Cycle-consistent Text-to-Image Generation [24.455344211552692]
We propose CAMScore, a reference-free automatic evaluation metric for image captioning models. To circumvent the modality gap between text and images, CAMScore utilizes a text-to-image model to generate images from captions and subsequently evaluates these generated images against the original images. Experiment results show that CAMScore achieves a superior correlation with human judgments compared to existing reference-based and reference-free metrics.
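A minimal sketch of this cycle-consistency idea: render the caption back into an image and compare it to the original in an image embedding space. The model choices (Stable Diffusion, CLIP) are assumptions here; the paper's actual pipeline and similarity measures may differ.

```python
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

t2i = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def camscore_like(caption: str, original: Image.Image) -> float:
    generated = t2i(caption).images[0]  # caption -> regenerated image
    inputs = proc(images=[original, generated], return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float(feats[0] @ feats[1])   # cosine similarity; higher is better
```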
arXiv Detail & Related papers (2025-01-07T06:35:34Z)
- ComPO: Community Preferences for Language Model Personalization [122.54846260663922]
ComPO is a method to personalize preference optimization in language models.
We collect and release ComPRed, a question answering dataset with community-level preferences from Reddit.
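One way to condition preference optimization on a community, sketched below, is to compute response log-probabilities on prompts prefixed with the community identifier, so the same question can have different winning answers in different subreddits. The loss is a standard DPO-style formulation, not necessarily ComPO's exact objective.

```python
import torch
import torch.nn.functional as F

def community_dpo_loss(logp_chosen, logp_rejected,
                       ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """All inputs are summed token log-probs of responses to a prompt of the
    form f"[community: {subreddit}] {question}" under the policy (logp_*) and
    a frozen reference model (ref_logp_*)."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin).mean()
```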
arXiv Detail & Related papers (2024-10-21T14:02:40Z)
- CFaiRLLM: Consumer Fairness Evaluation in Large-Language Model Recommender System [16.84754752395103]
This work takes a critical stance on previous studies concerning fairness evaluation in Large Language Model (LLM)-based recommender systems. We introduce CFaiRLLM, an enhanced evaluation framework that not only incorporates true preference alignment but also rigorously examines intersectional fairness. To validate the efficacy of CFaiRLLM, we conducted extensive experiments using MovieLens and LastFM.
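The kind of check such a framework performs can be sketched as follows: compare the recommendations an LLM returns for a neutral prompt against those returned when sensitive, possibly intersectional, attributes are added. The `recommend` callable and prompt wording are hypothetical stand-ins for an LLM recommender.

```python
def jaccard(a: list[str], b: list[str]) -> float:
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def fairness_gap(recommend, history, attributes):
    """Lower overlap with the neutral list signals attribute-sensitive drift."""
    neutral = recommend(f"User history: {history}. Recommend 10 movies.")
    gaps = {}
    for attr in attributes:  # e.g. "a young woman", "an elderly man"
        conditioned = recommend(
            f"User history: {history}. The user is {attr}. Recommend 10 movies.")
        gaps[attr] = jaccard(neutral, conditioned)
    return gaps
```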
arXiv Detail & Related papers (2024-03-08T20:44:59Z)
- GPTBIAS: A Comprehensive Framework for Evaluating Bias in Large Language Models [83.30078426829627]
Large language models (LLMs) have gained popularity and are being widely adopted by a large user community.
Existing evaluation methods have many constraints, and their results offer limited interpretability.
We propose a bias evaluation framework named GPTBIAS that leverages the high performance of LLMs to assess bias in models.
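A minimal sketch of this LLM-as-judge setup: a strong LLM scores another model's response and explains the bias it finds. The `judge_llm` callable and the prompt are hypothetical; the real framework's prompts, bias taxonomy, and aggregation are richer.

```python
import json

JUDGE_TEMPLATE = """You are auditing a language model for social bias.
Instruction given to the model: {instruction}
Model response: {response}
Return JSON: {{"biased": true/false, "bias_types": [...], "reason": "..."}}"""

def assess_bias(judge_llm, instruction: str, response: str) -> dict:
    verdict = judge_llm(JUDGE_TEMPLATE.format(instruction=instruction,
                                              response=response))
    return json.loads(verdict)  # interpretable: bias types plus a rationale
```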
arXiv Detail & Related papers (2023-12-11T12:02:14Z)
- Evaluating the Fairness of Discriminative Foundation Models in Computer Vision [51.176061115977774]
We propose a novel taxonomy for bias evaluation of discriminative foundation models, such as Contrastive Language-Image Pretraining (CLIP).
We then systematically evaluate existing methods for mitigating bias in these models with respect to our taxonomy.
Specifically, we evaluate OpenAI's CLIP and OpenCLIP models for key applications, such as zero-shot classification, image retrieval and image captioning.
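A hedged sketch of one measurement from this kind of audit: the label distribution CLIP's zero-shot classifier assigns, compared across demographic groups. The labels and grouping here are illustrative; the paper defines a full taxonomy of such measurements.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
labels = ["a photo of a doctor", "a photo of a nurse"]  # illustrative only

def label_rates(images_by_group):
    """images_by_group: dict mapping group name -> list of PIL images."""
    rates = {}
    for group, images in images_by_group.items():
        inputs = proc(text=labels, images=images,
                      return_tensors="pt", padding=True)
        with torch.no_grad():
            probs = model(**inputs).logits_per_image.softmax(dim=-1)
        rates[group] = probs.mean(dim=0)  # average label distribution per group
    return rates  # large cross-group differences indicate biased associations
```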
arXiv Detail & Related papers (2023-10-18T10:32:39Z)
- Calibrating LLM-Based Evaluator [92.17397504834825]
We propose AutoCalibrate, a multi-stage, gradient-free approach to calibrate and align an LLM-based evaluator toward human preference.
Instead of explicitly modeling human preferences, AutoCalibrate first captures them implicitly through a set of human labels.
Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration.
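A gradient-free calibration loop of this kind can be sketched as follows: among candidate scoring prompts (for example, drafted from the human-labeled examples), keep the one whose scores correlate best with the human labels. The `llm_score` callable is a hypothetical stand-in for querying the LLM evaluator.

```python
from scipy.stats import spearmanr

def calibrate(llm_score, candidate_prompts, texts, human_labels):
    """Pick the prompt whose scores best track human judgments."""
    best_prompt, best_corr = None, float("-inf")
    for prompt in candidate_prompts:
        scores = [llm_score(prompt, t) for t in texts]
        corr, _ = spearmanr(scores, human_labels)
        if corr > best_corr:
            best_prompt, best_corr = prompt, corr
    return best_prompt, best_corr  # the calibrated evaluator prompt
```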
arXiv Detail & Related papers (2023-09-23T08:46:11Z)