AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs
- URL: http://arxiv.org/abs/2511.20515v3
- Date: Tue, 02 Dec 2025 07:11:36 GMT
- Title: AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs
- Authors: Kuniaki Saito, Risa Shinoda, Shohei Tanaka, Tosho Hirasawa, Fumio Okura, Yoshitaka Ushiku
- Abstract summary: AlignBench is a benchmark that provides a new indicator of image-text alignment. It evaluates detailed image-caption pairs generated by diverse image-to-text and text-to-image models. Each sentence is annotated for correctness, enabling direct assessment of VLMs as alignment evaluators.
- Score: 27.133240420463807
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Assessing image-text alignment models such as CLIP is crucial for bridging visual and linguistic representations. Yet existing benchmarks rely on rule-based perturbations or short captions, limiting their ability to measure fine-grained alignment. We introduce AlignBench, a benchmark that provides a new indicator of image-text alignment by evaluating detailed image-caption pairs generated by diverse image-to-text and text-to-image models. Each sentence is annotated for correctness, enabling direct assessment of VLMs as alignment evaluators. Benchmarking a wide range of decoder-based VLMs reveals three key findings: (i) CLIP-based models, even those tailored for compositional reasoning, remain nearly blind; (ii) detectors systematically over-score early sentences; and (iii) they show strong self-preference, favoring their own outputs and harming detection performance. Our project page will be available at https://dahlian00.github.io/AlignBench/.
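The abstract frames fine-grained alignment as a per-sentence correctness judgement over a detailed caption. Below is a minimal sketch of one such scoring scheme with a CLIP-style dual encoder; the open_clip package, the ViT-B-32/laion2b checkpoint, the image file, and the example caption are all illustrative assumptions, not AlignBench's actual evaluation protocol.

```python
# Toy sentence-level image-text alignment scoring with a CLIP-style dual encoder.
# Assumptions (not from the paper): open_clip with a ViT-B-32 checkpoint and a
# simple per-sentence cosine similarity as the "correctness" signal.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)      # hypothetical image
caption = (
    "A brown dog runs across a grassy field. "
    "The dog wears a red collar. "
    "A mountain is visible in the background."
)
sentences = [s.strip() + "." for s in caption.split(".") if s.strip()]

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(tokenizer(sentences))
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    scores = (txt_emb @ img_emb.T).squeeze(-1)                  # one score per sentence

for sent, score in zip(sentences, scores.tolist()):
    print(f"{score:.3f}  {sent}")
```

Per the abstract, per-sentence scores of this kind from CLIP-based models remain nearly blind to fine-grained errors, which motivates also evaluating decoder-based VLMs as sentence-level alignment detectors.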
Related papers
- Will It Zero-Shot?: Predicting Zero-Shot Classification Performance For Arbitrary Queries [19.511404894563455]
We build on prior work using text-only comparisons to evaluate how well a model works for a given natural language task. We explore approaches that also generate synthetic images relevant to that task to evaluate and refine the prediction of zero-shot accuracy.
arXiv Detail & Related papers (2026-01-24T17:30:23Z) - Image Recognition with Vision and Language Embeddings of VLMs [14.022566577479322]
Vision-language models (VLMs) have enabled strong zero-shot classification through image-text alignment. We conduct a comprehensive evaluation of both language-guided and vision-only image classification with a diverse set of dual-encoder VLMs. Key factors affecting accuracy are analysed, including prompt design, class diversity, the number of neighbours in k-NN, and reference set size.
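A rough sketch contrasting the two classification modes discussed above, assuming precomputed, L2-normalised embeddings from some dual-encoder VLM (the tensors and shapes below are placeholders, not the paper's setup):

```python
# Language-guided (zero-shot) vs. vision-only (k-NN) classification with
# dual-encoder embeddings. All embeddings are assumed to be L2-normalised
# outputs of some CLIP-like model; the shapes below are placeholders.
import torch
import torch.nn.functional as F

num_classes, dim, k = 10, 512, 5
query_img = F.normalize(torch.randn(1, dim), dim=-1)

# (a) zero-shot: compare the image against one text embedding per class prompt
class_text_emb = F.normalize(torch.randn(num_classes, dim), dim=-1)
zero_shot_pred = (query_img @ class_text_emb.T).argmax(dim=-1)

# (b) vision-only k-NN: compare against a labelled reference set of image embeddings
ref_emb = F.normalize(torch.randn(1000, dim), dim=-1)
ref_labels = torch.randint(0, num_classes, (1000,))
sims = (query_img @ ref_emb.T).squeeze(0)
topk = sims.topk(k).indices
knn_pred = torch.bincount(ref_labels[topk], minlength=num_classes).argmax()

print(int(zero_shot_pred), int(knn_pred))
```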
arXiv Detail & Related papers (2025-09-11T09:54:25Z) - Redemption Score: A Multi-Modal Evaluation Framework for Image Captioning via Distributional, Perceptual, and Linguistic Signal Triangulation [3.4998703934432682]
Redemption Score (RS) is a novel framework that ranks image captions by triangulating three complementary signals. On the Flickr8k benchmark, RS achieves a Kendall-$\tau$ of 58.42, outperforming most prior methods.
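The Kendall-$\tau$ figure above is a rank-correlation statistic between metric scores and human judgements; a toy computation with invented numbers (not Flickr8k data) using scipy.stats.kendalltau looks like this:

```python
# Kendall's tau between a metric's caption scores and human ratings.
# The two lists below are invented for illustration only.
from scipy.stats import kendalltau

metric_scores = [0.71, 0.35, 0.62, 0.90, 0.12]
human_ratings = [4, 2, 3, 5, 1]

tau, p_value = kendalltau(metric_scores, human_ratings)
print(f"tau={tau:.2f}, p={p_value:.3f}")   # tau=1.00 here: the two rankings agree exactly
```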
arXiv Detail & Related papers (2025-05-22T03:35:12Z) - Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning [56.31096024472269]
We introduce DeCapBench along with a novel metric, DCScore, specifically designed for detailed captioning tasks. DCScore evaluates hallucinations and fine-grained comprehensiveness by deconstructing responses into the smallest self-sufficient units. DeCapBench exhibits a high correlation with VLM arena results on descriptive tasks, surpassing existing benchmarks for vision-language models.
arXiv Detail & Related papers (2025-03-10T22:53:56Z) - Extract Free Dense Misalignment from CLIP [7.0247398611254175]
This work proposes a novel approach, dubbed CLIP4DM, for detecting dense misalignments from pre-trained CLIP. We revamp the gradient-based attribution computation so that negative gradients of individual text tokens indicate misalignment. Our method demonstrates state-of-the-art performance among zero-shot models and competitive performance with fine-tuned models.
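A rough sketch of gradient-based token attribution in that spirit, not the paper's exact CLIP4DM procedure: open_clip, the ViT-B-32 checkpoint, the image file, and the caption are illustrative assumptions, and tokens with negative gradient-times-embedding scores are treated here as candidate misalignments.

```python
# Per-token gradient attribution of a CLIP image-text similarity score.
# Assumption: the model exposes a token_embedding module (true for open_clip's
# standard CLIP class); everything else is a toy choice.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

saved = {}
def keep_token_embeddings(module, inputs, output):
    # Keep the token embeddings and their gradient for later attribution.
    output.retain_grad()
    saved["emb"] = output
model.token_embedding.register_forward_hook(keep_token_embeddings)

image = preprocess(Image.open("example.jpg")).unsqueeze(0)   # hypothetical image
tokens = tokenizer(["a black cat sleeping on a red sofa"])   # illustrative caption

img_emb = model.encode_image(image)
txt_emb = model.encode_text(tokens)
sim = torch.cosine_similarity(img_emb, txt_emb).sum()
sim.backward()

# Per-token attribution: gradient of the similarity dotted with the token embedding.
attr = (saved["emb"].grad * saved["emb"]).sum(dim=-1).squeeze(0)
for tok_id, score in zip(tokens.squeeze(0).tolist(), attr.tolist()):
    if tok_id != 0:                                           # skip padding tokens
        print(tok_id, round(score, 4))
```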
arXiv Detail & Related papers (2024-12-24T12:51:05Z) - SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation.
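A minimal sketch of the exponential-moving-average teacher update mentioned above (generic PyTorch, not SILC's actual training code; the momentum value and stand-in module are illustrative):

```python
# Exponential moving average (EMA) teacher update used in self-distillation setups.
# `student` and `teacher` are two copies of the same architecture; the momentum
# value is an illustrative choice.
import copy
import torch

def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, momentum: float = 0.996):
    with torch.no_grad():
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

student = torch.nn.Linear(8, 8)            # stand-in for the real encoder
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)                # the teacher is never trained directly

# ... after each optimisation step on the student:
ema_update(teacher, student)
```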
arXiv Detail & Related papers (2023-10-20T08:44:47Z) - Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z) - Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback [20.78162037954646]
We introduce a decompositional approach towards evaluation and improvement of text-to-image alignment.
Human user studies indicate that the proposed approach surpasses previous state-of-the-art by 8.7% in overall text-to-image alignment accuracy.
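A schematic sketch of the decompositional idea described above, with a hypothetical `vqa_yes_probability` callable standing in for whatever VQA model is used (this is not the paper's pipeline):

```python
# Decompositional text-to-image alignment scoring: split the prompt into simple
# assertions, query a VQA model about each, and aggregate the per-assertion scores.
# `vqa_yes_probability` is a hypothetical callable, not a real API.
from typing import Callable, List

def alignment_score(
    image_path: str,
    assertions: List[str],
    vqa_yes_probability: Callable[[str, str], float],
) -> float:
    scores = [vqa_yes_probability(image_path, f"Is this true: {a}?") for a in assertions]
    return sum(scores) / len(scores)

# Example decomposition of the prompt "a red car parked next to a tall tree":
assertions = ["there is a car", "the car is red", "there is a tall tree",
              "the car is parked next to the tree"]

# Dummy stand-in model that answers "yes" with probability 0.8, so the sketch runs.
dummy_vqa = lambda image_path, question: 0.8
print(alignment_score("example.jpg", assertions, dummy_vqa))
```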
arXiv Detail & Related papers (2023-07-10T17:54:57Z) - CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model [55.321010757641524]
We introduce CLIP4STR, a simple yet effective STR method built upon image and text encoders of CLIP. We scale CLIP4STR in terms of the model size, pre-training data, and training data, achieving state-of-the-art performance on 13 STR benchmarks.
arXiv Detail & Related papers (2023-05-23T12:51:20Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
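A toy version of the text-to-pixel alignment idea above: score every pixel embedding against a sentence embedding and supervise with the referring mask (generic PyTorch with placeholder tensors, not the CRIS implementation):

```python
# Toy text-to-pixel alignment: a per-pixel dot product between the sentence
# embedding and pixel embeddings, trained against the ground-truth mask.
# Shapes and tensors here are placeholders, not CRIS's architecture.
import torch
import torch.nn.functional as F

B, D, H, W = 2, 256, 28, 28
pixel_emb = F.normalize(torch.randn(B, D, H, W), dim=1)   # from a visual decoder
text_emb = F.normalize(torch.randn(B, D), dim=1)           # from a text encoder
gt_mask = torch.randint(0, 2, (B, H, W)).float()           # referring segmentation mask

# Per-pixel similarity logits (0.07 is a fixed temperature stand-in).
logits = torch.einsum("bdhw,bd->bhw", pixel_emb, text_emb) / 0.07
loss = F.binary_cross_entropy_with_logits(logits, gt_mask)
print(loss.item())
```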
arXiv Detail & Related papers (2021-11-30T07:29:08Z) - Contrastive Semantic Similarity Learning for Image Captioning Evaluation with Intrinsic Auto-encoder [52.42057181754076]
Motivated by the auto-encoder mechanism and contrastive representation learning advances, we propose a learning-based metric for image captioning.
We develop three progressive model structures to learn the sentence-level representations.
Experiment results show that our proposed method can align well with the scores generated from other contemporary metrics.
arXiv Detail & Related papers (2021-06-29T12:27:05Z)