HMGIE: Hierarchical and Multi-Grained Inconsistency Evaluation for Vision-Language Data Cleansing
- URL: http://arxiv.org/abs/2412.05685v1
- Date: Sat, 07 Dec 2024 15:47:49 GMT
- Title: HMGIE: Hierarchical and Multi-Grained Inconsistency Evaluation for Vision-Language Data Cleansing
- Authors: Zihao Zhu, Hongbao Zhang, Guanzong Wu, Siwei Lyu, Baoyuan Wu,
- Abstract summary: We design an adaptive evaluation framework, called Hierarchical and Multi-Grained Inconsistency Evaluation (HMGIE)
HMGIE can provide multi-grained evaluations covering both accuracy and completeness for various image-caption pairs.
To verify the efficacy and flexibility of the proposed framework, we construct MVTID, an image-caption dataset with diverse types and granularities of inconsistencies.
- Score: 54.970275599061594
- License:
- Abstract: Visual-textual inconsistency (VTI) evaluation plays a crucial role in cleansing vision-language data. Its main challenges stem from the high variety of image captioning datasets, where differences in content can create a range of inconsistencies (\eg, inconsistencies in scene, entities, entity attributes, entity numbers, entity interactions). Moreover, variations in caption length can introduce inconsistencies at different levels of granularity as well. To tackle these challenges, we design an adaptive evaluation framework, called Hierarchical and Multi-Grained Inconsistency Evaluation (HMGIE), which can provide multi-grained evaluations covering both accuracy and completeness for various image-caption pairs. Specifically, the HMGIE framework is implemented by three consecutive modules. Firstly, the semantic graph generation module converts the image caption to a semantic graph for building a structural representation of all involved semantic items. Then, the hierarchical inconsistency evaluation module provides a progressive evaluation procedure with a dynamic question-answer generation and evaluation strategy guided by the semantic graph, producing a hierarchical inconsistency evaluation graph (HIEG). Finally, the quantitative evaluation module calculates the accuracy and completeness scores based on the HIEG, followed by a natural language explanation about the detection results. Moreover, to verify the efficacy and flexibility of the proposed framework on handling different image captioning datasets, we construct MVTID, an image-caption dataset with diverse types and granularities of inconsistencies. Extensive experiments on MVTID and other benchmark datasets demonstrate the superior performance of the proposed HMGIE to current state-of-the-art methods.
Related papers
- Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment [53.45813302866466]
We present ISG, a comprehensive evaluation framework for interleaved text-and-image generation.
ISG evaluates responses on four levels of granularity: holistic, structural, block-level, and image-specific.
In conjunction with ISG, we introduce a benchmark, ISG-Bench, encompassing 1,150 samples across 8 categories and 21 subcategories.
arXiv Detail & Related papers (2024-11-26T07:55:57Z) - Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models [49.439311430360284]
We introduce a novel data synthesis method inspired by contrastive learning and image difference captioning.
Our key idea involves challenging the model to discern both matching and distinct elements.
We leverage this generated dataset to fine-tune state-of-the-art (SOTA) MLLMs.
arXiv Detail & Related papers (2024-08-08T17:10:16Z) - Assessing Brittleness of Image-Text Retrieval Benchmarks from Vision-Language Models Perspective [44.045767657945895]
We examine the brittleness of the image-text retrieval (ITR) evaluation pipeline with a focus on concept granularity.
We evaluate four diverse state-of-the-art Vision-Language models on both the standard and fine-grained datasets under zero-shot conditions.
The results demonstrate that although perturbations generally degrade model performance, the fine-grained datasets exhibit a smaller performance drop than their standard counterparts.
arXiv Detail & Related papers (2024-07-21T18:08:44Z) - Dynamic Visual Semantic Sub-Embeddings and Fast Re-Ranking [0.5242869847419834]
We propose a Dynamic Visual Semantic Sub-Embeddings framework (DVSE) to reduce the information entropy.
To encourage the generated candidate embeddings to capture various semantic variations, we construct a mixed distribution.
We compare the performance with existing set-based method using four image feature encoders and two text feature encoders on three benchmark datasets.
arXiv Detail & Related papers (2023-09-15T04:39:11Z) - JourneyDB: A Benchmark for Generative Image Understanding [89.02046606392382]
We introduce a comprehensive dataset, referred to as JourneyDB, that caters to the domain of generative images.
Our meticulously curated dataset comprises 4 million distinct and high-quality generated images.
On our dataset, we have devised four benchmarks to assess the performance of generated image comprehension.
arXiv Detail & Related papers (2023-07-03T02:39:08Z) - Matching Visual Features to Hierarchical Semantic Topics for Image
Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z) - Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning based metrics for image captioning, which we call Intrinsic Image Captioning Evaluation(I2CE)
Experiment results show that our proposed method can keep robust performance and give more flexible scores to candidate captions when encountered with semantic similar expression or less aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z) - Image Captioning with Compositional Neural Module Networks [18.27510863075184]
We introduce a hierarchical framework for image captioning that explores both compositionality and sequentiality of natural language.
Our algorithm learns to compose a detail-rich sentence by selectively attending to different modules corresponding to unique aspects of each object detected in an input image.
arXiv Detail & Related papers (2020-07-10T20:58:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.