Towards Automatic Evaluation for Image Transcreation
- URL: http://arxiv.org/abs/2412.13717v2
- Date: Fri, 10 Jan 2025 04:09:17 GMT
- Title: Towards Automatic Evaluation for Image Transcreation
- Authors: Simran Khanuja, Vivek Iyer, Claire He, Graham Neubig,
- Abstract summary: We propose a suite of automatic evaluation metrics inspired by machine translation (MT) metrics.
We identify three critical dimensions of image transcreation: cultural relevance, semantic equivalence and visual similarity.
Our results show that proprietary VLMs best identify cultural relevance and semantic equivalence, while vision-encoder representations are adept at measuring visual similarity.
- Score: 52.71090829502756
- License:
- Abstract: Beyond conventional paradigms of translating speech and text, recently, there has been interest in automated transcreation of images to facilitate localization of visual content across different cultures. Attempts to define this as a formal Machine Learning (ML) problem have been impeded by the lack of automatic evaluation mechanisms, with previous work relying solely on human evaluation. In this paper, we seek to close this gap by proposing a suite of automatic evaluation metrics inspired by machine translation (MT) metrics, categorized into: a) Object-based, b) Embedding-based, and c) VLM-based. Drawing on theories from translation studies and real-world transcreation practices, we identify three critical dimensions of image transcreation: cultural relevance, semantic equivalence and visual similarity, and design our metrics to evaluate systems along these axes. Our results show that proprietary VLMs best identify cultural relevance and semantic equivalence, while vision-encoder representations are adept at measuring visual similarity. Meta-evaluation across 7 countries shows our metrics agree strongly with human ratings, with average segment-level correlations ranging from 0.55-0.87. Finally, through a discussion of the merits and demerits of each metric, we offer a robust framework for automated image transcreation evaluation, grounded in both theoretical foundations and practical application. Our code can be found here: https://github.com/simran-khanuja/automatic-eval-transcreation
Related papers
- HMGIE: Hierarchical and Multi-Grained Inconsistency Evaluation for Vision-Language Data Cleansing [54.970275599061594]
We design an adaptive evaluation framework, called Hierarchical and Multi-Grained Inconsistency Evaluation (HMGIE)
HMGIE can provide multi-grained evaluations covering both accuracy and completeness for various image-caption pairs.
To verify the efficacy and flexibility of the proposed framework, we construct MVTID, an image-caption dataset with diverse types and granularities of inconsistencies.
arXiv Detail & Related papers (2024-12-07T15:47:49Z) - Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In! [80.3129093617928]
Annually, at the Conference of Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics.
This work highlights two issues with the meta-evaluation framework currently employed in WMT, and assesses their impact on the metrics rankings.
We introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process's accuracy, robustness, and fairness.
arXiv Detail & Related papers (2024-08-25T13:29:34Z) - Holistic Evaluation for Interleaved Text-and-Image Generation [19.041251355695973]
We introduce InterleavedBench, the first benchmark carefully curated for the evaluation of interleaved text-and-image generation.
In addition, we present InterleavedEval, a strong reference-free metric powered by GPT-4o to deliver accurate and explainable evaluation.
arXiv Detail & Related papers (2024-06-20T18:07:19Z) - Global-Local Image Perceptual Score (GLIPS): Evaluating Photorealistic Quality of AI-Generated Images [0.7499722271664147]
The Global-Local Image Perceptual Score (GLIPS) is an image metric designed to assess the photorealistic image quality of AI-generated images.
Comprehensive tests across various generative models demonstrate that GLIPS consistently outperforms existing metrics like FID, SSIM, and MS-SSIM in terms of correlation with human scores.
arXiv Detail & Related papers (2024-05-15T15:19:23Z) - Unlocking Structure Measuring: Introducing PDD, an Automatic Metric for Positional Discourse Coherence [39.065349875944634]
We present a novel metric designed to quantify the discourse divergence between two long-form articles.
Our metric aligns more closely with human preferences and GPT-4 coherence evaluation, outperforming existing evaluation methods.
arXiv Detail & Related papers (2024-02-15T18:23:39Z) - Towards Explainable Evaluation Metrics for Natural Language Generation [36.594817754285984]
We identify key properties and propose key goals of explainable machine translation evaluation metrics.
We conduct own novel experiments, which find that current adversarial NLP techniques are unsuitable for automatically identifying limitations of high-quality black-box evaluation metrics.
arXiv Detail & Related papers (2022-03-21T17:05:54Z) - Evaluating MT Systems: A Theoretical Framework [0.0]
This paper outlines a theoretical framework using which different automatic metrics can be designed for evaluation of Machine Translation systems.
It introduces the concept of em cognitive ease which depends on em adequacy and em lack of fluency.
It can also be used to evaluate the newer types of MT systems, such as speech to speech translation and discourse translation.
arXiv Detail & Related papers (2022-02-11T18:05:17Z) - Transparent Human Evaluation for Image Captioning [70.03979566548823]
We develop a rubric-based human evaluation protocol for image captioning models.
We show that human-generated captions show substantially higher quality than machine-generated ones.
We hope that this work will promote a more transparent evaluation protocol for image captioning.
arXiv Detail & Related papers (2021-11-17T07:09:59Z) - Contrastive Semantic Similarity Learning for Image Captioning Evaluation
with Intrinsic Auto-encoder [52.42057181754076]
Motivated by the auto-encoder mechanism and contrastive representation learning advances, we propose a learning-based metric for image captioning.
We develop three progressive model structures to learn the sentence level representations.
Experiment results show that our proposed method can align well with the scores generated from other contemporary metrics.
arXiv Detail & Related papers (2021-06-29T12:27:05Z) - GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.