Towards Automatic Evaluation for Image Transcreation
- URL: http://arxiv.org/abs/2412.13717v3
- Date: Thu, 20 Mar 2025 21:00:43 GMT
- Title: Towards Automatic Evaluation for Image Transcreation
- Authors: Simran Khanuja, Vivek Iyer, Claire He, Graham Neubig
- Abstract summary: We propose a suite of automatic evaluation metrics inspired by machine translation (MT) metrics. We identify three critical dimensions of image transcreation: cultural relevance, semantic equivalence and visual similarity. Our results show that proprietary VLMs best identify cultural relevance and semantic equivalence, while vision-encoder representations are adept at measuring visual similarity.
- Score: 52.71090829502756
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Beyond conventional paradigms of translating speech and text, recently, there has been interest in automated transcreation of images to facilitate localization of visual content across different cultures. Attempts to define this as a formal Machine Learning (ML) problem have been impeded by the lack of automatic evaluation mechanisms, with previous work relying solely on human evaluation. In this paper, we seek to close this gap by proposing a suite of automatic evaluation metrics inspired by machine translation (MT) metrics, categorized into: a) Object-based, b) Embedding-based, and c) VLM-based. Drawing on theories from translation studies and real-world transcreation practices, we identify three critical dimensions of image transcreation: cultural relevance, semantic equivalence and visual similarity, and design our metrics to evaluate systems along these axes. Our results show that proprietary VLMs best identify cultural relevance and semantic equivalence, while vision-encoder representations are adept at measuring visual similarity. Meta-evaluation across 7 countries shows our metrics agree strongly with human ratings, with average segment-level correlations ranging from 0.55-0.87. Finally, through a discussion of the merits and demerits of each metric, we offer a robust framework for automated image transcreation evaluation, grounded in both theoretical foundations and practical application. Our code can be found here: https://github.com/simran-khanuja/automatic-eval-img-transcreation.
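The abstract distinguishes Object-based, Embedding-based, and VLM-based metrics and a meta-evaluation against human ratings. The sketch below illustrates only the Embedding-based flavor (cosine similarity between vision-encoder embeddings of the source and transcreated images) and a segment-level correlation step. It is a minimal sketch under stated assumptions, not the authors' implementation: the CLIP checkpoint, cosine-similarity scoring, and the choice of Spearman correlation are illustrative; the actual code is in the linked repository.

```python
# Illustrative sketch of an embedding-based visual-similarity score and a
# segment-level meta-evaluation step, in the spirit of the metrics described above.
# Assumptions (not from the paper): the CLIP checkpoint, cosine-similarity scoring,
# and Spearman correlation are illustrative choices; the authors' code lives at
# https://github.com/simran-khanuja/automatic-eval-img-transcreation.
import torch
from PIL import Image
from scipy.stats import spearmanr
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

MODEL_NAME = "openai/clip-vit-base-patch32"  # assumed vision encoder
processor = CLIPImageProcessor.from_pretrained(MODEL_NAME)
encoder = CLIPVisionModelWithProjection.from_pretrained(MODEL_NAME).eval()

def visual_similarity(source_path: str, transcreated_path: str) -> float:
    """Cosine similarity between vision-encoder embeddings of the source image
    and its transcreated counterpart (an 'Embedding-based' style score)."""
    images = [Image.open(p).convert("RGB") for p in (source_path, transcreated_path)]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        embeds = encoder(**inputs).image_embeds          # shape: (2, dim)
    embeds = torch.nn.functional.normalize(embeds, dim=-1)
    return float(embeds[0] @ embeds[1])

def segment_level_correlation(metric_scores, human_ratings) -> float:
    """Meta-evaluation step: correlate per-example metric scores with human
    ratings (e.g., within one country's subset of the evaluation data)."""
    rho, _ = spearmanr(metric_scores, human_ratings)
    return rho
```

In this reading, the 0.55-0.87 figures reported above would correspond to averaging such per-segment correlations within each of the 7 countries; the abstract does not state whether the correlation is Spearman, Pearson, or Kendall, so the choice here is an assumption.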
Related papers
- Evaluating Text-to-Image Synthesis with a Conditional Fréchet Distance [8.216807467478281]
Evaluating text-to-image synthesis is challenging due to misalignment between established metrics and human preferences.
We propose cFreD, a metric that accounts for both visual fidelity and text-prompt alignment.
Our findings validate cFreD as a robust, future-proof metric for the systematic evaluation of text-to-image models.
arXiv Detail & Related papers (2025-03-27T17:35:14Z)
- HMGIE: Hierarchical and Multi-Grained Inconsistency Evaluation for Vision-Language Data Cleansing [54.970275599061594]
We design an adaptive evaluation framework, called Hierarchical and Multi-Grained Inconsistency Evaluation (HMGIE). HMGIE can provide multi-grained evaluations covering both accuracy and completeness for various image-caption pairs. To verify the efficacy and flexibility of the proposed framework, we construct MVTID, an image-caption dataset with diverse types and granularities of inconsistencies.
arXiv Detail & Related papers (2024-12-07T15:47:49Z)
- Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In! [80.3129093617928]
Annually, at the Conference of Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics.
This work highlights two issues with the meta-evaluation framework currently employed in WMT, and assesses their impact on the metrics rankings.
We introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process's accuracy, robustness, and fairness.
arXiv Detail & Related papers (2024-08-25T13:29:34Z)
- Holistic Evaluation for Interleaved Text-and-Image Generation [19.041251355695973]
We introduce InterleavedBench, the first benchmark carefully curated for the evaluation of interleaved text-and-image generation.
In addition, we present InterleavedEval, a strong reference-free metric powered by GPT-4o to deliver accurate and explainable evaluation.
arXiv Detail & Related papers (2024-06-20T18:07:19Z)
- Global-Local Image Perceptual Score (GLIPS): Evaluating Photorealistic Quality of AI-Generated Images [0.7499722271664147]
The Global-Local Image Perceptual Score (GLIPS) is an image metric designed to assess the photorealistic image quality of AI-generated images.
Comprehensive tests across various generative models demonstrate that GLIPS consistently outperforms existing metrics like FID, SSIM, and MS-SSIM in terms of correlation with human scores.
arXiv Detail & Related papers (2024-05-15T15:19:23Z)
- Evaluating MT Systems: A Theoretical Framework [0.0]
This paper outlines a theoretical framework with which different automatic metrics can be designed for the evaluation of Machine Translation systems.
It introduces the concept of "cognitive ease", which depends on "adequacy" and "lack of fluency".
It can also be used to evaluate the newer types of MT systems, such as speech to speech translation and discourse translation.
arXiv Detail & Related papers (2022-02-11T18:05:17Z)
- TISE: A Toolbox for Text-to-Image Synthesis Evaluation [9.092600296992925]
We conduct a study on state-of-the-art methods for single- and multi-object text-to-image synthesis.
We propose a common framework for evaluating these methods.
arXiv Detail & Related papers (2021-12-02T16:39:35Z)
- Transparent Human Evaluation for Image Captioning [70.03979566548823]
We develop a rubric-based human evaluation protocol for image captioning models.
We show that human-generated captions are of substantially higher quality than machine-generated ones.
We hope that this work will promote a more transparent evaluation protocol for image captioning.
arXiv Detail & Related papers (2021-11-17T07:09:59Z)
- Contrastive Semantic Similarity Learning for Image Captioning Evaluation with Intrinsic Auto-encoder [52.42057181754076]
Motivated by the auto-encoder mechanism and contrastive representation learning advances, we propose a learning-based metric for image captioning.
We develop three progressive model structures to learn the sentence level representations.
Experiment results show that our proposed method can align well with the scores generated from other contemporary metrics.
arXiv Detail & Related papers (2021-06-29T12:27:05Z)
- ImaginE: An Imagination-Based Automatic Evaluation Metric for Natural Language Generation [53.56628907030751]
We propose ImaginE, an imagination-based automatic evaluation metric for natural language generation.
With the help of CLIP and DALL-E, two cross-modal models pre-trained on large-scale image-text pairs, we automatically generate an image as the embodied imagination for the text snippet.
Experiments spanning several text generation tasks demonstrate that adding imagination with ImaginE shows great potential for introducing multi-modal information into NLG evaluation.
arXiv Detail & Related papers (2021-06-10T17:59:52Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)