ROME: Testing Image Captioning Systems via Recursive Object Melting
- URL: http://arxiv.org/abs/2306.02228v2
- Date: Sun, 30 Jul 2023 08:02:51 GMT
- Title: ROME: Testing Image Captioning Systems via Recursive Object Melting
- Authors: Boxi Yu, Zhiqing Zhong, Jiaqi Li, Yixing Yang, Shilin He, Pinjia He
- Abstract summary: Recursive Object MElting (Rome) is a novel metamorphic testing approach for validating image captioning systems.
Rome assumes that the object set in the caption of an image includes the object set in the caption of a generated image after object melting.
We use Rome to test one widely-adopted image captioning API and four state-of-the-art (SOTA) algorithms.
- Score: 10.111847749807923
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image captioning (IC) systems aim to generate a text description of the
salient objects in an image. In recent years, IC systems have been increasingly
integrated into our daily lives, such as assistance for visually impaired
people and description generation in Microsoft PowerPoint. However, even
cutting-edge IC systems (e.g., Microsoft Azure Cognitive Services) and
algorithms (e.g., OFA) could produce erroneous captions, leading to incorrect
captioning of important objects, misunderstanding, and threats to personal
safety. The existing testing approaches either fail to handle the complex form
of IC system output (i.e., sentences in natural language) or generate unnatural
images as test cases. To address these problems, we introduce Recursive Object
MElting (Rome), a novel metamorphic testing approach for validating IC systems.
Unlike existing approaches that generate test cases by inserting objects,
which easily makes the generated images unnatural, Rome melts (i.e., removes
and inpaints) objects. Rome assumes that the object set in the caption of
an image includes the object set in the caption of a generated image after
object melting. Given an image, Rome can recursively remove its objects to
generate different pairs of images. We use Rome to test one widely-adopted
image captioning API and four state-of-the-art (SOTA) algorithms. The results
show that the test cases generated by Rome look much more natural than those
produced by the SOTA IC testing approach, and they achieve naturalness
comparable to that of the original images. Meanwhile, by generating test pairs
from 226 seed images, Rome reports
a total of 9,121 erroneous issues with high precision (86.47%-92.17%). In
addition, we further utilize the test cases generated by Rome to retrain
Oscar, which improves its performance across multiple evaluation metrics.
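To make the metamorphic relation concrete, here is a minimal Python sketch of the subset check it implies. It is an illustration of the relation, not the paper's implementation: `caption_image` (the IC system under test), `melt_object` (the remove-and-inpaint step), and `vocabulary` (a set of object nouns) are assumed stand-ins.

```python
# Minimal, hypothetical sketch of Rome's metamorphic relation: after
# melting (removing and inpainting) an object, every object named in the
# follow-up caption should already appear in the source caption.
# `caption_image`, `melt_object`, and `vocabulary` are assumed stand-ins,
# not the paper's actual interfaces.

def extract_objects(caption: str, vocabulary: set[str]) -> set[str]:
    """Collect the known object nouns mentioned in a caption."""
    tokens = {word.strip(".,").lower() for word in caption.split()}
    return tokens & vocabulary

def melting_relation_holds(image, obj, caption_image, melt_object,
                           vocabulary: set[str]) -> bool:
    """One melting step: follow-up objects must be a subset of source objects."""
    source = extract_objects(caption_image(image), vocabulary)
    melted = melt_object(image, obj)  # remove `obj` from the image, then inpaint
    followup = extract_objects(caption_image(melted), vocabulary)
    return followup <= source  # a violation flags a suspicious caption pair

def melt_recursively(image, objects, caption_image, melt_object,
                     vocabulary: set[str]):
    """Melt objects one at a time, recursing on each melted image and
    yielding (image, object) pairs that violate the relation."""
    for obj in objects:
        if not melting_relation_holds(image, obj, caption_image,
                                      melt_object, vocabulary):
            yield image, obj
        remaining = [o for o in objects if o != obj]
        yield from melt_recursively(melt_object(image, obj), remaining,
                                    caption_image, melt_object, vocabulary)
```

In this sketch, any pair yielded would be reported as a suspicious test case; the recursion mirrors how Rome derives multiple image pairs from a single seed image.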
Related papers
- Zero-Shot Detection of AI-Generated Images [54.01282123570917]
We propose a zero-shot entropy-based detector (ZED) to detect AI-generated images.
Inspired by recent works on machine-generated text detection, our idea is to measure how surprising the image under analysis is compared to a model of real images.
ZED achieves an average improvement of more than 3% over the SoTA in terms of accuracy.
arXiv Detail & Related papers (2024-09-24T08:46:13Z)
- ABHINAW: A method for Automatic Evaluation of Typography within AI-Generated Images [0.44241702149260337]
We introduce a novel evaluation matrix designed explicitly for quantifying the performance of text and typography generation within AI-generated images.
Our approach to calculating the score accounts for multiple redundancies, such as repetition of words, case sensitivity, mixing of words, and irregular incorporation of letters.
arXiv Detail & Related papers (2024-09-18T11:04:35Z)
- SPOLRE: Semantic Preserving Object Layout Reconstruction for Image Captioning System Testing [12.895128109843071]
SPOLRE is an automated tool for semantic-preserving object layout reconstruction in IC system testing.
It eliminates the need for manual annotations and creates realistic, varied test suites.
SPOLRE excels in identifying caption errors, detecting 31,544 incorrect captions across seven IC systems with an average precision of 91.62%.
arXiv Detail & Related papers (2024-07-26T04:46:31Z)
- A Sanity Check for AI-generated Image Detection [49.08585395873425]
We present a sanity check on whether the task of AI-generated image detection has been solved.
To quantify the generalization of existing methods, we evaluate 9 off-the-shelf AI-generated image detectors on the Chameleon dataset.
We propose AIDE (AI-generated Image DEtector with Hybrid Features), which leverages multiple experts to simultaneously extract visual artifacts and noise patterns.
arXiv Detail & Related papers (2024-06-27T17:59:49Z)
- Unified Text-to-Image Generation and Retrieval [96.72318842152148]
We propose a unified framework in the context of Multimodal Large Language Models (MLLMs).
We first explore the intrinsic discriminative abilities of MLLMs and introduce a generative retrieval method to perform retrieval in a training-free manner.
We then unify generation and retrieval in an autoregressive manner and propose an autonomous decision module to choose the best match between generated and retrieved images.
arXiv Detail & Related papers (2024-06-09T15:00:28Z)
- RIGID: A Training-free and Model-Agnostic Framework for Robust AI-Generated Image Detection [60.960988614701414]
RIGID is a training-free and model-agnostic method for robust AI-generated image detection.
RIGID significantly outperforms existing training-based and training-free detectors.
arXiv Detail & Related papers (2024-05-30T14:49:54Z)
- TIER: Text-Image Encoder-based Regression for AIGC Image Quality Assessment [2.59079758388817]
In AIGCIQA tasks, images are typically generated by generative models using text prompts.
Most existing AIGCIQA methods regress predicted scores directly from individual generated images, without considering the corresponding text prompts.
We propose a text-image encoder-based regression (TIER) framework to address this issue.
arXiv Detail & Related papers (2024-01-08T12:35:15Z)
- Metamorphic Testing of Image Captioning Systems via Image-Level Reduction [1.3225694028747141]
In this paper, we propose REIC, which performs metamorphic testing for IC systems using image-level reduction transformations.
Because image-level reduction transformations do not artificially manipulate any objects, REIC avoids generating unrealistic follow-up images.
arXiv Detail & Related papers (2023-11-20T14:17:52Z)
- Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
arXiv Detail & Related papers (2023-10-09T07:31:44Z)
- Dynamic Prompt Learning: Addressing Cross-Attention Leakage for Text-Based Image Editing [23.00202969969574]
We propose Dynamic Prompt Learning (DPL) to force cross-attention maps to focus on correct noun words in the text prompt.
We show improved prompt editing results for Word-Swap, Prompt Refinement, and Attention Re-weighting, especially for complex multi-object scenes.
arXiv Detail & Related papers (2023-09-27T13:55:57Z)
- NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media [93.51739200834837]
We propose a dataset where both image and text are unmanipulated but mismatched.
We introduce several strategies for automatic retrieval of suitable images for the given captions.
Our large-scale automatically generated NewsCLIPpings dataset requires models to jointly analyze both modalities.
arXiv Detail & Related papers (2021-04-13T01:53:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.