SPOLRE: Semantic Preserving Object Layout Reconstruction for Image Captioning System Testing
- URL: http://arxiv.org/abs/2407.18512v1
- Date: Fri, 26 Jul 2024 04:46:31 GMT
- Title: SPOLRE: Semantic Preserving Object Layout Reconstruction for Image Captioning System Testing
- Authors: Yi Liu, Guanyu Wang, Xinyi Zheng, Gelei Deng, Kailong Wang, Yang Liu, Haoyu Wang
- Abstract summary: SPOLRE is an automated tool for semantic-preserving object layout reconstruction in IC system testing.
It eliminates the need for manual annotations and creates realistic, varied test suites.
SPOLRE excels in identifying caption errors, detecting 31,544 incorrect captions across seven IC systems with an average precision of 91.62%.
- Score: 12.895128109843071
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image captioning (IC) systems, such as Microsoft Azure Cognitive Service, translate image content into descriptive language but can generate inaccuracies leading to misinterpretations. Advanced testing techniques like MetaIC and ROME aim to address these issues but face significant challenges. These methods require intensive manual labor for detailed annotations and often produce unrealistic images, either by adding unrelated objects or failing to remove existing ones. Additionally, they generate limited test suites, with MetaIC restricted to inserting specific objects and ROME limited to a narrow range of variations. We introduce SPOLRE, a novel automated tool for semantic-preserving object layout reconstruction in IC system testing. SPOLRE leverages four transformation techniques to modify object layouts without altering the image's semantics. This automated approach eliminates the need for manual annotations and creates realistic, varied test suites. Our tests show that over 75% of survey respondents find SPOLRE-generated images more realistic than those from state-of-the-art methods. SPOLRE excels in identifying caption errors, detecting 31,544 incorrect captions across seven IC systems with an average precision of 91.62%, surpassing other methods which average 85.65% accuracy and identify 17,160 incorrect captions. Notably, SPOLRE identified 6,236 unique issues within Azure, demonstrating its effectiveness against one of the most advanced IC systems.
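The core testing idea in the abstract is metamorphic: because SPOLRE's four layout transformations preserve an image's semantics, the set of objects mentioned in the caption of the original image and in the caption of the transformed image should coincide, and any mismatch flags a suspicious caption. A minimal sketch of that oracle, with hypothetical helper names (`caption()` and the layout transformation are not shown, and object extraction is simplified to keyword matching):

```python
# Hypothetical sketch of a semantic-preserving metamorphic check for an
# image captioning (IC) system. Object extraction is reduced to matching
# against a small vocabulary; a real pipeline would use an object parser.

OBJECT_VOCAB = {"dog", "cat", "person", "car", "frisbee"}

def extract_objects(caption: str) -> set[str]:
    """Return the set of known object words mentioned in a caption."""
    words = {w.strip(".,").lower() for w in caption.split()}
    return words & OBJECT_VOCAB

def check_metamorphic_relation(orig_caption: str, follow_up_caption: str) -> bool:
    """A semantic-preserving layout transformation should leave the
    mentioned object set unchanged; a mismatch flags a caption error."""
    return extract_objects(orig_caption) == extract_objects(follow_up_caption)

# Example: the layout changed but the depicted objects did not, so both
# captions should mention the same objects.
ok = check_metamorphic_relation(
    "A dog catches a frisbee in a park.",
    "A frisbee is caught by a dog on the grass.",
)
```

This check needs no ground-truth annotations, which is how the abstract's claim of eliminating manual labeling is realized: the original caption itself serves as the reference.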
Related papers
- Mero Nagarikta: Advanced Nepali Citizenship Data Extractor with Deep Learning-Powered Text Detection and OCR [0.0]
This work proposes a robust system using YOLOv8 for accurate text object detection and an OCR algorithm based on Optimized PyTesseract.
The system, implemented within the context of a mobile application, allows for the automated extraction of important textual information.
The tested PyTesseract, optimized for Nepali characters, outperformed standard OCR in both flexibility and accuracy.
arXiv Detail & Related papers (2024-10-08T06:29:08Z) - Ground-A-Score: Scaling Up the Score Distillation for Multi-Attribute Editing [49.419619882284906]
Ground-A-Score is a powerful model-agnostic image editing method that incorporates grounding during score distillation.
The selective application with a new penalty coefficient and contrastive loss helps to precisely target editing areas.
Both qualitative assessments and quantitative analyses confirm that Ground-A-Score successfully adheres to the intricate details of extended and multifaceted prompts.
arXiv Detail & Related papers (2024-03-20T12:40:32Z) - Toward Real Text Manipulation Detection: New Dataset and New Solution [58.557504531896704]
High costs associated with professional text manipulation limit the availability of real-world datasets.
We present the Real Text Manipulation dataset, encompassing 14,250 text images.
Our contributions aim to propel advancements in real-world text tampering detection.
arXiv Detail & Related papers (2023-12-12T02:10:16Z) - Metamorphic Testing of Image Captioning Systems via Image-Level Reduction [1.3225694028747141]
In this paper, we propose REIC, which performs metamorphic testing for IC systems using image-level reduction transformations.
Because the image-level reduction transformations do not artificially manipulate any objects, REIC avoids generating unrealistic follow-up images.
arXiv Detail & Related papers (2023-11-20T14:17:52Z) - Dynamic Prompt Learning: Addressing Cross-Attention Leakage for Text-Based Image Editing [23.00202969969574]
We propose Dynamic Prompt Learning (DPL) to force cross-attention maps to focus on correct noun words in the text prompt.
We show improved prompt editing results for Word-Swap, Prompt Refinement, and Attention Re-weighting, especially for complex multi-object scenes.
arXiv Detail & Related papers (2023-09-27T13:55:57Z) - ROME: Testing Image Captioning Systems via Recursive Object Melting [10.111847749807923]
Recursive Object Melting (ROME) is a novel metamorphic testing approach for validating image captioning systems.
ROME assumes that the object set in the caption of an image is a superset of the object set in the caption of the image generated after object melting.
We use ROME to test one widely-adopted image captioning API and four state-of-the-art (SOTA) algorithms.
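ROME's assumption amounts to a subset check: after objects are "melted" (removed) from an image, the follow-up caption should not mention any object absent from the source caption. A minimal sketch of that relation (function and variable names are hypothetical, and object sets are taken as given rather than parsed from captions):

```python
# Minimal sketch of ROME's metamorphic relation: after melting objects
# out of an image, the follow-up caption's object set must be a subset
# of the source caption's object set. A violation signals a caption error.

def violates_melting_relation(source_objects: set[str],
                              follow_up_objects: set[str]) -> bool:
    """True if the follow-up caption mentions an object that the source
    caption did not, breaking the expected subset relation."""
    return not follow_up_objects.issubset(source_objects)

# Source image shows a dog and a frisbee; the frisbee is melted away.
assert not violates_melting_relation({"dog", "frisbee"}, {"dog"})
# A follow-up caption that hallucinates a cat violates the relation.
assert violates_melting_relation({"dog", "frisbee"}, {"dog", "cat"})
```

Note the asymmetry with SPOLRE's relation above in the abstract: melting is not semantic-preserving, so ROME checks containment rather than equality of object sets.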
arXiv Detail & Related papers (2023-06-04T01:38:55Z) - iEdit: Localised Text-guided Image Editing with Weak Supervision [53.082196061014734]
We propose a novel learning method for text-guided image editing.
It generates images conditioned on a source image and a textual edit prompt.
It shows favourable results against its counterparts in terms of image fidelity, CLIP alignment score and qualitatively for editing both generated and real images.
arXiv Detail & Related papers (2023-05-10T07:39:14Z) - ObjectFormer for Image Manipulation Detection and Localization [118.89882740099137]
We propose ObjectFormer to detect and localize image manipulations.
We extract high-frequency features of the images and combine them with RGB features as multimodal patch embeddings.
We conduct extensive experiments on various datasets and the results verify the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-03-28T12:27:34Z) - NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media [93.51739200834837]
We propose a dataset where both image and text are unmanipulated but mismatched.
We introduce several strategies for automatic retrieval of suitable images for the given captions.
Our large-scale automatically generated NewsCLIPpings dataset requires models to jointly analyze both modalities.
arXiv Detail & Related papers (2021-04-13T01:53:26Z) - Catching Out-of-Context Misinformation with Self-supervised Learning [2.435006380732194]
We propose a new method that automatically detects out-of-context image and text pairs.
Our core idea is a self-supervised training strategy where we only need images with matching captions from different sources.
Our method achieves 82% out-of-context detection accuracy.
arXiv Detail & Related papers (2021-01-15T19:00:42Z) - Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experiment results show that our proposed method maintains robust performance and gives more flexible scores to candidate captions when encountering semantically similar expressions or less aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.