Cheap-fake Detection with LLM using Prompt Engineering
- URL: http://arxiv.org/abs/2306.02776v1
- Date: Mon, 5 Jun 2023 11:01:00 GMT
- Title: Cheap-fake Detection with LLM using Prompt Engineering
- Authors: Guangyang Wu, Weijie Wu, Xiaohong Liu, Kele Xu, Tianjiao Wan, Wenyi Wang
- Abstract summary: The misuse of real photographs with conflicting image captions in news items is an example of the out-of-context (OOC) misuse of media.
This paper presents a novel learnable approach for detecting OOC media in the ICME'23 Grand Challenge on Detecting Cheapfakes.
- Score: 16.029353282421116
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The misuse of real photographs with conflicting image captions in news items
is an example of the out-of-context (OOC) misuse of media. In order to detect
OOC media, individuals must determine the accuracy of the statement and
evaluate whether the triplet (i.e., the image and two captions) relates to the
same event. This paper presents a novel learnable approach for detecting OOC
media in the ICME'23 Grand Challenge on Detecting Cheapfakes. The
proposed method is based on the COSMOS structure, which assesses the coherence
between an image and captions, as well as between two captions. We enhance the
baseline algorithm by incorporating a Large Language Model (LLM), GPT-3.5, as a
feature extractor. Specifically, we propose an innovative approach to feature
extraction that utilizes prompt engineering to develop a robust and reliable
feature extractor with the GPT-3.5 model. The proposed method captures the
correlation between two captions and effectively integrates this module into
the COSMOS baseline model, which allows for a deeper understanding of the
relationship between captions. By incorporating this module, we demonstrate the
potential for significant improvements in cheap-fake detection performance.
The proposed methodology holds promising implications for various applications
such as natural language processing, image captioning, and text-to-image
synthesis. Docker for submission is available at
https://hub.docker.com/repository/docker/mulns/acmmmcheapfakes.
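As a rough illustration of the prompt-engineering feature extractor described in the abstract, the sketch below turns GPT-3.5's judgment on a caption pair into a small feature vector for a downstream classifier. The prompt wording, the JSON answer format, and the call_gpt35 helper are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of prompt-engineered feature extraction for a caption
# pair. Prompt wording, JSON answer format, and `call_gpt35` are
# illustrative assumptions, not the authors' exact implementation.
import json

PROMPT_TEMPLATE = (
    "Caption 1: {c1}\n"
    "Caption 2: {c2}\n"
    "Do these two captions describe the same event? "
    'Reply with JSON: {{"same_event": true or false, "confidence": 0.0-1.0}}'
)

def call_gpt35(prompt: str) -> str:
    """Placeholder for a GPT-3.5 chat-completion call; plug in a real
    API client here."""
    raise NotImplementedError

def caption_pair_feature(caption1: str, caption2: str) -> list:
    """Convert the LLM's judgment into a small feature vector that a
    downstream classifier (e.g., the COSMOS head) can consume."""
    reply = call_gpt35(PROMPT_TEMPLATE.format(c1=caption1, c2=caption2))
    parsed = json.loads(reply)
    same_event = 1.0 if parsed["same_event"] else 0.0
    return [same_event, float(parsed["confidence"])]
```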
Related papers
- Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions [21.940022070054273]
We propose a three-phase framework to fine-tune existing captioning models.
First, an agent explores the environment, collecting noisy image-caption pairs.
Then, a consistent pseudo-caption for each object instance is distilled via consensus.
Finally, these pseudo-captions are used to fine-tune an off-the-shelf captioning model.
arXiv Detail & Related papers (2025-04-11T13:41:17Z)
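As a rough illustration of the consensus distillation step described in the entry above, here is a minimal sketch that picks one pseudo-caption per object instance by majority vote; the paper's actual consensus mechanism may differ (e.g., embedding-based agreement).

```python
# Illustrative sketch: distill one pseudo-caption per object instance
# from its noisy captions by majority vote over normalized strings.
from collections import Counter

def consensus_pseudo_caption(captions):
    """Pick the most frequent caption (after normalization); ties are
    broken by first occurrence."""
    normalized = [c.strip().lower() for c in captions]
    winner, _ = Counter(normalized).most_common(1)[0]
    # Return the first original caption matching the winning form.
    return next(c for c in captions if c.strip().lower() == winner)
```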
- BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues [47.213906345208315]
We propose BRIDGE, a new learnable and reference-free image captioning metric.
Our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores.
arXiv Detail & Related papers (2024-07-29T18:00:17Z)
- Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding [54.532578213126065]
Most document understanding methods preserve all tokens within sub-images and treat them equally.
This neglects their different informativeness and leads to a significant increase in the number of image tokens.
We propose Token-level Correlation-guided Compression, a parameter-free and plug-and-play methodology to optimize token processing.
arXiv Detail & Related papers (2024-07-19T16:11:15Z)
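The entry above describes scoring tokens by informativeness without learned parameters; a minimal sketch of that idea is to rank patch tokens by their correlation with a global token and keep the top-k. The exact correlation measure and selection rule used in the paper are assumptions here.

```python
# Minimal sketch of correlation-guided token compression: score each
# patch token against a global (CLS) token and keep the top-k, with no
# learned parameters.
import torch
import torch.nn.functional as F

def prune_tokens(tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """tokens: (N, D), where tokens[0] is the global/CLS token."""
    cls_tok, patches = tokens[:1], tokens[1:]
    scores = F.cosine_similarity(patches, cls_tok, dim=-1)  # (N-1,)
    k = max(1, int(keep_ratio * patches.shape[0]))
    keep = scores.topk(k).indices.sort().values  # keep original order
    return torch.cat([cls_tok, patches[keep]], dim=0)
```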
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
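A minimal sketch of the external kNN memory lookup from the entry above: retrieve the captions whose stored image features are closest to the query image feature. All names here are illustrative placeholders.

```python
# Sketch of a kNN lookup over an external (feature, caption) memory
# using cosine similarity.
import numpy as np

def retrieve_captions(query_feat, memory_feats, memory_captions, k=5):
    """query_feat: (D,); memory_feats: (M, D); memory_captions: list of M."""
    q = query_feat / np.linalg.norm(query_feat)
    m = memory_feats / np.linalg.norm(memory_feats, axis=1, keepdims=True)
    top = np.argsort(-(m @ q))[:k]  # indices of the k most similar entries
    return [memory_captions[i] for i in top]
```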
- Synchronizing Vision and Language: Bidirectional Token-Masking AutoEncoder for Referring Image Segmentation [26.262887028563163]
Referring Image Segmentation (RIS) aims to segment target objects expressed in natural language within a scene at the pixel level.
We propose a novel bidirectional token-masking autoencoder (BTMAE) inspired by the masked autoencoder (MAE).
BTMAE learns the context of image-to-language and language-to-image by reconstructing missing features in both image and language features at the token level.
arXiv Detail & Related papers (2023-11-29T07:33:38Z)
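A conceptual sketch of the bidirectional token masking described in the entry above: hide a fraction of tokens in one modality and reconstruct them conditioned on the other, in both directions. The mask ratio, mask value, and reconstructor signature are all assumptions.

```python
# Conceptual sketch of bidirectional token masking and reconstruction.
import torch

def mask_tokens(tokens: torch.Tensor, ratio: float = 0.3):
    """tokens: (N, D). Zero out ~ratio of rows; return tokens and mask."""
    mask = torch.rand(tokens.shape[0]) < ratio
    if not mask.any():
        mask[0] = True  # guarantee at least one masked position
    masked = tokens.clone()
    masked[mask] = 0.0
    return masked, mask

def btmae_style_loss(reconstructor, image_tok, text_tok):
    """Mask each modality in turn, reconstruct it from the other, and
    penalize errors only at masked positions."""
    loss = torch.tensor(0.0)
    for src, ctx in ((image_tok, text_tok), (text_tok, image_tok)):
        masked, mask = mask_tokens(src)
        recon = reconstructor(masked, ctx)  # hypothetical signature
        loss = loss + ((recon - src)[mask] ** 2).mean()
    return loss
```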
- Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
arXiv Detail & Related papers (2023-10-09T07:31:44Z)
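A minimal sketch of the query-construction idea from the entry above: a captioner (e.g., BLIP-2) describes the reference image, and that sentence is combined with the relative caption to form the text query. The function names and joining template are illustrative placeholders.

```python
# Sketch: compose a sentence-level prompt with the relative caption and
# encode the result as the retrieval query.
def build_cir_query(reference_image, relative_caption, captioner, text_encoder):
    """captioner: image -> sentence; text_encoder: text -> embedding."""
    prompt = captioner(reference_image)          # e.g., "a red floral dress"
    query = f"{prompt}, but {relative_caption}"  # assumed joining template
    return text_encoder(query)                   # embedding used for search
```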
- Towards Effective Image Manipulation Detection with Proposal Contrastive Learning [61.5469708038966]
We propose Proposal Contrastive Learning (PCL) for effective image manipulation detection.
Our PCL consists of a two-stream architecture by extracting two types of global features from RGB and noise views respectively.
Our PCL can be easily adapted to unlabeled data in practice, which can reduce manual labeling costs and promote more generalizable features.
arXiv Detail & Related papers (2022-10-16T13:30:13Z)
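A minimal sketch of the two-stream contrastive objective from the entry above: RGB-view and noise-view features of the same proposal form positive pairs, and other proposals in the batch act as negatives. This is a standard InfoNCE form; the paper's exact loss may differ.

```python
# Sketch of a symmetric two-view InfoNCE loss over proposal features.
import torch
import torch.nn.functional as F

def info_nce(rgb_feats, noise_feats, temperature=0.07):
    """rgb_feats, noise_feats: (B, D); row i of each is the same proposal."""
    rgb = F.normalize(rgb_feats, dim=-1)
    noise = F.normalize(noise_feats, dim=-1)
    logits = rgb @ noise.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(rgb.shape[0])     # diagonal = positive pairs
    return F.cross_entropy(logits, targets)
```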
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
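A minimal sketch of the text-to-pixel alignment from the entry above: score every pixel embedding against the sentence embedding and threshold the similarity map into a mask. The shapes and threshold value are illustrative.

```python
# Sketch: cosine-similarity map between a text embedding and per-pixel
# embeddings, thresholded into a binary segmentation mask.
import torch
import torch.nn.functional as F

def text_to_pixel_mask(pixel_feats, text_feat, threshold=0.2):
    """pixel_feats: (D, H, W); text_feat: (D,). Returns an (H, W) bool mask."""
    d, h, w = pixel_feats.shape
    pixels = F.normalize(pixel_feats.reshape(d, -1), dim=0)  # (D, H*W)
    text = F.normalize(text_feat, dim=0)                     # (D,)
    sim = (text @ pixels).reshape(h, w)                      # cosine map
    return sim > threshold
```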
- Contrastive Semantic Similarity Learning for Image Captioning Evaluation with Intrinsic Auto-encoder [52.42057181754076]
Motivated by the auto-encoder mechanism and advances in contrastive representation learning, we propose a learning-based metric for image captioning.
We develop three progressive model structures to learn the sentence level representations.
Experiment results show that our proposed method can align well with the scores generated from other contemporary metrics.
arXiv Detail & Related papers (2021-06-29T12:27:05Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experiment results show that our proposed method maintains robust performance and gives more flexible scores to candidate captions when encountering semantically similar expressions or less-aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
- An Effective Automatic Image Annotation Model Via Attention Model and Data Equilibrium [0.0]
The proposed model has three phases: a feature extractor, a tag generator, and an image annotator.
The experiments conducted on two benchmark datasets confirm the superiority of the proposed model over previous models.
arXiv Detail & Related papers (2020-01-26T05:59:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.