Disturbing Image Detection Using LMM-Elicited Emotion Embeddings
- URL: http://arxiv.org/abs/2406.12668v1
- Date: Tue, 18 Jun 2024 14:41:04 GMT
- Title: Disturbing Image Detection Using LMM-Elicited Emotion Embeddings
- Authors: Maria Tzelepi, Vasileios Mezaris
- Abstract summary: We deal with the task of Disturbing Image Detection (DID), exploiting knowledge encoded in Large Multimodal Models (LMMs).
We propose to exploit LMM knowledge in a two-fold manner: first by extracting generic semantic descriptions, and second by extracting elicited emotions.
The proposed method significantly improves the baseline classification accuracy, achieving state-of-the-art performance on the augmented Disturbing Image Detection dataset.
- Score: 11.801596051153725
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper we deal with the task of Disturbing Image Detection (DID), exploiting knowledge encoded in Large Multimodal Models (LMMs). Specifically, we propose to exploit LMM knowledge in a two-fold manner: first by extracting generic semantic descriptions, and second by extracting elicited emotions. Subsequently, we use CLIP's text encoder to obtain the text embeddings of both the generic semantic descriptions and the LMM-elicited emotions. Finally, we use these text embeddings along with the corresponding CLIP image embeddings to perform the DID task. The proposed method significantly improves the baseline classification accuracy, achieving state-of-the-art performance on the augmented Disturbing Image Detection dataset.
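As a rough illustration of the described pipeline, below is a minimal sketch, assuming the LMM-generated semantic description and elicited emotions are already available as strings; the CLIP checkpoint name and the concatenate-then-classify design are assumptions for illustration, not details taken from the paper.
```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def did_features(image: Image.Image, description: str, emotions: str) -> torch.Tensor:
    """Build a DID feature vector: the CLIP image embedding concatenated with
    the text embeddings of the LMM description and the LMM-elicited emotions."""
    with torch.no_grad():
        img_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
        txt = processor(text=[description, emotions], return_tensors="pt",
                        padding=True, truncation=True)
        txt_emb = model.get_text_features(**txt)
    # (1, 512) image embedding + two (512,) text embeddings -> (1, 1536)
    return torch.cat([img_emb, txt_emb.reshape(1, -1)], dim=-1)
```
A classification head trained on these concatenated features would then make the disturbing / non-disturbing decision; the paper's exact classifier may differ.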
Related papers
- Exploiting LMM-based knowledge for image classification tasks [11.801596051153725]
We use the MiniGPT-4 model to extract semantic descriptions for the images.
In this paper, we propose to additionally use CLIP's text encoder to obtain the text embeddings corresponding to the MiniGPT-4-generated semantic descriptions.
The experimental evaluation on three datasets validates the improved classification performance achieved by exploiting LMM-based knowledge.
arXiv Detail & Related papers (2024-06-05T08:56:24Z)
- Generalizable Entity Grounding via Assistance of Large Language Model [77.07759442298666]
We propose a novel approach to densely ground visual entities from a long caption.
We leverage a large multimodal model to extract semantic nouns, a class-agnostic segmentation model to generate entity-level segmentation masks, and a multi-modal feature fusion module to associate each semantic noun with its corresponding segmentation mask.
arXiv Detail & Related papers (2024-02-04T16:06:05Z)
- Hijacking Context in Large Multi-modal Models [3.6411220072843866]
We introduce a new limitation of off-the-shelf LMMs: a small fraction of incoherent images misleads LMMs into generating only biased output about the hijacked context.
We propose a pre-filtering method that removes irrelevant contexts via GPT-4V, based on its robustness towards distribution shift within the contexts.
We investigate whether replacing the hijacked visual and textual contexts with the correlated ones via GPT-4V and text-to-image models can help yield coherent responses.
arXiv Detail & Related papers (2023-12-07T11:23:29Z)
- Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
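A hedged sketch of the sentence-level prompt idea follows, assuming BLIP-2 via the Hugging Face transformers library; the checkpoint and the fusion template are illustrative, not taken from the paper.
```python
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

proc = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip2 = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def sentence_level_prompt(reference_image, relative_caption: str) -> str:
    """Caption the reference image, then fuse the caption with the relative
    caption into one sentence-level query for text-to-image retrieval."""
    inputs = proc(images=reference_image, return_tensors="pt")
    with torch.no_grad():
        ids = blip2.generate(**inputs, max_new_tokens=30)
    caption = proc.batch_decode(ids, skip_special_tokens=True)[0].strip()
    return f"{caption}, but {relative_caption}"  # illustrative fusion template
```
The resulting sentence can then be scored against gallery images with any text-to-image retrieval model, e.g. CLIP similarity.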
arXiv Detail & Related papers (2023-10-09T07:31:44Z)
- TextDiff: Mask-Guided Residual Diffusion Models for Scene Text Image Super-Resolution [18.73348268987249]
TextDiff is a diffusion-based framework tailored for scene text image super-resolution.
It achieves state-of-the-art (SOTA) performance on public benchmark datasets.
Our proposed MRD module is plug-and-play and effectively sharpens the text edges produced by SOTA methods.
arXiv Detail & Related papers (2023-08-13T11:02:16Z)
- SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs [124.29233620842462]
We introduce SPAE for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos.
The resulting lexical tokens capture both the semantic meaning and the fine-grained details needed for visual reconstruction.
Our method marks the first successful attempt to enable a frozen LLM to generate image content, while surpassing state-of-the-art image understanding performance under the same setting by over 25%.
arXiv Detail & Related papers (2023-06-30T17:59:07Z)
- Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task via soft-masking the regions in an image.
We identify the regions relevant to each word by computing word-conditional visual attention using a multi-modal encoder.
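A minimal sketch of word-conditional attention usable as a soft mask is shown below; the feature shapes and temperature are assumptions, and the paper's multi-modal encoder is replaced by plain dot-product attention for illustration.
```python
import torch
import torch.nn.functional as F

def word_conditional_soft_mask(region_feats: torch.Tensor,  # (R, d) image region features
                               word_emb: torch.Tensor,      # (d,) one word embedding
                               tau: float = 0.1) -> torch.Tensor:
    """Attention of a word over image regions; high values mark regions that
    can be softly masked (down-weighted) to diversify ITM training features."""
    scores = region_feats @ word_emb / tau  # (R,) similarity logits
    return F.softmax(scores, dim=0)         # (R,) word-conditional attention
```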
arXiv Detail & Related papers (2023-04-03T05:07:49Z)
- StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training [64.37272287179661]
StrucTexTv2 is an effective document image pre-training framework.
It consists of two self-supervised pre-training tasks: masked image modeling and masked language modeling.
It achieves competitive or even new state-of-the-art performance in various downstream tasks such as image classification, layout analysis, table structure recognition, document OCR, and information extraction.
arXiv Detail & Related papers (2023-03-01T07:32:51Z)
- Texts as Images in Prompt Tuning for Multi-Label Image Recognition [70.9310322461598]
We advocate that image-text contrastive learning makes it feasible to treat texts as images for prompt tuning and introduce TaI prompting.
Particularly, we apply TaI prompting to multi-label image recognition, where sentences in the wild serve as alternatives to images for prompt tuning.
Our proposed TaI-DPT outperforms zero-shot CLIP by a large margin on multiple benchmarks.
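A conceptual sketch of the texts-as-images substitution follows; the shapes and the cosine scoring head are illustrative assumptions, not the paper's exact design.
```python
import torch
import torch.nn.functional as F

def multilabel_scores(sample_emb: torch.Tensor,        # (d,) caption OR image embedding
                      class_prompt_embs: torch.Tensor  # (C, d) tuned class-prompt embeddings
                      ) -> torch.Tensor:
    """Per-class scores via cosine similarity in CLIP's joint space."""
    sample_emb = F.normalize(sample_emb, dim=-1)
    class_prompt_embs = F.normalize(class_prompt_embs, dim=-1)
    return class_prompt_embs @ sample_emb  # (C,) multi-label logits
```
During prompt tuning, sample_emb comes from CLIP's text encoder applied to free-form sentences (texts as images); at test time the same scoring applies unchanged to real image embeddings, since both modalities share CLIP's joint space.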
arXiv Detail & Related papers (2022-11-23T07:00:11Z)
- CLIP-PAE: Projection-Augmentation Embedding to Extract Relevant Features for a Disentangled, Interpretable, and Controllable Text-Guided Face Manipulation [4.078926358349661]
Contrastive Language-Image Pre-Training (CLIP) bridges images and text by embedding them into a joint latent space.
Due to the discrepancy between image and text embeddings in the joint space, using text embeddings as the optimization target often introduces undesired artifacts in the resulting images.
We introduce CLIP Projection-Augmentation Embedding (PAE) as an optimization target to improve the performance of text-guided image manipulation.
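One plausible reading of the projection step, sketched with plain linear algebra: project the target text embedding onto a subspace spanned by relevant corpus embeddings before optimization. The corpus choice and the augmentation coefficient are assumptions, not paper values.
```python
import torch

def projection_augmented_embedding(text_emb: torch.Tensor,  # (d,) CLIP text embedding
                                   basis: torch.Tensor,     # (k, d) relevant corpus embeddings
                                   alpha: float = 1.0) -> torch.Tensor:
    """Project the target text embedding onto the subspace spanned by the
    corpus embeddings, then augment the original target with that component."""
    q, _ = torch.linalg.qr(basis.T)      # (d, k) orthonormal basis of the subspace
    proj = q @ (q.T @ text_emb)          # component of text_emb inside the subspace
    return text_emb + alpha * proj       # alpha is an illustrative coefficient
```
Restricting the target to this subspace is what filters out the embedding-space discrepancy that otherwise introduces artifacts.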
arXiv Detail & Related papers (2022-10-08T05:12:25Z)
- A Novel Actor Dual-Critic Model for Remote Sensing Image Captioning [32.11006090613004]
We address the problem of generating textual captions from optical remote sensing (RS) images using deep reinforcement learning.
We introduce an Actor Dual-Critic training strategy where a second critic model is deployed in the form of an encoder-decoder RNN.
We observe that, on the test data, the proposed model generates sentences highly similar to the ground truth, and in many critical cases it even generates better captions.
arXiv Detail & Related papers (2020-10-05T13:35:02Z)