Disturbing Image Detection Using LMM-Elicited Emotion Embeddings
- URL: http://arxiv.org/abs/2406.12668v1
- Date: Tue, 18 Jun 2024 14:41:04 GMT
- Title: Disturbing Image Detection Using LMM-Elicited Emotion Embeddings
- Authors: Maria Tzelepi, Vasileios Mezaris
- Abstract summary: We deal with the task of Disturbing Image Detection (DID), exploiting knowledge encoded in Large Multimodal Models (LMMs).
We propose to exploit LMM knowledge in a two-fold manner: first by extracting generic semantic descriptions, and second by extracting elicited emotions.
The proposed method significantly improves the baseline classification accuracy, achieving state-of-the-art performance on the augmented Disturbing Image Detection dataset.
- Score: 11.801596051153725
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we deal with the task of Disturbing Image Detection (DID), exploiting knowledge encoded in Large Multimodal Models (LMMs). Specifically, we propose to exploit LMM knowledge in a two-fold manner: first by extracting generic semantic descriptions, and second by extracting elicited emotions. Subsequently, we use CLIP's text encoder to obtain the text embeddings of both the generic semantic descriptions and the LMM-elicited emotions. Finally, we use these text embeddings along with the corresponding CLIP image embeddings to perform the DID task. The proposed method significantly improves the baseline classification accuracy, achieving state-of-the-art performance on the augmented Disturbing Image Detection dataset.
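The abstract describes a simple pipeline: prompt an LMM for a generic semantic description of an image and for the emotions it elicits, embed those texts with CLIP's text encoder, embed the image with CLIP's image encoder, and use the combined embeddings for classification. Below is a minimal sketch of such a pipeline with the Hugging Face transformers CLIP API. The checkpoint, the fusion by concatenation, and the downstream classifier head are illustrative assumptions not specified by the abstract; the LMM prompting step is represented only by its text outputs.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
# ASSUMPTION: checkpoint is illustrative; the abstract does not name one.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_texts(texts):
    inputs = proc(text=texts, return_tensors="pt",
                  padding=True, truncation=True).to(device)
    with torch.no_grad():
        return clip.get_text_features(**inputs)  # (len(texts), d)

def embed_image(image):
    inputs = proc(images=image, return_tensors="pt").to(device)
    with torch.no_grad():
        return clip.get_image_features(**inputs)  # (1, d)

def did_feature(image, description, emotions):
    """Build a DID feature vector from one image plus its LMM outputs.

    description: generic semantic description generated by the LMM (str)
    emotions: LMM-elicited emotions for the image (list of str, fixed length)
    """
    img_emb = embed_image(image)
    txt_emb = embed_texts([description] + emotions)
    # ASSUMPTION: the abstract only says the text and image embeddings are
    # used together for DID; plain concatenation is one plausible fusion.
    return torch.cat([img_emb.flatten(), txt_emb.flatten()])

# A small classifier head (disturbing vs. non-disturbing) could then be
# trained on these fused features; the abstract does not specify its form.
```

Keeping both encoders frozen and training only a lightweight head is the cheapest configuration consistent with the abstract; whether any component is fine-tuned is not stated there.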
Related papers
- ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification [52.405499816861635]
The multiple instance learning (MIL)-based framework has become the mainstream approach for processing whole slide images (WSIs).
We propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
arXiv Detail & Related papers (2025-02-12T13:28:46Z)
- Knowing Where to Focus: Attention-Guided Alignment for Text-based Person Search [64.15205542003056]
We introduce the Attention-Guided Alignment (AGA) framework, featuring two innovative components: Attention-Guided Mask (AGM) Modeling and a Text Enrichment Module (TEM).
AGA achieves new state-of-the-art results with Rank-1 accuracy reaching 78.36%, 67.31%, and 67.4% on CUHK-PEDES, ICFG-PEDES, and RSTP, respectively.
arXiv Detail & Related papers (2024-12-19T17:51:49Z)
- LMM-Regularized CLIP Embeddings for Image Classification [11.801596051153725]
We deal with image classification tasks using the powerful CLIP vision-language model.
We propose a novel Large Multimodal Model (LMM) based regularization method.
In this way, it produces embeddings with enhanced discrimination ability.
arXiv Detail & Related papers (2024-12-16T11:11:23Z)
- EntityCLIP: Entity-Centric Image-Text Matching via Multimodal Attentive Contrastive Learning [38.30565103892611]
In this paper, we work towards the Entity-centric Image-Text Matching (EITM) problem.
The challenge of this task mainly lies in the larger semantic gap in entity association modeling.
We devise a multimodal attentive contrastive learning framework for the EITM problem, developing a model named EntityCLIP; a generic sketch of the image-text contrastive objective such methods build on follows the list below.
arXiv Detail & Related papers (2024-10-23T12:12:56Z)
- ForgeryGPT: Multimodal Large Language Model For Explainable Image Forgery Detection and Localization [49.12958154544838]
ForgeryGPT is a novel framework that advances the Image Forgery Detection and Localization task.
It captures high-order correlations of forged images from diverse linguistic feature spaces.
It enables explainable generation and interactive dialogue through a newly customized Large Language Model (LLM) architecture.
arXiv Detail & Related papers (2024-10-14T07:56:51Z)
- AMNS: Attention-Weighted Selective Mask and Noise Label Suppression for Text-to-Image Person Retrieval [3.591122855617648]
A noisy correspondence (NC) issue exists due to poor image quality and labeling errors.
Random masking augmentation may inadvertently discard critical semantic content.
The Bidirectional Similarity Distribution Matching (BSDM) loss enables the model to effectively learn from positive pairs.
The Weight Adjustment Focal (WAF) loss improves the model's ability to handle hard samples.
arXiv Detail & Related papers (2024-09-10T10:08:01Z)
- Exploiting LMM-based knowledge for image classification tasks [11.801596051153725]
We use the MiniGPT-4 model to extract semantic descriptions for the images.
In this paper, we propose to additionally use the text encoder to obtain the text embeddings corresponding to the MiniGPT-4-generated semantic descriptions.
The experimental evaluation on three datasets validates the improved classification performance achieved by exploiting LMM-based knowledge.
arXiv Detail & Related papers (2024-06-05T08:56:24Z)
- Generalizable Entity Grounding via Assistance of Large Language Model [77.07759442298666]
We propose a novel approach to densely ground visual entities from a long caption.
We leverage a large multimodal model to extract semantic nouns, a class-agnostic segmentation model to generate entity-level segmentation, and a multi-modal feature fusion module to associate each semantic noun with its corresponding segmentation mask.
arXiv Detail & Related papers (2024-02-04T16:06:05Z)
- Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
arXiv Detail & Related papers (2023-10-09T07:31:44Z)
- TextDiff: Mask-Guided Residual Diffusion Models for Scene Text Image Super-Resolution [18.73348268987249]
TextDiff is a diffusion-based framework tailored for scene text image super-resolution.
It achieves state-of-the-art (SOTA) performance on public benchmark datasets.
Our proposed MRD module is plug-and-play and effectively sharpens the text edges produced by SOTA methods.
arXiv Detail & Related papers (2023-08-13T11:02:16Z)
- SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs [124.29233620842462]
We introduce SPAE for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos.
The resulting lexical tokens capture both the semantic meaning and the fine-grained details needed for visual reconstruction.
Our method marks the first successful attempt to enable a frozen LLM to generate image content while surpassing state-of-the-art performance in image understanding tasks, under the same setting, by over 25%.
arXiv Detail & Related papers (2023-06-30T17:59:07Z)
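Several of the papers above (EntityCLIP, the text-based person search and composed image retrieval entries) build on CLIP-style image-text contrastive learning. As referenced in the EntityCLIP entry, here is a minimal sketch of the symmetric InfoNCE objective such methods typically start from; it is a generic illustration, not the exact loss of any paper listed here.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of N matched image-text pairs.

    image_emb, text_emb: (N, d) embeddings; row i of each is a matched pair.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (N, N) cosine similarities
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    # Matched pairs sit on the diagonal; the loss averages the
    # image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```

Methods such as EntityCLIP then modify this baseline, e.g., with attention over multimodal cues; those specifics are not reproduced here.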
This list is automatically generated from the titles and abstracts of the papers in this site.