Disturbing Image Detection Using LMM-Elicited Emotion Embeddings
- URL: http://arxiv.org/abs/2406.12668v1
- Date: Tue, 18 Jun 2024 14:41:04 GMT
- Title: Disturbing Image Detection Using LMM-Elicited Emotion Embeddings
- Authors: Maria Tzelepi, Vasileios Mezaris
- Abstract summary: We deal with the task of Disturbing Image Detection (DID), exploiting knowledge encoded in Large Multimodal Models (LMMs).
We propose to exploit LMM knowledge in a two-fold manner: first by extracting generic semantic descriptions, and second by extracting elicited emotions.
The proposed method significantly improves the baseline classification accuracy, achieving state-of-the-art performance on the augmented Disturbing Image Detection dataset.
- Score: 11.801596051153725
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper we deal with the task of Disturbing Image Detection (DID), exploiting knowledge encoded in Large Multimodal Models (LMMs). Specifically, we propose to exploit LMM knowledge in a two-fold manner: first by extracting generic semantic descriptions, and second by extracting elicited emotions. Subsequently, we use CLIP's text encoder to obtain the text embeddings of both the generic semantic descriptions and the LMM-elicited emotions. Finally, we use these text embeddings along with the corresponding CLIP image embeddings to perform the DID task. The proposed method significantly improves the baseline classification accuracy, achieving state-of-the-art performance on the augmented Disturbing Image Detection dataset.
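As a rough illustration of the described pipeline, below is a minimal sketch, assuming the LMM-generated semantic description and elicited emotions are already available as strings; the CLIP checkpoint name and the concatenate-then-classify design are assumptions for illustration, not details taken from the paper.
```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def did_features(image: Image.Image, description: str, emotions: str) -> torch.Tensor:
    """Build a DID feature vector: the CLIP image embedding concatenated with
    the text embeddings of the LMM description and the LMM-elicited emotions."""
    with torch.no_grad():
        img_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
        txt = processor(text=[description, emotions], return_tensors="pt",
                        padding=True, truncation=True)
        txt_emb = model.get_text_features(**txt)
    # (1, 512) image embedding + two (512,) text embeddings -> (1, 1536)
    return torch.cat([img_emb, txt_emb.reshape(1, -1)], dim=-1)
```
A classification head trained on these concatenated features would then make the disturbing / non-disturbing decision; the paper's exact classifier may differ.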
Related papers
- Exploiting LMM-based knowledge for image classification tasks [11.801596051153725]
We use the MiniGPT-4 model to extract semantic descriptions for the images.
In this paper, we propose to additionally use CLIP's text encoder to obtain the text embeddings corresponding to the MiniGPT-4-generated semantic descriptions.
The experimental evaluation on three datasets validates the improved classification performance achieved by exploiting LMM-based knowledge.
arXiv Detail & Related papers (2024-06-05T08:56:24Z)
- Generalizable Entity Grounding via Assistance of Large Language Model [77.07759442298666]
We propose a novel approach to densely ground visual entities from a long caption.
We leverage a large multimodal model to extract semantic nouns, a class-agnostic segmentation model to generate entity-level segmentation masks, and a multi-modal feature fusion module to associate each semantic noun with its corresponding segmentation mask.
arXiv Detail & Related papers (2024-02-04T16:06:05Z)
- Hijacking Context in Large Multi-modal Models [3.6411220072843866]
We introduce a new limitation of off-the-shelf LMMs: a small fraction of incoherent images misleads LMMs into generating only biased output about the hijacked context.
We propose a pre-filtering method that removes irrelevant contexts via GPT-4V, based on its robustness towards distribution shift within the contexts.
We investigate whether replacing the hijacked visual and textual contexts with the correlated ones via GPT-4V and text-to-image models can help yield coherent responses.
arXiv Detail & Related papers (2023-12-07T11:23:29Z)
- Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
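A hedged sketch of the sentence-level prompt idea follows, assuming BLIP-2 via the Hugging Face transformers library; the checkpoint and the fusion template are illustrative, not taken from the paper.
```python
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

proc = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip2 = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def sentence_level_prompt(reference_image, relative_caption: str) -> str:
    """Caption the reference image, then fuse the caption with the relative
    caption into one sentence-level query for text-to-image retrieval."""
    inputs = proc(images=reference_image, return_tensors="pt")
    with torch.no_grad():
        ids = blip2.generate(**inputs, max_new_tokens=30)
    caption = proc.batch_decode(ids, skip_special_tokens=True)[0].strip()
    return f"{caption}, but {relative_caption}"  # illustrative fusion template
```
The resulting sentence can then be scored against gallery images with any text-to-image retrieval model, e.g. CLIP similarity.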
arXiv Detail & Related papers (2023-10-09T07:31:44Z)
- TextDiff: Mask-Guided Residual Diffusion Models for Scene Text Image Super-Resolution [18.73348268987249]
TextDiff is a diffusion-based framework tailored for scene text image super-resolution.
It achieves state-of-the-art (SOTA) performance on public benchmark datasets.
Our proposed MRD module is plug-and-play and effectively sharpens the text edges produced by SOTA methods.
arXiv Detail & Related papers (2023-08-13T11:02:16Z)
- SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs [124.29233620842462]
We introduce SPAE for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos.
The resulting lexical tokens capture both the semantic meaning and the fine-grained details needed for visual reconstruction.
Our method marks the first successful attempt to enable a frozen LLM to generate image content, while surpassing state-of-the-art image understanding performance under the same setting by over 25%.
arXiv Detail & Related papers (2023-06-30T17:59:07Z)
- Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task via soft-masking the regions in an image.
We identify the regions relevant to each word by computing word-conditional visual attention using a multi-modal encoder.
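A minimal sketch of word-conditional attention usable as a soft mask is shown below; the feature shapes and temperature are assumptions, and the paper's multi-modal encoder is replaced by plain dot-product attention for illustration.
```python
import torch
import torch.nn.functional as F

def word_conditional_soft_mask(region_feats: torch.Tensor,  # (R, d) image region features
                               word_emb: torch.Tensor,      # (d,) one word embedding
                               tau: float = 0.1) -> torch.Tensor:
    """Attention of a word over image regions; high values mark regions that
    can be softly masked (down-weighted) to diversify ITM training features."""
    scores = region_feats @ word_emb / tau  # (R,) similarity logits
    return F.softmax(scores, dim=0)         # (R,) word-conditional attention
```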
arXiv Detail & Related papers (2023-04-03T05:07:49Z)
- StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training [64.37272287179661]
StrucTexTv2 is an effective document image pre-training framework.
It consists of two self-supervised pre-training tasks: masked image modeling and masked language modeling.
It achieves competitive or even new state-of-the-art performance in various downstream tasks such as image classification, layout analysis, table structure recognition, document OCR, and information extraction.
arXiv Detail & Related papers (2023-03-01T07:32:51Z)
- Texts as Images in Prompt Tuning for Multi-Label Image Recognition [70.9310322461598]
We advocate that image-text contrastive learning makes it feasible to treat texts as images for prompt tuning and introduce TaI prompting.
Particularly, we apply TaI prompting to multi-label image recognition, where sentences in the wild serve as alternatives to images for prompt tuning.
Our proposed TaI-DPT outperforms zero-shot CLIP by a large margin on multiple benchmarks.
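A conceptual sketch of the texts-as-images substitution follows; the shapes and the cosine scoring head are illustrative assumptions, not the paper's exact design.
```python
import torch
import torch.nn.functional as F

def multilabel_scores(sample_emb: torch.Tensor,        # (d,) caption OR image embedding
                      class_prompt_embs: torch.Tensor  # (C, d) tuned class-prompt embeddings
                      ) -> torch.Tensor:
    """Per-class scores via cosine similarity in CLIP's joint space."""
    sample_emb = F.normalize(sample_emb, dim=-1)
    class_prompt_embs = F.normalize(class_prompt_embs, dim=-1)
    return class_prompt_embs @ sample_emb  # (C,) multi-label logits
```
During prompt tuning, sample_emb comes from CLIP's text encoder applied to free-form sentences (texts as images); at test time the same scoring applies unchanged to real image embeddings, since both modalities share CLIP's joint space.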
arXiv Detail & Related papers (2022-11-23T07:00:11Z)
- CLIP-PAE: Projection-Augmentation Embedding to Extract Relevant Features for a Disentangled, Interpretable, and Controllable Text-Guided Face Manipulation [4.078926358349661]
Contrastive Language-Image Pre-Training (CLIP) bridges images and text by embedding them into a joint latent space.
Due to the discrepancy between image and text embeddings in the joint space, using text embeddings as the optimization target often introduces undesired artifacts in the resulting images.
We introduce CLIP Projection-Augmentation Embedding (PAE) as an optimization target to improve the performance of text-guided image manipulation.
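One plausible reading of the projection step, sketched with plain linear algebra: project the target text embedding onto a subspace spanned by relevant corpus embeddings before optimization. The corpus choice and the augmentation coefficient are assumptions, not paper values.
```python
import torch

def projection_augmented_embedding(text_emb: torch.Tensor,  # (d,) CLIP text embedding
                                   basis: torch.Tensor,     # (k, d) relevant corpus embeddings
                                   alpha: float = 1.0) -> torch.Tensor:
    """Project the target text embedding onto the subspace spanned by the
    corpus embeddings, then augment the original target with that component."""
    q, _ = torch.linalg.qr(basis.T)      # (d, k) orthonormal basis of the subspace
    proj = q @ (q.T @ text_emb)          # component of text_emb inside the subspace
    return text_emb + alpha * proj       # alpha is an illustrative coefficient
```
Restricting the target to this subspace is what filters out the embedding-space discrepancy that otherwise introduces artifacts.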
arXiv Detail & Related papers (2022-10-08T05:12:25Z)
- A Novel Actor Dual-Critic Model for Remote Sensing Image Captioning [32.11006090613004]
We address the problem of generating textual captions from optical remote sensing (RS) images using deep reinforcement learning.
We introduce an Actor Dual-Critic training strategy where a second critic model is deployed in the form of an encoder-decoder RNN.
We observe that, on the test data, the proposed model generates sentences highly similar to the ground truth, and in many critical cases it even generates better captions.
arXiv Detail & Related papers (2020-10-05T13:35:02Z)