Hijacking Context in Large Multi-modal Models
- URL: http://arxiv.org/abs/2312.07553v2
- Date: Mon, 13 May 2024 10:42:05 GMT
- Title: Hijacking Context in Large Multi-modal Models
- Authors: Joonhyun Jeong
- Abstract summary: We introduce a new limitation of off-the-shelf LMMs where a small fraction of incoherent images mislead LMMs to only generate biased output about the hijacked context.
We propose a pre-filtering method that removes irrelevant contexts via GPT-4V, based on its robustness towards distribution shift within the contexts.
We investigate whether replacing the hijacked visual and textual contexts with the correlated ones via GPT-4V and text-to-image models can help yield coherent responses.
- Score: 3.6411220072843866
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, Large Multi-modal Models (LMMs) have demonstrated their ability to understand the visual contents of images given instructions regarding those images. Built upon Large Language Models (LLMs), LMMs also inherit their abilities and characteristics, such as in-context learning, where a coherent sequence of images and texts is given as the input prompt. However, we identify a new limitation of off-the-shelf LMMs: a small fraction of incoherent images or text descriptions misleads LMMs into generating only biased output about the hijacked context, not the originally intended context. To address this, we propose a pre-filtering method that removes irrelevant contexts via GPT-4V, based on its robustness towards distribution shift within the contexts. We further investigate whether replacing the hijacked visual and textual contexts with correlated ones via GPT-4V and text-to-image models can help yield coherent responses.
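The pre-filtering idea described in the abstract can be illustrated with a minimal sketch. The helper below is hypothetical (the paper uses GPT-4V as the relevance judge; here a toy keyword-majority judge stands in for that API call), but the overall flow matches the described method: score each in-context example for coherence with the rest of the prompt, and drop the outliers before querying the LMM.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: each in-context example is an
# (image_reference, caption) pair. The real method works on
# interleaved image-text prompts.
Example = Tuple[str, str]

def prefilter_context(
    examples: List[Example],
    is_coherent: Callable[[Example, List[Example]], bool],
) -> List[Example]:
    """Drop in-context examples that the judge flags as incoherent
    with the rest of the prompt, before passing them to the LMM."""
    return [ex for ex in examples if is_coherent(ex, examples)]

def majority_topic_judge(example: Example, examples: List[Example]) -> bool:
    """Toy stand-in for a GPT-4V relevance check: an example is
    coherent if its caption's first word matches the majority topic."""
    topics = [caption.split()[0] for _, caption in examples]
    majority = max(set(topics), key=topics.count)
    return example[1].split()[0] == majority

# A context hijacked by one incoherent example.
ctx = [
    ("img_dog1.png", "dog running in a park"),
    ("img_dog2.png", "dog catching a frisbee"),
    ("img_car.png", "car parked on a street"),  # the hijacking example
]
clean = prefilter_context(ctx, majority_topic_judge)
```

In the real pipeline, `is_coherent` would wrap a GPT-4V query asking whether a given image-text pair fits the rest of the demonstrations; the paper's further step of *replacing* (rather than dropping) hijacked examples via text-to-image models is not shown here.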
Related papers
- Beyond Single Frames: Can LMMs Comprehend Temporal and Contextual Narratives in Image Sequences? [32.61269125015993]
StripCipher is a benchmark designed to evaluate capabilities of Large Multimodal Models (LMMs) to comprehend and reason over sequential images.
StripCipher comprises a human-annotated dataset and three challenging subtasks: visual narrative comprehension, contextual frame prediction, and temporal narrative reordering.
Our evaluation of 16 state-of-the-art LMMs, including GPT-4o and Qwen2.5VL, reveals a significant performance gap compared to human capabilities.
arXiv Detail & Related papers (2025-02-19T18:04:44Z)
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- Boosting Text-To-Image Generation via Multilingual Prompting in Large Multimodal Models [43.16111789538798]
We build parallel multilingual prompts aimed at harnessing the multilingual capabilities of large multimodal models (LMMs)
Experiments on two LMMs across 3 benchmarks show that our method, PMT2I, achieves superior performance in general, compositional, and fine-grained assessments.
arXiv Detail & Related papers (2025-01-13T06:41:23Z)
- EntityCLIP: Entity-Centric Image-Text Matching via Multimodal Attentive Contrastive Learning [38.30565103892611]
In this paper, we work towards the Entity-centric Image-Text Matching (EITM) problem.
The challenge of this task mainly lies in the larger semantic gap in entity association modeling.
We devise a multimodal attentive contrastive learning framework to address the EITM problem, developing a model named EntityCLIP.
arXiv Detail & Related papers (2024-10-23T12:12:56Z)
- MMR: Evaluating Reading Ability of Large Multimodal Models [52.953316772123586]
Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of images, including text-rich images.
Current benchmarks fail to accurately reflect the performance of different models.
We propose the Multi-Modal Reading (MMR) benchmark, comprising 11 diverse tasks, to evaluate LMMs for text-rich image understanding.
arXiv Detail & Related papers (2024-08-26T19:26:50Z)
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
- OVMR: Open-Vocabulary Recognition with Multi-Modal References [96.21248144937627]
Existing works have proposed different methods to embed category cues into the model, e.g., through few-shot fine-tuning.
This paper tackles open-vocabulary recognition from a different perspective by referring to multi-modal clues composed of textual descriptions and exemplar images.
The proposed OVMR is a plug-and-play module, and works well with exemplar images randomly crawled from the Internet.
arXiv Detail & Related papers (2024-06-07T06:45:28Z)
- FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark.
FineMatch focuses on text and image mismatch detection and correction.
We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.