Leveraging Hallucinations to Reduce Manual Prompt Dependency in Promptable Segmentation
- URL: http://arxiv.org/abs/2408.15205v3
- Date: Fri, 22 Nov 2024 17:27:31 GMT
- Title: Leveraging Hallucinations to Reduce Manual Prompt Dependency in Promptable Segmentation
- Authors: Jian Hu, Jiayi Lin, Junchi Yan, Shaogang Gong
- Abstract summary: We introduce an iterative Prompt-Mask Cycle generation framework (ProMaC) with a prompt generator and a mask generator.
The prompt generator uses a multi-scale chain of thought prompting, initially exploring hallucinations for extracting extended contextual knowledge on a test image.
The generated masks iteratively induce the prompt generator to focus more on task-relevant image areas and reduce irrelevant hallucinations, resulting jointly in better prompts and masks.
- Score: 74.04806143723597
- Abstract: Promptable segmentation typically requires instance-specific manual prompts to guide the segmentation of each desired object. To minimize such a need, task-generic promptable segmentation has been introduced, which employs a single task-generic prompt to segment various images of different objects in the same task. Current methods use Multimodal Large Language Models (MLLMs) to reason detailed instance-specific prompts from a task-generic prompt to improve segmentation accuracy. The effectiveness of this segmentation heavily depends on the precision of these derived prompts. However, MLLMs often suffer from hallucinations during reasoning, resulting in inaccurate prompting. While existing methods focus on eliminating hallucinations to improve a model, we argue that MLLM hallucinations can reveal valuable contextual insights when leveraged correctly, as they represent pre-trained large-scale knowledge beyond individual images. In this paper, we utilize hallucinations to mine task-related information from images and verify its accuracy to enhance the precision of the generated prompts. Specifically, we introduce an iterative Prompt-Mask Cycle generation framework (ProMaC) with a prompt generator and a mask generator. The prompt generator uses multi-scale chain-of-thought prompting, initially exploring hallucinations to extract extended contextual knowledge on a test image. These hallucinations are then reduced to formulate precise instance-specific prompts, directing the mask generator to produce masks that are consistent with task semantics via mask semantic alignment. The generated masks iteratively induce the prompt generator to focus more on task-relevant image areas and reduce irrelevant hallucinations, resulting jointly in better prompts and masks. Experiments on 5 benchmarks demonstrate the effectiveness of ProMaC. Code is available at https://lwpyh.github.io/ProMaC/.
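The prompt-mask cycle above can be sketched as a simple loop: the prompt generator first reasons broadly (including hallucinated context), the mask generator segments, and each new mask narrows the next round of prompting. The sketch below is a toy, runnable illustration under stated assumptions; `PromptGen` and `MaskGen` are hypothetical stand-ins, not the authors' actual models or API.

```python
# Toy sketch of the ProMaC prompt-mask cycle. PromptGen/MaskGen are
# illustrative stubs, not the paper's MLLM prompt generator or segmenter.

class PromptGen:
    def reason(self, image, task_prompt, prior_mask=None):
        # Stand-in for multi-scale chain-of-thought reasoning: with no
        # mask, hallucinated context spans the whole image; with a mask,
        # the instance-specific prompt narrows to the task-relevant region.
        region = "whole image" if prior_mask is None else "masked region"
        return f"{task_prompt} ({region})"

class MaskGen:
    def segment(self, image, inst_prompt):
        # Stand-in for a promptable segmenter with mask semantic
        # alignment; here it simply thresholds a toy float image.
        return [[1 if px > 0.5 else 0 for px in row] for row in image]

def promac_cycle(image, task_prompt, prompt_gen, mask_gen, n_iters=3):
    """Iteratively refine instance-specific prompts and masks."""
    mask = None  # first pass explores hallucinations without a mask prior
    for _ in range(n_iters):
        inst_prompt = prompt_gen.reason(image, task_prompt, prior_mask=mask)
        mask = mask_gen.segment(image, inst_prompt)
    return inst_prompt, mask

image = [[0.2, 0.8], [0.9, 0.1]]
prompt, mask = promac_cycle(image, "camouflaged animal", PromptGen(), MaskGen())
```

After the first iteration the generated mask feeds back into the prompt generator, which is the mechanism the abstract credits with suppressing irrelevant hallucinations over successive cycles.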
Related papers
- INT: Instance-Specific Negative Mining for Task-Generic Promptable Segmentation [31.734740711205227]
We introduce Instance-specific Negative Mining for Task-Generic promptable segmentation (INT).
INT consists of two components: (1) instance-specific prompt generation, which progressively filters out incorrect information during prompt generation; (2) semantic mask generation, which ensures each instance segmentation correctly matches the semantics of its instance-specific prompt.
INT is validated on six datasets, including camouflaged objects and medical images, demonstrating its effectiveness, robustness and scalability.
arXiv Detail & Related papers (2025-01-30T21:07:14Z) - Prompt-Guided Mask Proposal for Two-Stage Open-Vocabulary Segmentation [21.30568336073013]
We tackle the challenge of open-vocabulary segmentation, where we need to identify objects from a wide range of categories in different environments.
Existing methods often use multi-modal models like CLIP, which combine image and text features in a shared embedding space.
We propose Prompt-guided Mask Proposal (PMP) where the mask generator takes the input text prompts and generates masks guided by these prompts.
arXiv Detail & Related papers (2024-12-13T17:22:50Z) - ForgeryGPT: Multimodal Large Language Model For Explainable Image Forgery Detection and Localization [49.12958154544838]
ForgeryGPT is a novel framework that advances the Image Forgery Detection and Localization task.
It captures high-order correlations of forged images from diverse linguistic feature spaces.
It enables explainable generation and interactive dialogue through a newly customized Large Language Model (LLM) architecture.
arXiv Detail & Related papers (2024-10-14T07:56:51Z) - GSVA: Generalized Segmentation via Multimodal Large Language Models [72.57095903188922]
Generalized Referring Expression (GRES) extends the scope of classic RES to refer to multiple objects in one expression or identify the empty targets absent in the image.
Current solutions to GRES remain unsatisfactory since segmentation MLLMs cannot correctly handle the cases where users might reference multiple subjects in a singular prompt.
We propose Generalized Vision Assistant (GSVA) to address this gap.
arXiv Detail & Related papers (2023-12-15T02:54:31Z) - Relax Image-Specific Prompt Requirement in SAM: A Single Generic Prompt for Segmenting Camouflaged Objects [32.14438610147615]
We introduce a per-instance test-time adaptation mechanism called Generalizable SAM (GenSAM) to automatically generate and optimize visual prompts.
Experiments on three benchmarks demonstrate that GenSAM outperforms point supervision approaches.
arXiv Detail & Related papers (2023-12-12T15:43:36Z) - Segment (Almost) Nothing: Prompt-Agnostic Adversarial Attacks on Segmentation Models [61.46999584579775]
General purpose segmentation models are able to generate (semantic) segmentation masks from a variety of prompts.
In particular, input images are pre-processed by an image encoder to obtain embedding vectors which are later used for mask predictions.
We show that even imperceptible perturbations of radius $\epsilon=1/255$ are often sufficient to drastically modify the masks predicted with point, box and text prompts.
arXiv Detail & Related papers (2023-11-24T12:57:34Z) - LISA: Reasoning Segmentation via Large Language Model [68.24075852136761]
We propose a new segmentation task -- reasoning segmentation.
The task is designed to output a segmentation mask given a complex and implicit query text.
We present LISA: Large Language Instructed Segmentation Assistant, which inherits the language generation capabilities of multimodal Large Language Models.
arXiv Detail & Related papers (2023-08-01T17:50:17Z) - Position-based Prompting for Health Outcome Generation [0.0]
We explore an idea of using a position-attention mechanism to capture positional information of each word in a prompt relative to the mask to be filled.
Our approach consistently outperforms a baseline in which the default masked language model (MLM) representation is used to predict masked tokens.
arXiv Detail & Related papers (2022-03-30T16:44:04Z) - Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling [61.03262873980619]
Open-vocabulary instance segmentation aims at segmenting novel classes without mask annotations.
We propose a cross-modal pseudo-labeling framework, which generates training pseudo masks by aligning word semantics in captions with visual features of object masks in images.
Our framework is capable of labeling novel classes in captions via their word semantics to self-train a student model.
arXiv Detail & Related papers (2021-11-24T18:50:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.