DeiSAM: Segment Anything with Deictic Prompting
- URL: http://arxiv.org/abs/2402.14123v1
- Date: Wed, 21 Feb 2024 20:43:49 GMT
- Title: DeiSAM: Segment Anything with Deictic Prompting
- Authors: Hikaru Shindo, Manuel Brack, Gopika Sudhakaran, Devendra Singh Dhami,
Patrick Schramowski, Kristian Kersting
- Abstract summary: DeiSAM is a combination of large pre-trained neural networks with differentiable logic reasoners.
It segments objects by matching them to logically inferred image regions.
Our empirical results demonstrate that DeiSAM is a substantial improvement over purely data-driven baselines.
- Score: 27.960890657540443
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale, pre-trained neural networks have demonstrated strong
capabilities in various tasks, including zero-shot image segmentation. To
identify concrete objects in complex scenes, humans instinctively rely on
deictic descriptions in natural language, i.e., referring to something
depending on the context, such as "The object that is on the desk and behind the
cup." However, deep learning approaches cannot reliably interpret such deictic
representations due to their lack of reasoning capabilities in complex
scenarios. To remedy this issue, we propose DeiSAM -- a combination of large
pre-trained neural networks with differentiable logic reasoners -- for deictic
promptable segmentation. Given a complex, textual segmentation description,
DeiSAM leverages Large Language Models (LLMs) to generate first-order logic
rules and performs differentiable forward reasoning on generated scene graphs.
Subsequently, DeiSAM segments objects by matching them to the logically
inferred image regions. As part of our evaluation, we propose the Deictic
Visual Genome (DeiVG) dataset, containing paired visual input and complex,
deictic textual prompts. Our empirical results demonstrate that DeiSAM is a
substantial improvement over purely data-driven baselines for deictic
promptable segmentation.
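To make the described pipeline concrete, below is a minimal, purely illustrative Python sketch of the deictic idea; it is not the authors' implementation. A hand-written rule stands in for the LLM-generated first-order logic, plain set lookups on a toy scene graph stand in for differentiable forward reasoning, and the returned bounding box stands in for the image region that would be passed to a promptable segmenter such as SAM. All object ids, relations, and boxes are hypothetical toy values.

```python
# Illustrative sketch of deictic prompting over a scene graph (toy example).
# Prompt: "the object that is on the desk and behind the cup".
from typing import NamedTuple

class Triple(NamedTuple):
    subj: str
    rel: str
    obj: str

# Toy scene graph as (subject, relation, object) triples; ids and boxes are made up.
scene_graph = {
    Triple("book_1", "on", "desk_1"),
    Triple("book_1", "behind", "cup_1"),
    Triple("lamp_1", "on", "desk_1"),
}
boxes = {"book_1": (40, 60, 120, 140), "lamp_1": (200, 30, 260, 150)}

def target(x: str) -> bool:
    """Rule target(X) :- on(X, desk_1), behind(X, cup_1),
    evaluated here by exact lookup instead of a differentiable reasoner."""
    return (Triple(x, "on", "desk_1") in scene_graph
            and Triple(x, "behind", "cup_1") in scene_graph)

# "Reasoning" step: collect every entity that satisfies the rule.
entities = {t.subj for t in scene_graph} | {t.obj for t in scene_graph}
matches = [e for e in sorted(entities) if target(e)]

# Matching step: look up the region of each inferred entity; in the paper's
# setting this region would prompt the segmenter to produce the final mask.
for e in matches:
    print(e, "->", boxes.get(e))   # prints: book_1 -> (40, 60, 120, 140)
```

The sketch only shows the data flow (prompt-derived rule, reasoning over a scene graph, region hand-off); the paper's contribution is making the reasoning step differentiable so it can be trained end to end.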
Related papers
- VISA: Reasoning Video Object Segmentation via Large Language Models [64.33167989521357]
We introduce a new task, Reasoning Video Object Segmentation (ReasonVOS).
This task aims to generate a sequence of segmentation masks in response to implicit text queries that require complex reasoning abilities.
We introduce VISA (Video-based large language Instructed Assistant) to tackle ReasonVOS.
arXiv Detail & Related papers (2024-07-16T02:29:29Z)
- Visual Prompt Selection for In-Context Learning Segmentation [77.15684360470152]
In this paper, we focus on rethinking and improving the example selection strategy.
We first demonstrate that ICL-based segmentation models are sensitive to different contexts.
Furthermore, empirical evidence indicates that the diversity of contextual prompts plays a crucial role in guiding segmentation.
arXiv Detail & Related papers (2024-07-14T15:02:54Z)
- VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models to enhance in-context emotion classification.
In the first stage, we propose prompting VLLMs to generate descriptions in natural language of the subject's apparent emotion.
In the second stage, the descriptions are used as contextual information and, along with the image input, are used to train a transformer-based architecture.
arXiv Detail & Related papers (2024-04-10T15:09:15Z)
- CoReS: Orchestrating the Dance of Reasoning and Segmentation [17.767049542947497]
We believe that the act of reasoning segmentation should mirror the cognitive stages of human visual search.
We introduce the Chains of Reasoning and Segmenting (CoReS) and find this top-down visual hierarchy indeed enhances the visual search process.
Experiments demonstrate the superior performance of our CoReS, which surpasses the state-of-the-art method by 6.5% on the ReasonSeg dataset.
arXiv Detail & Related papers (2024-04-08T16:55:39Z)
- LOGICSEG: Parsing Visual Semantics with Neural Logic Learning and Reasoning [73.98142349171552]
LOGICSEG is a holistic visual semantic parser that integrates neural inductive learning and logic reasoning with both rich data and symbolic knowledge.
During fuzzy logic-based continuous relaxation, logical formulae are grounded onto data and neural computational graphs, enabling logic-induced network training (a toy sketch of this grounding idea appears after this list).
These designs together make LOGICSEG a general and compact neural-logic machine that is readily integrated into existing segmentation models.
arXiv Detail & Related papers (2023-09-24T05:43:19Z)
- Microscopy Image Segmentation via Point and Shape Regularized Data Synthesis [9.47802391546853]
We develop a unified pipeline for microscopy image segmentation using synthetically generated training data.
Our framework achieves comparable results to models trained on authentic microscopy images with dense labels.
arXiv Detail & Related papers (2023-08-18T22:00:53Z)
- LISA: Reasoning Segmentation via Large Language Model [68.24075852136761]
We propose a new segmentation task -- reasoning segmentation.
The task is designed to output a segmentation mask given a complex and implicit query text.
We present LISA: Large Language Instructed Segmentation Assistant, which inherits the language generation capabilities of multimodal Large Language Models.
arXiv Detail & Related papers (2023-08-01T17:50:17Z)
- Context Label Learning: Improving Background Class Representations in Semantic Segmentation [23.79946807540805]
We find that neural networks trained with a heterogeneous background class struggle to map the corresponding contextual samples to compact clusters in feature space.
We propose context label learning (CoLab) to improve the context representations by decomposing the background class into several subclasses.
The results demonstrate that CoLab can guide the segmentation model to map the logits of background samples away from the decision boundary.
arXiv Detail & Related papers (2022-12-16T11:52:15Z)
- Self-Supervised Visual Representation Learning with Semantic Grouping [50.14703605659837]
We tackle the problem of learning visual representations from unlabeled scene-centric data.
We propose contrastive learning from data-driven semantic slots, namely SlotCon, for joint semantic grouping and representation learning.
arXiv Detail & Related papers (2022-05-30T17:50:59Z)
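As referenced in the LOGICSEG entry above, the following toy Python example illustrates one common way a logical formula can be grounded onto soft network outputs via fuzzy relaxation so that it acts as a differentiable training signal. It is not LOGICSEG's code; the Reichenbach implication, the rule "tree(x) -> plant(x)", and the per-pixel probabilities are assumptions chosen only for illustration.

```python
# Toy fuzzy-relaxation example: turn the rule "tree(x) -> plant(x)" into a loss
# on a pixel's class probabilities, so violating the logic increases the loss.

def implies(a: float, b: float) -> float:
    """Reichenbach fuzzy implication: truth value of (a -> b) for soft inputs in [0, 1]."""
    return 1.0 - a + a * b

# Hypothetical per-pixel class probabilities from a softmax head.
p = {"tree": 0.9, "plant": 0.4, "building": 0.1}

# Grounded formula value and the corresponding differentiable penalty:
# the lower the truth of the rule, the larger the logic-induced loss term.
truth = implies(p["tree"], p["plant"])
logic_loss = 1.0 - truth
print(f"rule truth = {truth:.2f}, logic loss = {logic_loss:.2f}")
# rule truth = 0.46, logic loss = 0.54 -> training is pushed to raise p(plant).
```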