DeiSAM: Segment Anything with Deictic Prompting
- URL: http://arxiv.org/abs/2402.14123v2
- Date: Thu, 05 Dec 2024 13:15:34 GMT
- Title: DeiSAM: Segment Anything with Deictic Prompting
- Authors: Hikaru Shindo, Manuel Brack, Gopika Sudhakaran, Devendra Singh Dhami, Patrick Schramowski, Kristian Kersting
- Abstract summary: DeiSAM is a combination of large pre-trained neural networks with differentiable logic reasoners.
It segments objects by matching them to logically inferred image regions.
Our empirical results demonstrate that DeiSAM is a substantial improvement over purely data-driven baselines.
- Score: 26.38776252198988
- License:
- Abstract: Large-scale, pre-trained neural networks have demonstrated strong capabilities in various tasks, including zero-shot image segmentation. To identify concrete objects in complex scenes, humans instinctively rely on deictic descriptions in natural language, i.e., referring to something depending on the context, such as "The object that is on the desk and behind the cup". However, deep learning approaches cannot reliably interpret such deictic representations due to their lack of reasoning capabilities in complex scenarios. To remedy this issue, we propose DeiSAM -- a combination of large pre-trained neural networks with differentiable logic reasoners -- for deictic promptable segmentation. Given a complex, textual segmentation description, DeiSAM leverages Large Language Models (LLMs) to generate first-order logic rules and performs differentiable forward reasoning on generated scene graphs. Subsequently, DeiSAM segments objects by matching them to the logically inferred image regions. As part of our evaluation, we propose the Deictic Visual Genome (DeiVG) dataset, containing paired visual input and complex, deictic textual prompts. Our empirical results demonstrate that DeiSAM is a substantial improvement over purely data-driven baselines for deictic promptable segmentation.
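The abstract describes a pipeline: an LLM turns the deictic prompt into a first-order rule, the rule is evaluated by forward reasoning over a scene graph, and the matched region is handed to a promptable segmenter. Below is a minimal illustrative sketch of that matching step; the helper names, the hard-coded rule, and the toy scene graph are hypothetical and are not DeiSAM's actual code or API.

```python
# Illustrative sketch only (hypothetical helpers, not DeiSAM's actual API):
# deictic prompt -> first-order rule -> reasoning over a scene graph -> matched objects.

def generate_rule(prompt: str) -> str:
    """Stand-in for the LLM step that turns a deictic prompt into a logic rule.
    Here the rule for the abstract's running example is simply hard-coded."""
    # "The object that is on the desk and behind the cup"
    return "target(X) :- on(X, Y), type(Y, desk), behind(X, Z), type(Z, cup)."

def forward_reason(rule: str, graph: dict) -> list:
    """Stand-in for (differentiable) forward reasoning: return ids of objects
    whose scene-graph relations satisfy the hard-coded rule body above."""
    types, on, behind = graph["type"], graph["on"], graph["behind"]
    return [
        x for x in types
        if any(types.get(y) == "desk" for y in on.get(x, []))
        and any(types.get(z) == "cup" for z in behind.get(x, []))
    ]

# Toy scene graph: object 3 (a book) is on the desk (1) and behind the cup (2).
graph = {
    "type": {1: "desk", 2: "cup", 3: "book"},
    "on": {3: [1], 2: [1]},
    "behind": {3: [2]},
}

rule = generate_rule("The object that is on the desk and behind the cup")
matched = forward_reason(rule, graph)
print(matched)  # -> [3]; in the full system this region would be passed to SAM for the mask
```

In DeiSAM itself the reasoning step is differentiable, so gradients can flow through the rule evaluation; the sketch above only mimics the discrete matching behaviour.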
Related papers
- NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning [22.60247555240363]
This paper explores the challenges faced by methods on tasks that require reasoning akin to human cognition.
We propose NAVER, a compositional visual grounding method that integrates explicit probabilistic logic reasoning.
Our results show that NAVER achieves state-of-the-art performance compared to recent end-to-end and compositional baselines.
arXiv Detail & Related papers (2025-02-01T09:19:08Z) - SAM-Aware Graph Prompt Reasoning Network for Cross-Domain Few-Shot Segmentation [25.00605325290872]
We propose a SAM-aware graph prompt reasoning network (GPRN) to guide CD-FSS feature representation learning.
GPRN transforms masks generated by SAM into visual prompts enriched with high-level semantic information.
We show that our method establishes new state-of-the-art results.
arXiv Detail & Related papers (2024-12-31T06:38:49Z) - VISA: Reasoning Video Object Segmentation via Large Language Models [64.33167989521357]
We introduce a new task, Reasoning Video Object Segmentation (ReasonVOS).
This task aims to generate a sequence of segmentation masks in response to implicit text queries that require complex reasoning abilities.
We introduce VISA (Video-based large language Instructed Assistant) to tackle ReasonVOS.
arXiv Detail & Related papers (2024-07-16T02:29:29Z) - Visual Prompt Selection for In-Context Learning Segmentation [77.15684360470152]
In this paper, we focus on rethinking and improving the example selection strategy.
We first demonstrate that ICL-based segmentation models are sensitive to different contexts.
Furthermore, empirical evidence indicates that the diversity of contextual prompts plays a crucial role in guiding segmentation.
arXiv Detail & Related papers (2024-07-14T15:02:54Z) - VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models to enhance in-context emotion classification.
In the first stage, we propose prompting VLLMs to generate descriptions in natural language of the subject's apparent emotion.
In the second stage, the descriptions are used as contextual information and, along with the image input, are used to train a transformer-based architecture.
arXiv Detail & Related papers (2024-04-10T15:09:15Z) - CoReS: Orchestrating the Dance of Reasoning and Segmentation [17.767049542947497]
We believe that the act of reasoning segmentation should mirror the cognitive stages of human visual search.
We introduce the Chains of Reasoning and Segmenting (CoReS) and find this top-down visual hierarchy indeed enhances the visual search process.
Experiments demonstrate the superior performance of our CoReS, which surpasses the state-of-the-art method by 6.5% on the ReasonSeg dataset.
arXiv Detail & Related papers (2024-04-08T16:55:39Z) - LOGICSEG: Parsing Visual Semantics with Neural Logic Learning and
Reasoning [73.98142349171552]
LOGICSEG is a holistic visual semantic parser that integrates neural inductive learning and logic reasoning with both rich data and symbolic knowledge.
During fuzzy logic-based continuous relaxation, logical formulae are grounded onto data and neural computational graphs, hence enabling logic-induced network training (a generic sketch of this relaxation idea appears after this list).
These designs together make LOGICSEG a general and compact neural-logic machine that is readily integrated into existing segmentation models.
arXiv Detail & Related papers (2023-09-24T05:43:19Z) - LISA: Reasoning Segmentation via Large Language Model [68.24075852136761]
We propose a new segmentation task -- reasoning segmentation.
The task is designed to output a segmentation mask given a complex and implicit query text.
We present LISA: Large Language Instructed Segmentation Assistant, which inherits the language generation capabilities of multimodal Large Language Models.
arXiv Detail & Related papers (2023-08-01T17:50:17Z) - Self-Supervised Visual Representation Learning with Semantic Grouping [50.14703605659837]
We tackle the problem of learning visual representations from unlabeled scene-centric data.
We propose contrastive learning from data-driven semantic slots, namely SlotCon, for joint semantic grouping and representation learning.
arXiv Detail & Related papers (2022-05-30T17:50:59Z)
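Several of the entries above (most explicitly the LOGICSEG summary, which refers to this sketch) relax discrete logical formulae into differentiable operations so that rules can be grounded onto neural computation graphs. The following is a generic illustration of that idea using a product t-norm and the Reichenbach implication; the rule, class names, and numbers are made up and do not reflect LOGICSEG's actual formulation.

```python
import torch

# Generic fuzzy-logic relaxation (hypothetical example, not LOGICSEG's code):
# the rule cat(x) -> animal(x) becomes a differentiable penalty on predictions.

def fuzzy_and(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return a * b                       # conjunction via product t-norm

def fuzzy_implies(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return 1.0 - a + fuzzy_and(a, b)   # Reichenbach implication a -> b

# Toy per-pixel class probabilities from a segmentation head; columns: [cat, dog, animal]
probs = torch.tensor([[0.8, 0.1, 0.6],
                      [0.2, 0.7, 0.9]], requires_grad=True)

satisfaction = fuzzy_implies(probs[:, 0], probs[:, 2])  # degree to which the rule holds per pixel
logic_loss = (1.0 - satisfaction).mean()                # violated rules contribute to the loss
logic_loss.backward()                                   # gradients reach the network outputs
print(float(logic_loss))  # ~0.17 for the toy numbers above
```

Adding such a term alongside the usual segmentation objective is one common way to realise logic-induced network training.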