LISA: Reasoning Segmentation via Large Language Model
- URL: http://arxiv.org/abs/2308.00692v3
- Date: Wed, 1 May 2024 05:10:13 GMT
- Title: LISA: Reasoning Segmentation via Large Language Model
- Authors: Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, Jiaya Jia
- Abstract summary: We propose a new segmentation task -- reasoning segmentation.
The task is designed to output a segmentation mask given a complex and implicit query text.
We present LISA: Large Language Instructed Segmentation Assistant, which inherits the language generation capabilities of multimodal Large Language Models.
- Score: 68.24075852136761
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Although perception systems have made remarkable advancements in recent years, they still rely on explicit human instruction or pre-defined categories to identify the target objects before executing visual recognition tasks. Such systems cannot actively reason and comprehend implicit user intention. In this work, we propose a new segmentation task -- reasoning segmentation. The task is designed to output a segmentation mask given a complex and implicit query text. Furthermore, we establish a benchmark comprising over one thousand image-instruction-mask data samples, incorporating intricate reasoning and world knowledge for evaluation purposes. Finally, we present LISA: Large Language Instructed Segmentation Assistant, which inherits the language generation capabilities of multimodal Large Language Models (LLMs) while also possessing the ability to produce segmentation masks. We expand the original vocabulary with a <SEG> token and propose the embedding-as-mask paradigm to unlock the segmentation capability. Remarkably, LISA can handle cases involving complex reasoning and world knowledge. Also, it demonstrates robust zero-shot capability when trained exclusively on reasoning-free datasets. In addition, fine-tuning the model with merely 239 reasoning segmentation data samples results in further performance enhancement. Both quantitative and qualitative experiments show our method effectively unlocks new reasoning segmentation capabilities for multimodal LLMs. Code, models, and data are available at https://github.com/dvlab-research/LISA.
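The embedding-as-mask idea described in the abstract (adding a <SEG> token to the vocabulary and decoding its hidden state into a mask) can be sketched roughly as follows. This is a minimal, illustrative PyTorch sketch: the class name, the projection layer, and the tensor shapes are assumptions made for exposition, not the released LISA implementation (see the repository linked above for the actual code).

```python
import torch
import torch.nn as nn

class EmbeddingAsMaskHead(nn.Module):
    """Sketch: project the <SEG> token's hidden state into a prompt embedding
    for a promptable mask decoder (e.g. a SAM-style decoder)."""

    def __init__(self, llm_hidden_dim: int, decoder_dim: int):
        super().__init__()
        # Maps the LLM's hidden state to the embedding space the decoder expects.
        self.proj = nn.Linear(llm_hidden_dim, decoder_dim)

    def forward(self, llm_hidden_states, seg_token_mask, image_features, mask_decoder):
        # llm_hidden_states: (batch, seq_len, llm_hidden_dim), last-layer states of the multimodal LLM
        # seg_token_mask:    (batch, seq_len) boolean mask marking positions of the <SEG> token
        seg_embeddings = llm_hidden_states[seg_token_mask]    # (num_seg_tokens, llm_hidden_dim)
        prompt_embeddings = self.proj(seg_embeddings)         # (num_seg_tokens, decoder_dim)
        # The projected embedding acts as the prompt; the decoder combines it
        # with dense image features to predict a segmentation mask.
        return mask_decoder(image_features, prompt_embeddings)
```

In practice, the tokenizer vocabulary is first expanded with the <SEG> token and the LLM's embedding matrix is resized accordingly, so the model can emit <SEG> as part of its generated answer.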
Related papers
- ViLLa: Video Reasoning Segmentation with Large Language Model [48.75470418596875]
We propose a new video segmentation task - video reasoning segmentation.
The task is designed to output tracklets of segmentation masks given a complex input text query.
We present ViLLa: Video reasoning segmentation with a Large Language Model.
arXiv Detail & Related papers (2024-07-18T17:59:17Z)
- VISA: Reasoning Video Object Segmentation via Large Language Models [64.33167989521357]
We introduce a new task, Reasoning Video Object Segmentation (ReasonVOS).
This task aims to generate a sequence of segmentation masks in response to implicit text queries that require complex reasoning abilities.
We introduce VISA (Video-based large language Instructed Assistant) to tackle ReasonVOS.
arXiv Detail & Related papers (2024-07-16T02:29:29Z)
- LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning [8.379286663107845]
Reasoning segmentation is a novel task that enables segmentation systems to reason about and interpret implicit user intention.
Our work on reasoning segmentation contributes to both the methodological design and the dataset labeling.
arXiv Detail & Related papers (2024-04-12T18:45:51Z)
- LaSagnA: Language-based Segmentation Assistant for Complex Queries [39.620806493454616]
Large Language Models for Vision (vLLMs) generate detailed perceptual outcomes, including bounding boxes and masks.
In this study, we acknowledge that the main cause of these problems is the insufficient complexity of training queries.
We present three novel strategies to effectively handle the challenges arising from the direct integration of the proposed format.
arXiv Detail & Related papers (2024-04-12T14:40:45Z)
- DeiSAM: Segment Anything with Deictic Prompting [27.960890657540443]
DeiSAM is a combination of large pre-trained neural networks with differentiable logic reasoners.
It segments objects by matching them to logically inferred image regions.
Our empirical results demonstrate that DeiSAM is a substantial improvement over purely data-driven baselines.
arXiv Detail & Related papers (2024-02-21T20:43:49Z)
- LISA++: An Improved Baseline for Reasoning Segmentation with Large Language Model [54.850048630298495]
We introduce LISA++, an update to the existing LISA model, focusing on improving core functionalities while keeping the base architecture intact.
Instance segmentation ability has been added, providing more detailed scene analysis alongside the existing multi-region semantic segmentation.
These improvements are achieved by curating existing samples from generic segmentation datasets specifically to enhance segmentation and conversational skills, without structural changes or additional data sources.
arXiv Detail & Related papers (2023-12-28T18:58:33Z)
- GSVA: Generalized Segmentation via Multimodal Large Language Models [72.57095903188922]
Generalized Referring Expression Segmentation (GRES) extends the scope of classic RES to refer to multiple objects in one expression or to identify empty targets absent from the image.
Current solutions to GRES remain unsatisfactory, since segmentation MLLMs cannot correctly handle cases where users reference multiple subjects in a single prompt.
We propose Generalized Vision Assistant (GSVA) to address this gap.
arXiv Detail & Related papers (2023-12-15T02:54:31Z)
- Exploring the Potential of Large Language Models in Computational Argumentation [54.85665903448207]
Large language models (LLMs) have demonstrated impressive capabilities in understanding context and generating natural language.
This work assesses LLMs such as ChatGPT, Flan models, and LLaMA2 models in both zero-shot and few-shot settings.
arXiv Detail & Related papers (2023-11-15T15:12:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.