HyperSeg: Towards Universal Visual Segmentation with Large Language Model
- URL: http://arxiv.org/abs/2411.17606v2
- Date: Mon, 02 Dec 2024 03:19:04 GMT
- Title: HyperSeg: Towards Universal Visual Segmentation with Large Language Model
- Authors: Cong Wei, Yujie Zhong, Haoxian Tan, Yong Liu, Zheng Zhao, Jie Hu, Yujiu Yang,
- Abstract summary: This paper aims to address universal segmentation for image and video perception with the strong reasoning ability empowered by Visual Large Language Models (VLLMs)
We propose HyperSeg, the first VLLM-based universal segmentation model for pixel-level image and video perception.
- Score: 40.326267496697845
- License:
- Abstract: This paper aims to address universal segmentation for image and video perception with the strong reasoning ability empowered by Visual Large Language Models (VLLMs). Despite significant progress in current unified segmentation methods, limitations in adaptation to both image and video scenarios, as well as the complex reasoning segmentation, make it difficult for them to handle various challenging instructions and achieve an accurate understanding of fine-grained vision-language correlations. We propose HyperSeg, the first VLLM-based universal segmentation model for pixel-level image and video perception, encompassing generic segmentation tasks and more complex reasoning perception tasks requiring powerful reasoning abilities and world knowledge. Besides, to fully leverage the recognition capabilities of VLLMs and the fine-grained visual information, HyperSeg incorporates hybrid entity recognition and fine-grained visual perceiver modules for various segmentation tasks. Combined with the temporal adapter, HyperSeg achieves a comprehensive understanding of temporal information. Experimental results validate the effectiveness of our insights in resolving universal image and video segmentation tasks, including the more complex reasoning perception tasks. Our code is available.
Related papers
- Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models [50.98559225639266]
We investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories.
Our findings reveal that multilayer features provide complementary strengths with varying task dependencies, and uniform fusion leads to suboptimal performance.
We propose the instruction-guided vision aggregator, a module that dynamically integrates multi-layer visual features based on textual instructions.
arXiv Detail & Related papers (2024-12-26T05:41:31Z) - InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models [37.43195217391341]
In this paper, we define the union of segmentation and reasoning segmentation at both the image and video levels as Instructed Visual (IVS)
Specifically, we employ an object-aware video perceiver to extract temporal and object information from reference frames, facilitating comprehensive video understanding.
By leveraging multi-task and end-to-end training, InstructSeg demonstrates superior performance across diverse image and video segmentation tasks.
arXiv Detail & Related papers (2024-12-18T16:20:40Z) - One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos [41.34787907803329]
VideoLISA is a video-based multimodal large language model designed to tackle the problem of language-instructed reasoning segmentation in videos.
VideoLISA generates temporally consistent segmentation masks in videos based on language instructions.
arXiv Detail & Related papers (2024-09-29T07:47:15Z) - VISA: Reasoning Video Object Segmentation via Large Language Models [64.33167989521357]
We introduce a new task, Reasoning Video Object (ReasonVOS)
This task aims to generate a sequence of segmentation masks in response to implicit text queries that require complex reasoning abilities.
We introduce VISA (Video-based large language Instructed Assistant) to tackle ReasonVOS.
arXiv Detail & Related papers (2024-07-16T02:29:29Z) - OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding [112.87441334765693]
OMG-LLaVA is a new framework combining powerful pixel-level vision understanding with reasoning abilities.
It can accept various visual and text prompts for flexible user interaction.
OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model.
arXiv Detail & Related papers (2024-06-27T17:59:01Z) - Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z) - General Object Foundation Model for Images and Videos at Scale [99.2806103051613]
We present GLEE, an object-level foundation model for locating and identifying objects in images and videos.
GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in the open world scenario.
We employ an image encoder, text encoder, and visual prompter to handle multi-modal inputs, enabling to simultaneously solve various object-centric downstream tasks.
arXiv Detail & Related papers (2023-12-14T17:26:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.