LlamaSeg: Image Segmentation via Autoregressive Mask Generation
- URL: http://arxiv.org/abs/2505.19422v1
- Date: Mon, 26 May 2025 02:22:41 GMT
- Title: LlamaSeg: Image Segmentation via Autoregressive Mask Generation
- Authors: Jiru Deng, Tengjin Weng, Tianyu Yang, Wenhan Luo, Zhiheng Li, Wenhao Jiang
- Abstract summary: We present LlamaSeg, a visual autoregressive framework that unifies multiple image segmentation tasks via natural language instructions. We reformulate image segmentation as a visual generation problem, representing masks as "visual" tokens and employing a LLaMA-style Transformer to predict them directly from image inputs.
- Score: 46.17509085054758
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present LlamaSeg, a visual autoregressive framework that unifies multiple image segmentation tasks via natural language instructions. We reformulate image segmentation as a visual generation problem, representing masks as "visual" tokens and employing a LLaMA-style Transformer to predict them directly from image inputs. By adhering to the next-token prediction paradigm, our approach naturally integrates segmentation tasks into autoregressive architectures. To support large-scale training, we introduce a data annotation pipeline and construct the SA-OVRS dataset, which contains 2M segmentation masks annotated with over 5,800 open-vocabulary labels or diverse textual descriptions, covering a wide spectrum of real-world scenarios. This enables our model to localize objects in images based on text prompts and to generate fine-grained masks. To more accurately evaluate the quality of masks produced by visual generative models, we further propose a composite metric that combines Intersection over Union (IoU) with Average Hausdorff Distance (AHD), offering a more precise assessment of contour fidelity. Experimental results demonstrate that our method surpasses existing generative models across multiple datasets and yields more detailed segmentation masks.
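The composite evaluation metric is only described at a high level in the abstract. The following is a minimal sketch, not the authors' implementation, of how IoU and Average Hausdorff Distance (AHD) could be computed for binary masks and combined into one score; the weighting `alpha` and the 1/(1+AHD) normalization are illustrative assumptions rather than the paper's definition.

```python
# Sketch only: IoU + Average Hausdorff Distance (AHD) for binary masks.
# The exact combination used by LlamaSeg is not specified in the abstract;
# alpha and the 1/(1+AHD) normalization below are assumptions.
import numpy as np
from scipy.spatial import cKDTree

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over Union of two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / union if union > 0 else 1.0

def average_hausdorff(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric average Hausdorff distance between foreground pixel sets."""
    p = np.argwhere(pred)  # coordinates of predicted foreground pixels
    g = np.argwhere(gt)    # coordinates of ground-truth foreground pixels
    if len(p) == 0 or len(g) == 0:
        return float("inf")
    d_pg, _ = cKDTree(g).query(p)  # each predicted pixel -> nearest gt pixel
    d_gp, _ = cKDTree(p).query(g)  # each gt pixel -> nearest predicted pixel
    return 0.5 * (d_pg.mean() + d_gp.mean())

def composite_score(pred: np.ndarray, gt: np.ndarray, alpha: float = 0.5) -> float:
    """Illustrative combination: high IoU and low AHD both raise the score."""
    return alpha * iou(pred, gt) + (1 - alpha) / (1.0 + average_hausdorff(pred, gt))
```

The intuition from the abstract is that IoU measures region overlap while AHD penalizes contour deviations, so combining them rewards masks whose boundaries, not just areas, match the ground truth.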
Related papers
- Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval [13.296362770269452]
Mask-aware Text-to-Image Retrieval (MaTIR) aims to find relevant images based on a textual query. We propose a two-stage framework, comprising a first stage for segmentation-aware image retrieval and a second stage for reranking and object grounding. We evaluate our approach on the COCO and D$^3$ datasets, demonstrating significant improvements in both retrieval accuracy and segmentation quality over previous methods.
arXiv Detail & Related papers (2025-06-28T12:19:49Z) - HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model [6.641903410779405]
We propose the Hierarchical Mask Tokenizer (HiMTok), which represents segmentation masks with up to 32 tokens. HiMTok allows for compact and coarse-to-fine mask representations, aligning well with the next-token-prediction paradigm. We develop a 3-stage training recipe for progressive learning of segmentation and visual capabilities, featuring a hierarchical mask loss for effective coarse-to-fine learning.
arXiv Detail & Related papers (2025-03-17T10:29:08Z) - Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing [22.729750410621826]
GeoPix is an RS MLLM that extends image understanding capabilities to the pixel level. To facilitate the segmentation of multi-scale objects in RS imagery, a class-wise learnable memory module is integrated into the mask predictor. To address the absence of large-scale datasets for training pixel-level RS MLLMs, we construct the GeoPixInstruct dataset.
arXiv Detail & Related papers (2025-01-12T14:45:27Z) - Semantic Alignment for Multimodal Large Language Models [72.10272479476161]
We introduce Semantic Alignment for Multi-modal large language models (SAM).
By involving the bidirectional semantic guidance between different images in the visual-token extraction process, SAM aims to enhance the preservation of linking information for coherent analysis.
arXiv Detail & Related papers (2024-08-23T06:48:46Z) - Generalizable Entity Grounding via Assistance of Large Language Model [77.07759442298666]
We propose a novel approach to densely ground visual entities from a long caption.
We leverage a large multimodal model to extract semantic nouns, a class-agnostic segmentation model to generate entity-level segmentation, and a multi-modal feature fusion module to associate each semantic noun with its corresponding segmentation mask.
arXiv Detail & Related papers (2024-02-04T16:06:05Z) - UniGS: Unified Representation for Image Generation and Segmentation [105.08152635402858]
We use a colormap to represent entity-level masks, addressing the challenge of varying entity numbers.
Two novel modules, including the location-aware color palette and progressive dichotomy module, are proposed to support our mask representation.
arXiv Detail & Related papers (2023-12-04T15:59:27Z) - High-Quality Entity Segmentation [110.55724145851725]
CropFormer is designed to tackle the intractability of instance-level segmentation on high-resolution images.
It improves mask prediction by fusing high-resolution image crops, which provide finer-grained image details, with the full image.
With CropFormer, we achieve a significant AP gain of $1.9$ on the challenging entity segmentation task.
arXiv Detail & Related papers (2022-11-10T18:58:22Z) - Autoregressive Unsupervised Image Segmentation [8.894935073145252]
We propose a new unsupervised image segmentation approach based on mutual information between different views constructed from the inputs.
The proposed method outperforms the current state of the art on unsupervised image segmentation.
arXiv Detail & Related papers (2020-07-16T10:47:40Z)