Tokenize Anything via Prompting
- URL: http://arxiv.org/abs/2312.09128v2
- Date: Wed, 17 Jul 2024 04:34:37 GMT
- Title: Tokenize Anything via Prompting
- Authors: Ting Pan, Lulu Tang, Xinlong Wang, Shiguang Shan
- Abstract summary: We present a unified, promptable model capable of simultaneously segmenting, recognizing, and captioning anything.
We train a generalizable model with massive segmentation masks, e.g., SA-1B masks, and semantic priors from a pre-trained CLIP model with 5 billion parameters.
We believe this model can be a versatile region-level image tokenizer, capable of encoding general-purpose region context.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a unified, promptable model capable of simultaneously segmenting, recognizing, and captioning anything. Unlike SAM, we aim to build a versatile region representation in the wild via visual prompting. To achieve this, we train a generalizable model with massive segmentation masks, \eg, SA-1B masks, and semantic priors from a pre-trained CLIP model with 5 billion parameters. Specifically, we construct a promptable image decoder by adding a semantic token to each mask token. The semantic token is responsible for learning the semantic priors in a predefined concept space. Through joint optimization of segmentation on mask tokens and concept prediction on semantic tokens, our model exhibits strong regional recognition and localization capabilities. For example, an additional 38M-parameter causal text decoder trained from scratch sets a new record with a CIDEr score of 164.7 on the Visual Genome region captioning task. We believe this model can be a versatile region-level image tokenizer, capable of encoding general-purpose region context for a broad range of visual perception tasks. Code and models are available at {\footnotesize \url{https://github.com/baaivision/tokenize-anything}}.
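The pairing of a mask token with a semantic token described in the abstract can be illustrated with a minimal sketch. Everything below (dimensions, random features, the concept space) is a hypothetical stand-in for exposition, not the paper's actual decoder or CLIP embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 32           # token embedding dimension (illustrative)
N_CONCEPTS = 10  # size of the predefined concept space (illustrative)
H = W = 8        # mask resolution for the sketch

# Hypothetical stand-ins: in the paper these come from the promptable
# image decoder; here they are random features for illustration.
mask_token = rng.normal(size=D)      # drives mask prediction
semantic_token = rng.normal(size=D)  # paired token for concept prediction

image_features = rng.normal(size=(H * W, D))      # per-pixel features
concept_space = rng.normal(size=(N_CONCEPTS, D))  # e.g. CLIP text embeddings

# Segmentation head: dot-product between the mask token and pixel features.
mask_logits = image_features @ mask_token  # (H*W,)
mask = (mask_logits > 0).reshape(H, W)     # binary mask

# Concept head: cosine similarity of the semantic token against the space.
sim = (concept_space @ semantic_token) / (
    np.linalg.norm(concept_space, axis=1) * np.linalg.norm(semantic_token)
)
predicted_concept = int(np.argmax(sim))

print(mask.shape, predicted_concept)
```

The two heads read off one localization output (the mask) and one recognition output (the concept) from the same prompted decoding step, which is the joint-optimization structure the abstract describes.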
Related papers
- MaskInversion: Localized Embeddings via Optimization of Explainability Maps [49.50785637749757]
MaskInversion generates a context-aware embedding for a query image region specified by a mask at test time.
It can be used for a broad range of tasks, including open-vocabulary class retrieval, referring expression comprehension, as well as for localized captioning and image generation.
arXiv Detail & Related papers (2024-07-29T14:21:07Z)
- Boosting Unsupervised Semantic Segmentation with Principal Mask Proposals [15.258631373740686]
Unsupervised semantic segmentation aims to automatically partition images into semantically meaningful regions by identifying global semantic categories within an image corpus without any form of annotation.
We present PriMaPs - Principal Mask Proposals - decomposing images into semantically meaningful masks based on their feature representation.
This allows us to realize unsupervised semantic segmentation by fitting class prototypes to PriMaPs with an expectation-maximization algorithm, PriMaPs-EM.
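Fitting class prototypes to mask-level features with an EM-style loop can be sketched as follows. The synthetic two-cluster features, hard assignments, and initialization are illustrative assumptions, not PriMaPs-EM itself:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical mask-level features: each row stands in for the pooled
# feature of one mask proposal, drawn from two well-separated clusters.
feats = np.vstack([
    rng.normal(loc=0.0, size=(20, 16)),
    rng.normal(loc=5.0, size=(20, 16)),
])

K = 2
# Initialize prototypes from two far-apart masks (for the sketch only).
prototypes = np.stack([feats[0], feats[-1]])

# Hard-assignment EM (k-means style): the E-step assigns each mask to its
# nearest prototype, the M-step recomputes prototypes as cluster means.
for _ in range(10):
    dists = ((feats[:, None, :] - prototypes[None]) ** 2).sum(-1)
    assign = dists.argmin(1)  # E-step
    for k in range(K):        # M-step
        if (assign == k).any():
            prototypes[k] = feats[assign == k].mean(0)

print(assign[:5], assign[-5:])
```

On these synthetic features the loop recovers the two underlying groups, mirroring how class prototypes are fitted to mask features without annotation.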
arXiv Detail & Related papers (2024-04-25T17:58:09Z)
- SA$^2$VP: Spatially Aligned-and-Adapted Visual Prompt [59.280491260635266]
Methods for visual prompt tuning follow the sequential modeling paradigm stemming from NLP.
The SA$^2$VP model learns a two-dimensional prompt token map of equal (or scaled) size to the image token map.
Our model can conduct individual prompting for different image tokens in a fine-grained manner.
arXiv Detail & Related papers (2023-12-16T08:23:43Z)
- Leveraging per Image-Token Consistency for Vision-Language Pre-training [52.825150269820696]
Cross-modal masked language modeling (CMLM) is insufficient for vision-language pre-training.
We propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training)
The proposed EPIC method is easily combined with pre-training methods.
arXiv Detail & Related papers (2022-11-20T12:10:53Z)
- Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP [45.81698881151867]
Open-vocabulary semantic segmentation aims to segment an image into semantic regions according to text descriptions, which may not have been seen during training.
Recent two-stage methods first generate class-agnostic mask proposals and then leverage pre-trained vision-language models, e.g., CLIP, to classify masked regions.
We propose to finetune CLIP on a collection of masked image regions and their corresponding text descriptions.
In particular, when trained on COCO and evaluated on ADE20K-150, our best model achieves 29.6% mIoU, which is +8.5% higher than the previous state-of-the-art.
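The two-stage open-vocabulary pipeline described above (class-agnostic mask proposals, then CLIP-based classification of the masked region) can be sketched minimally. All features here are random stand-ins; a real system would use a finetuned CLIP image and text encoder:

```python
import numpy as np

rng = np.random.default_rng(2)

D = 16
H = W = 4

# Hypothetical stand-ins for CLIP features.
pixel_feats = rng.normal(size=(H, W, D))
text_embeds = rng.normal(size=(3, D))  # e.g. embeddings of 3 class prompts

# Stage 1: a class-agnostic mask proposal (hard-coded for the sketch).
mask = np.zeros((H, W), dtype=bool)
mask[1:3, 1:3] = True

# Stage 2: pool image features over the masked region, then score the
# pooled embedding against each text embedding by cosine similarity.
region = pixel_feats[mask].mean(0)
scores = (text_embeds @ region) / (
    np.linalg.norm(text_embeds, axis=1) * np.linalg.norm(region)
)
label = int(scores.argmax())
print(scores.round(3), label)
```

The paper's contribution is to finetune CLIP on masked regions so that stage 2 scores masked crops well; the scoring mechanics themselves are this simple similarity lookup.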
arXiv Detail & Related papers (2022-10-09T02:57:32Z)
- SdAE: Self-distillated Masked Autoencoder [95.3684955370897]
This paper proposes SdAE, a self-distillated masked autoencoder network.
With only 300 epochs pre-training, a vanilla ViT-Base model achieves an 84.1% fine-tuning accuracy on ImageNet-1k classification.
arXiv Detail & Related papers (2022-07-31T15:07:25Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single-stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled keyword prediction.
We demonstrate top performance on a set of vision-language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression comprehension, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
- iBOT: Image BERT Pre-Training with Online Tokenizer [23.997853010642046]
We study masked image modeling (MIM) and indicate the advantages and challenges of using a semantically meaningful visual tokenizer.
We present a self-supervised framework iBOT that can perform masked prediction with an online tokenizer.
We show the prominence of iBOT by achieving an 81.6% linear probing accuracy and an 86.3% fine-tuning accuracy evaluated on ImageNet-1K.
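The "masked prediction with an online tokenizer" idea can be sketched as a cross-entropy between teacher token distributions and student predictions on masked patches, with the teacher updated by an exponential moving average. The tensors, temperature, and momentum value below are illustrative assumptions, not iBOT's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(3)

D, N = 8, 6  # patch embedding dimension, number of patches (illustrative)

# Hypothetical student/teacher patch embeddings; the real iBOT uses a
# ViT student and an EMA "online tokenizer" teacher.
student = rng.normal(size=(N, D))
teacher = rng.normal(size=(N, D))

def softmax(x, t=0.1):
    z = np.exp((x - x.max(-1, keepdims=True)) / t)
    return z / z.sum(-1, keepdims=True)

mask = np.array([True, False, True, False, False, True])  # masked patches

# MIM loss: cross-entropy between teacher token distributions (targets)
# and student predictions, computed only on masked positions.
targets = softmax(teacher[mask])
preds = softmax(student[mask])
loss = -(targets * np.log(preds + 1e-9)).sum(-1).mean()

# The EMA update is what makes the tokenizer "online": it evolves with
# the student rather than being fixed in advance.
momentum = 0.99
teacher = momentum * teacher + (1 - momentum) * student

print(round(float(loss), 4))
```

The key design choice this sketch highlights is that the target tokenizer is learned jointly with the model, rather than being a frozen, pre-trained discrete tokenizer.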
arXiv Detail & Related papers (2021-11-15T15:18:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.