Generalized Decoding for Pixel, Image, and Language
- URL: http://arxiv.org/abs/2212.11270v1
- Date: Wed, 21 Dec 2022 18:58:41 GMT
- Title: Generalized Decoding for Pixel, Image, and Language
- Authors: Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li,
Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang,
Yong Jae Lee, Jianfeng Gao
- Abstract summary: We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly.
X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks.
- Score: 197.85760901840177
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present X-Decoder, a generalized decoding model that can predict
pixel-level segmentation and language tokens seamlessly. X-Decoder takes as
input two types of queries: (i) generic non-semantic queries and (ii) semantic
queries induced from text inputs, to decode different pixel-level and
token-level outputs in the same semantic space. With such a novel design,
X-Decoder is the first work that provides a unified way to support all types of
image segmentation and a variety of vision-language (VL) tasks. Further, our
design enables seamless interactions across tasks at different granularities
and brings mutual benefits by learning a common and rich pixel-level
visual-semantic understanding space, without any pseudo-labeling. After
pretraining on a mixed set of a limited amount of segmentation data and
millions of image-text pairs, X-Decoder exhibits strong transferability to a
wide range of downstream tasks in both zero-shot and finetuning settings.
Notably, it achieves (1) state-of-the-art results on open-vocabulary
segmentation and referring segmentation on eight datasets; (2) better or
competitive finetuned performance to other generalist and specialist models on
segmentation and VL tasks; and (3) flexibility for efficient finetuning and
novel task composition (e.g., referring captioning and image editing). Code,
demo, video, and visualization are available at https://x-decoder-vl.github.io.
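As a rough illustration of the decoding scheme described in the abstract, the sketch below shows a single transformer decoder that consumes (i) learned, non-semantic latent queries and (ii) semantic queries induced from text, and emits pixel-level mask logits and token-level word logits from the same decoded representations. This is a minimal, hedged sketch: all module names, dimensions, and the single decoder layer are illustrative assumptions, not the released X-Decoder implementation.

```python
# Minimal sketch (assumed names/sizes) of a generalized decoder that handles
# two query types and produces pixel-level and token-level outputs.
import torch
import torch.nn as nn

class GeneralizedDecoderSketch(nn.Module):
    def __init__(self, dim=256, num_latent_queries=100, vocab_size=30522):
        super().__init__()
        # (i) generic, non-semantic queries (learned latents)
        self.latent_queries = nn.Parameter(torch.randn(num_latent_queries, dim))
        # one shared cross-attention decoder layer over image features
        self.decoder = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        # heads that read from the same decoded (semantic) space
        self.mask_embed = nn.Linear(dim, dim)         # pixel-level: dot with pixel features
        self.token_head = nn.Linear(dim, vocab_size)  # token-level: word logits

    def forward(self, pixel_features, text_queries):
        # pixel_features: (B, H*W, dim) image features from a vision backbone
        # text_queries:   (B, T, dim) semantic queries induced from text inputs
        B = pixel_features.size(0)
        latents = self.latent_queries.unsqueeze(0).expand(B, -1, -1)
        queries = torch.cat([latents, text_queries], dim=1)   # one joint query set
        decoded = self.decoder(queries, pixel_features)       # shared decoding
        latent_out, text_out = decoded.split(
            [latents.size(1), text_queries.size(1)], dim=1)
        # pixel-level output: mask logits per latent query
        masks = torch.einsum("bqd,bpd->bqp", self.mask_embed(latent_out), pixel_features)
        # token-level output: language logits per text query position
        word_logits = self.token_head(text_out)
        return masks, word_logits

# Usage with random tensors (B=2, 32x32 feature map, 8 text tokens):
model = GeneralizedDecoderSketch()
masks, word_logits = model(torch.randn(2, 32 * 32, 256), torch.randn(2, 8, 256))
print(masks.shape, word_logits.shape)  # (2, 100, 1024) and (2, 8, 30522)
```

In the paper's framing, the shared visual-semantic space is what lets segmentation and vision-language tasks interact and benefit each other; the sketch mirrors this by having both output heads read from the same decoder output.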
Related papers
- OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding [112.87441334765693]
OMG-LLaVA is a new framework combining powerful pixel-level vision understanding with reasoning abilities.
It can accept various visual and text prompts for flexible user interaction.
OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model.
arXiv Detail & Related papers (2024-06-27T17:59:01Z)
- Do Vision and Language Encoders Represent the World Similarly? [22.70701869402434]
Aligned text-image encoders such as CLIP have become the de facto model for vision-language tasks.
We find that the representation spaces of unaligned and aligned encoders are semantically similar.
In the absence of statistical similarity in aligned encoders like CLIP, we show that a possible matching of unaligned encoders exists without any training.
arXiv Detail & Related papers (2024-01-10T15:51:39Z)
- i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data [101.52821120195975]
i-Code V2 is the first model capable of generating natural language from any combination of Vision, Language, and Speech data.
The system is pretrained end-to-end on a large collection of dual- and single-modality datasets.
arXiv Detail & Related papers (2023-05-21T01:25:44Z)
- Linguistic Query-Guided Mask Generation for Referring Image Segmentation [10.130530501400079]
Referring image segmentation aims to segment the image region of interest according to the given language expression.
We propose an end-to-end framework built on a transformer to perform linguistic query-guided mask generation.
arXiv Detail & Related papers (2023-01-16T13:38:22Z)
- OmniVL: One Foundation Model for Image-Language and Video-Language Tasks [117.57580168859512]
We present OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture.
We demonstrate, for the first time, that such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer.
We introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), and video-label (e.g., video action recognition) data together.
arXiv Detail & Related papers (2022-09-15T17:59:59Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single-stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled keyword prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose CRIS, an end-to-end CLIP-Driven Referring Image Segmentation framework.
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment (see the sketch after this list).
Our proposed framework significantly outperforms state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
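For the CRIS entry above, here is a minimal sketch of what a text-to-pixel contrastive alignment objective could look like: a sentence embedding is contrasted against per-pixel embeddings so that pixels inside the referred region score high and the rest score low. The loss form, tensor shapes, and names are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical text-to-pixel contrastive alignment (assumed formulation).
import torch
import torch.nn.functional as F

def text_to_pixel_contrastive_loss(text_emb, pixel_emb, target_mask, temperature=0.07):
    """text_emb: (B, D) sentence embedding; pixel_emb: (B, D, H, W) pixel features;
    target_mask: (B, H, W) binary ground-truth region for the referring expression."""
    text_emb = F.normalize(text_emb, dim=-1)
    pixel_emb = F.normalize(pixel_emb, dim=1)
    # cosine similarity between the sentence and every pixel, scaled by temperature
    logits = torch.einsum("bd,bdhw->bhw", text_emb, pixel_emb) / temperature
    # binary cross-entropy: pull referred pixels toward the text, push others away
    return F.binary_cross_entropy_with_logits(logits, target_mask.float())

# Usage with random tensors:
loss = text_to_pixel_contrastive_loss(
    torch.randn(2, 256), torch.randn(2, 256, 32, 32), torch.randint(0, 2, (2, 32, 32)))
print(loss.item())
```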
This list is automatically generated from the titles and abstracts of the papers in this site.