All in Tokens: Unifying Output Space of Visual Tasks via Soft Token
- URL: http://arxiv.org/abs/2301.02229v1
- Date: Thu, 5 Jan 2023 18:55:20 GMT
- Title: All in Tokens: Unifying Output Space of Visual Tasks via Soft Token
- Authors: Jia Ning, Chen Li, Zheng Zhang, Zigang Geng, Qi Dai, Kun He, Han Hu
- Abstract summary: We show a single unified model that simultaneously handles two typical visual tasks of instance segmentation and depth estimation.
We propose several new techniques that take into account the particularity of visual tasks.
We achieve 0.279 RMSE on the specific task of NYUv2 depth estimation, setting a new record on this benchmark.
- Score: 30.6086480249568
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Unlike language tasks, where the output space is usually limited to a set of
tokens, the output space of visual tasks is more complicated, making it
difficult to build a unified visual model for various visual tasks. In this
paper, we seek to unify the output space of visual tasks, so that we can also
build a unified model for visual tasks. To this end, we demonstrate a single
unified model that simultaneously handles two typical visual tasks of instance
segmentation and depth estimation, which have discrete/fixed-length and
continuous/varied-length outputs, respectively. We propose several new
techniques that take into account the particularities of visual tasks: 1) Soft
token. We employ soft tokens to represent the task output. Unlike hard tokens in
the common VQ-VAE, which are assigned one-hot to discrete
codebooks/vocabularies, soft tokens are assigned softly to the codebook
embeddings. Soft tokens improve the accuracy of both next-token
inference and decoding of the task output; 2) Mask augmentation. Many visual
tasks have corrupted, undefined, or invalid values in their label annotations,
e.g., the occluded areas of depth maps. We show that a mask augmentation technique can
greatly benefit these tasks. With these new techniques and other designs, we
show that the proposed general-purpose task-solver can perform both instance
segmentation and depth estimation well. Particularly, we achieve 0.279 RMSE on
the specific task of NYUv2 depth estimation, setting a new record on this
benchmark. The general-purpose task-solver, dubbed AiT, is available at
https://github.com/SwinTransformer/AiT.
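For a concrete picture of the soft-token idea described above, the following minimal PyTorch sketch contrasts the hard (one-hot, nearest-code) assignment of a standard VQ-VAE with a soft assignment over the codebook embeddings. The function names, tensor shapes, and temperature parameter are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F


def hard_token_assignment(z, codebook):
    """Standard VQ-VAE-style quantization: each latent vector is assigned
    one-hot to its nearest codebook embedding."""
    # z: (N, D) latent vectors; codebook: (K, D) embeddings
    dist = torch.cdist(z, codebook)        # (N, K) pairwise L2 distances
    indices = dist.argmin(dim=-1)          # hard (one-hot) assignment
    return codebook[indices], indices


def soft_token_assignment(z, codebook, temperature=1.0):
    """Soft-token variant: each latent vector becomes a probability-weighted
    mixture of all codebook embeddings instead of a single nearest code."""
    dist = torch.cdist(z, codebook)                   # (N, K)
    weights = F.softmax(-dist / temperature, dim=-1)  # soft assignment over codes
    return weights @ codebook, weights                # (N, D) mixture, (N, K) weights


# Toy usage with illustrative sizes (K=128 codes of dimension D=64).
codebook = torch.randn(128, 64)
z = torch.randn(10, 64)
hard_out, idx = hard_token_assignment(z, codebook)
soft_out, w = soft_token_assignment(z, codebook, temperature=0.5)
```

The key contrast is the `argmax` versus `softmax` step; everything else is scaffolding.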
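The mask-augmentation point can be illustrated in a similar spirit. The sketch below shows one plausible form of masking for depth labels: pixels with undefined or invalid annotations are excluded from the loss, and a random fraction of the remaining pixels is additionally masked during training. This is an assumption-laden illustration (the L1 loss choice, drop ratio, and function name are hypothetical), not the paper's actual procedure.

```python
import torch


def masked_depth_loss(pred, target, valid_mask, drop_ratio=0.2):
    """L1 loss computed only over valid pixels, with extra random masking
    of valid pixels as a simple augmentation.

    pred, target: (B, H, W) depth maps.
    valid_mask:   (B, H, W) bool, False where the annotation is undefined
                  or invalid (e.g. occluded regions with no depth).
    """
    # Randomly drop a fraction of pixels during training (augmentation).
    keep = torch.rand_like(pred) > drop_ratio
    mask = valid_mask & keep
    diff = (pred - target).abs()
    return (diff * mask).sum() / mask.sum().clamp(min=1)


# Toy usage: pretend near-zero ground-truth depths are invalid.
pred = torch.rand(2, 48, 64)
target = torch.rand(2, 48, 64)
valid = target > 0.1
loss = masked_depth_loss(pred, target, valid)
```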
Related papers
- Task Vectors are Cross-Modal [58.19152818504624]
We investigate the internal representations of vision-and-language models (VLMs).
We consider tasks specified through examples or instructions, using either text or image inputs.
We find that conceptually similar tasks are mapped to similar task vector representations, regardless of how they are specified.
arXiv Detail & Related papers (2024-10-29T17:59:45Z)
- Retrieval Replace Reduction: An effective visual token reduction method via semantic match [32.33892531885448]
We introduce TRSM (Token Reduction via Semantic Match), which effectively reduces the number of visual tokens without compromising MLLM performance.
Inspired by how humans process multimodal tasks, TRSM leverages semantic information from one modality to match relevant semantics in another, reducing the number of visual tokens.
Based on experimental results, our approach compresses the visual tokens by 20%, achieving comparable performance across diverse visual question-answering and reasoning tasks.
arXiv Detail & Related papers (2024-10-09T07:13:22Z)
- Masked AutoDecoder is Effective Multi-Task Vision Generalist [64.43215311406195]
Masked AutoDecoder (MAD) is an effective multi-task vision generalist.
First, we develop a parallel decoding framework that introduces bi-directional attention to capture contextual dependencies.
Second, we design a masked sequence modeling approach that learns rich task contexts by masking and reconstructing task sequences.
arXiv Detail & Related papers (2024-03-12T14:36:52Z)
- SA$^2$VP: Spatially Aligned-and-Adapted Visual Prompt [59.280491260635266]
Methods for visual prompt tuning follow the sequential modeling paradigm stemming from NLP.
The SA$^2$VP model learns a two-dimensional prompt token map of equal (or scaled) size to the image token map.
Our model can conduct individual prompting for different image tokens in a fine-grained manner.
arXiv Detail & Related papers (2023-12-16T08:23:43Z)
- A Dynamic Feature Interaction Framework for Multi-task Visual Perception [100.98434079696268]
We devise an efficient unified framework to solve multiple common perception tasks.
These tasks include instance segmentation, semantic segmentation, monocular 3D detection, and depth estimation.
Our proposed framework, termed D2BNet, demonstrates a unique approach to parameter-efficient predictions for multi-task perception.
arXiv Detail & Related papers (2023-06-08T09:24:46Z)
- Universal Few-shot Learning of Dense Prediction Tasks with Visual Token Matching [26.26540176172197]
We propose Visual Token Matching (VTM) as a universal few-shot learner for arbitrary dense prediction tasks.
VTM flexibly adapts to any task with a tiny amount of task-specific parameters that modulate the matching algorithm.
We evaluate VTM on a challenging variant of the Taskonomy dataset and observe that it robustly few-shot learns various unseen dense prediction tasks.
arXiv Detail & Related papers (2023-03-27T07:58:42Z)
- A Unified Sequence Interface for Vision Tasks [87.328893553186]
We show that a diverse set of "core" computer vision tasks can be unified if formulated in terms of a shared pixel-to-sequence interface.
We focus on four tasks, namely, object detection, instance segmentation, keypoint detection, and image captioning, all with diverse types of outputs.
We show that one can train a neural network with a single model architecture and loss function on all these tasks, with no task-specific customization.
arXiv Detail & Related papers (2022-06-15T17:08:53Z)
- Vector-Quantized Input-Contextualized Soft Prompts for Natural Language Understanding [62.45760673220339]
We propose a novel way of prompting, Vector-quantized Input-contextualized Prompt Tuning or VIP.
Over a wide range of natural language understanding tasks, our proposed VIP framework outperforms prompt tuning (PT) by a margin of 1.19%.
arXiv Detail & Related papers (2022-05-23T03:51:27Z)