InstructSeq: Unifying Vision Tasks with Instruction-conditioned
Multi-modal Sequence Generation
- URL: http://arxiv.org/abs/2311.18835v1
- Date: Thu, 30 Nov 2023 18:59:51 GMT
- Title: InstructSeq: Unifying Vision Tasks with Instruction-conditioned
Multi-modal Sequence Generation
- Authors: Rongyao Fang, Shilin Yan, Zhaoyang Huang, Jingqiu Zhou, Hao Tian,
Jifeng Dai, Hongsheng Li
- Abstract summary: InstructSeq is an instruction-conditioned multi-modal modeling framework.
It unifies diverse vision tasks through flexible natural language control and handling of both visual and textual data.
- Score: 59.24938416319019
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Empowering models to dynamically accomplish tasks specified through natural
language instructions represents a promising path toward more capable and
general artificial intelligence. In this work, we introduce InstructSeq, an
instruction-conditioned multi-modal modeling framework that unifies diverse
vision tasks through flexible natural language control and handling of both
visual and textual data. InstructSeq employs a multimodal transformer
architecture encompassing visual, language, and sequential modeling. We utilize
a visual encoder to extract image features and a text encoder to encode
instructions. An autoregressive transformer fuses the representations and
generates sequential task outputs. By training with LLM-generated natural
language instructions, InstructSeq acquires a strong comprehension of free-form
instructions for specifying visual tasks. This provides an intuitive interface
for directing capabilities using flexible natural instructions. Without any
task-specific tuning, InstructSeq achieves compelling performance on semantic
segmentation, referring expression segmentation/comprehension, and image
captioning. The flexible control and multi-task unification empower the model
with more human-like versatility and generalizability for computer vision. The
code will be released soon at https://github.com/rongyaofang/InstructSeq.
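The abstract outlines a three-stage pipeline: a visual encoder extracts image features, a text encoder embeds the natural-language instruction, and an autoregressive transformer fuses the two and generates the output token sequence. Below is a minimal PyTorch-style sketch of that flow; the module choices, dimensions, and the concatenation-based fusion are illustrative assumptions and are not taken from the paper's released implementation.

```python
# Minimal sketch of the instruction-conditioned pipeline described in the
# abstract: visual encoder + text encoder -> autoregressive transformer.
# All module choices, dimensions, and the fusion-by-concatenation scheme are
# assumptions for illustration only.
import torch
import torch.nn as nn


class InstructSeqSketch(nn.Module):
    def __init__(self, vocab_size=32000, d_model=768, n_heads=12, n_layers=6):
        super().__init__()
        # Stand-ins for the pretrained visual and text encoders: here we only
        # project precomputed features into the shared model dimension.
        self.visual_proj = nn.Linear(1024, d_model)  # image patch features -> d_model
        self.text_proj = nn.Linear(768, d_model)     # instruction embeddings -> d_model
        self.token_emb = nn.Embedding(vocab_size, d_model)
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        # Autoregressive transformer that attends over the fused context.
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image_feats, instr_embeds, output_tokens):
        # Fuse visual and instruction representations into one context sequence.
        context = torch.cat(
            [self.visual_proj(image_feats), self.text_proj(instr_embeds)], dim=1
        )
        tgt = self.token_emb(output_tokens)
        # Causal mask so each output position only attends to earlier positions.
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.decoder(tgt, context, tgt_mask=causal)
        return self.lm_head(hidden)  # logits over the output token vocabulary


# Example: batch of 2 images (196 patch features) with 16-token instructions.
model = InstructSeqSketch()
logits = model(
    image_feats=torch.randn(2, 196, 1024),
    instr_embeds=torch.randn(2, 16, 768),
    output_tokens=torch.randint(0, 32000, (2, 8)),
)
print(logits.shape)  # torch.Size([2, 8, 32000])
```

In this sketch the same decoder serves every task, and the instruction alone determines which output sequence (e.g., segmentation tokens or caption text) is produced, which mirrors the task unification claimed in the abstract.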
Related papers
- TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning [54.033346088090674]
We introduce TWIST & SCOUT, a framework that equips pre-trained MLLMs with visual grounding ability.
To fine-tune the model effectively, we generate a high-quality synthetic dataset we call SCOUT.
This dataset provides rich supervision signals, describing a step-by-step multimodal reasoning process.
arXiv Detail & Related papers (2024-10-14T13:35:47Z)
- PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions [66.92809850624118]
PixWizard is an image-to-image visual assistant designed for image generation, manipulation, and translation based on free-form language instructions.
We cast a variety of vision tasks into a unified image-text-to-image generation framework and curate an Omni Pixel-to-Pixel Instruction-Tuning dataset.
Our experiments demonstrate that PixWizard not only shows impressive generative and understanding abilities for images with diverse resolutions but also exhibits promising generalization capabilities with unseen tasks and human instructions.
arXiv Detail & Related papers (2024-09-23T17:59:46Z)
- InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following [26.457571615782985]
InstructAny2Pix is a flexible multi-modal instruction-following system that enables users to edit an input image using instructions involving audio, images, and text.
We demonstrate that our system can perform a series of novel instruction-guided editing tasks.
arXiv Detail & Related papers (2023-12-11T17:53:45Z)
- InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists [66.85125112199898]
We develop a unified language interface for computer vision tasks that abstracts away task-specific design choices.
Our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models.
arXiv Detail & Related papers (2023-09-30T14:26:43Z)
- Valley: Video Assistant with Large Language model Enhanced abilitY [41.79449203718827]
We introduce Valley, a Video Assistant with Large Language model Enhanced abilitY.
To empower Valley with video comprehension and instruction-following capabilities, we construct a video instruction dataset.
We employ ChatGPT to facilitate the construction of task-oriented conversation data.
arXiv Detail & Related papers (2023-06-12T16:11:10Z)
- MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks [59.09343552273045]
We propose a decoder-only model for multimodal tasks, which is surprisingly effective at jointly learning these disparate vision-language tasks.
We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks.
Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models.
arXiv Detail & Related papers (2023-03-29T16:42:30Z)
- Instruction-Following Agents with Multimodal Transformer [95.70039658112873]
We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments.
Our method consists of a multimodal transformer that encodes visual observations and language instructions.
We show that this unified transformer model outperforms all state-of-the-art pre-trained or trained-from-scratch methods in both single-task and multi-task settings.
arXiv Detail & Related papers (2022-10-24T17:46:47Z)
- Unifying Vision-and-Language Tasks via Text Generation [81.3910771082967]
We propose a unified framework that learns different tasks in a single architecture.
Our models learn to generate labels in text based on the visual and textual inputs.
Our generative approach shows better generalization ability on answering questions that have rare answers.
arXiv Detail & Related papers (2021-02-04T17:59:30Z)