Unifying Vision-and-Language Tasks via Text Generation
- URL: http://arxiv.org/abs/2102.02779v1
- Date: Thu, 4 Feb 2021 17:59:30 GMT
- Title: Unifying Vision-and-Language Tasks via Text Generation
- Authors: Jaemin Cho, Jie Lei, Hao Tan, Mohit Bansal
- Abstract summary: We propose a unified framework that learns different tasks in a single architecture.
Our models learn to generate labels in text based on the visual and textual inputs.
Our generative approach shows better generalization ability on answering questions that have rare answers.
- Score: 81.3910771082967
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing methods for vision-and-language learning typically require designing
task-specific architectures and objectives for each task. For example, a
multi-label answer classifier for visual question answering, a region scorer
for referring expression comprehension, and a language decoder for image
captioning. To alleviate these hassles, in this work, we propose a unified
framework that learns different tasks in a single architecture with the same
language modeling objective, i.e., multimodal conditional text generation,
where our models learn to generate labels in text based on the visual and
textual inputs. On 7 popular vision-and-language benchmarks, including visual
question answering, referring expression comprehension, and visual commonsense
reasoning, most of which have been previously modeled as discriminative tasks,
our generative approach (with a single unified architecture) reaches comparable
performance to recent task-specific state-of-the-art vision-and-language
models. Moreover, our generative approach shows better generalization ability
on answering questions that have rare answers. In addition, we show that our
framework allows multi-task learning in a single architecture with a single set
of parameters, which achieves similar performance to separately optimized
single-task models. Our code will be publicly available at:
https://github.com/j-min/VL-T5
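As a rough illustration of the "labels as text" framing described in the abstract, the sketch below prepends projected region features to the text embeddings of a pretrained T5 encoder and supervises the answer as plain text with the ordinary language-modeling loss. This is a minimal sketch, not the authors' released VL-T5 code (see the repository above); the "vqa:" task prefix, the t5-base checkpoint, and the 36 region features of dimension 2048 are assumptions for illustration.

```python
# Minimal sketch of framing VQA as multimodal conditional text generation.
# Assumptions (not the authors' implementation): t5-base backbone,
# 36 precomputed region features of dimension 2048, a "vqa:" task prefix.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Project region features into the T5 embedding space so they can be
# prepended to the question tokens as "visual tokens".
visual_proj = nn.Linear(2048, model.config.d_model)

question = "vqa: what is the man holding?"
answer = "a surfboard"  # the label is supervised as plain text

enc = tokenizer(question, return_tensors="pt")
text_embeds = model.get_input_embeddings()(enc.input_ids)   # (1, T, d_model)
region_feats = torch.randn(1, 36, 2048)                     # placeholder region features
visual_embeds = visual_proj(region_feats)                   # (1, 36, d_model)

inputs_embeds = torch.cat([visual_embeds, text_embeds], dim=1)
attention_mask = torch.cat(
    [torch.ones(1, visual_embeds.size(1), dtype=torch.long), enc.attention_mask], dim=1
)
labels = tokenizer(answer, return_tensors="pt").input_ids

# One language-modeling objective, regardless of the downstream task.
loss = model(inputs_embeds=inputs_embeds,
             attention_mask=attention_mask,
             labels=labels).loss
loss.backward()
```

Because every task shares this objective, switching from VQA to, say, captioning only changes the text prefix and the target string, which is what allows a single set of parameters to serve multiple tasks.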
Related papers
- MiniGPT-v2: large language model as a unified interface for
vision-language multi-task learning [65.60607895153692]
MiniGPT-v2 is a model that can serve as a unified interface for handling a variety of vision-language tasks.
We propose using unique identifiers for different tasks when training the model.
Our results show that MiniGPT-v2 achieves strong performance on many visual question-answering and visual grounding benchmarks.
arXiv Detail & Related papers (2023-10-14T03:22:07Z) - PaLI-X: On Scaling up a Multilingual Vision and Language Model [166.9837904115951]
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model.
Our model achieves new levels of performance on a wide range of varied and complex tasks.
We observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
arXiv Detail & Related papers (2023-05-29T18:58:38Z) - Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks [39.12025963907317]
Unified-IO is a model that performs a large variety of AI tasks, spanning classical computer vision, vision-and-language, and natural language processing tasks.
We achieve this unification by homogenizing every supported input and output into a sequence of discrete vocabulary tokens (a minimal sketch of this kind of discretization appears after this list).
Unified-IO is the first model capable of performing all 7 tasks on the GRIT benchmark.
arXiv Detail & Related papers (2022-06-17T17:53:47Z) - On Advances in Text Generation from Images Beyond Captioning: A Case
Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality (CLIP image representations and the scaling of language models) do not consistently improve multimodal self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z) - Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding [35.01174511816063]
We present a novel method, named Pseudo-Q, to automatically generate pseudo language queries for supervised training.
Our method leverages an off-the-shelf object detector to identify visual objects from unlabeled images.
We develop a visual-language model equipped with a multi-level cross-modality attention mechanism.
arXiv Detail & Related papers (2022-03-16T09:17:41Z) - Vokenization: Improving Language Understanding with Contextualized,
Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z) - Probing Contextual Language Models for Common Ground with Visual
Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are at distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly underperform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z) - All-in-One Image-Grounded Conversational Agents [31.28974522911758]
We design an architecture that combines state-of-the-art Transformer and ResNeXt modules fed into a novel attentive multimodal module.
We provide a thorough analysis of the components of the model, and transfer performance when training on one, some, or all of the tasks.
arXiv Detail & Related papers (2019-12-28T03:51:52Z)
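As a hedged aside on the Unified-IO entry above, "homogenizing inputs and outputs into discrete vocabulary tokens" can be illustrated by quantizing bounding-box coordinates into a fixed number of bins that map to extra vocabulary entries, so that detection targets become ordinary token sequences. The 1000-bin count and the <loc_i> token naming below are illustrative assumptions, not that paper's exact scheme.

```python
# Hypothetical sketch: turning a bounding box into discrete "location tokens"
# that a text decoder can emit. num_bins=1000 and the <loc_i> names are
# illustrative assumptions, not the exact Unified-IO vocabulary.
def box_to_tokens(box, image_w, image_h, num_bins=1000):
    """Map (x1, y1, x2, y2) in pixels to discrete location tokens."""
    x1, y1, x2, y2 = box
    coords = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h]
    bins = [min(int(c * num_bins), num_bins - 1) for c in coords]
    return [f"<loc_{b}>" for b in bins]

def tokens_to_box(tokens, image_w, image_h, num_bins=1000):
    """Invert the mapping, up to quantization error."""
    bins = [int(t.strip("<>").split("_")[1]) for t in tokens]
    scale = [image_w, image_h, image_w, image_h]
    return [(b + 0.5) / num_bins * s for b, s in zip(bins, scale)]

print(box_to_tokens((48, 32, 320, 240), image_w=640, image_h=480))
# ['<loc_75>', '<loc_66>', '<loc_500>', '<loc_500>']
```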