Unifying Vision-and-Language Tasks via Text Generation
- URL: http://arxiv.org/abs/2102.02779v1
- Date: Thu, 4 Feb 2021 17:59:30 GMT
- Title: Unifying Vision-and-Language Tasks via Text Generation
- Authors: Jaemin Cho, Jie Lei, Hao Tan, Mohit Bansal
- Abstract summary: We propose a unified framework that learns different tasks in a single architecture.
Our models learn to generate labels in text based on the visual and textual inputs.
Our generative approach shows better generalization ability on answering questions that have rare answers.
- Score: 81.3910771082967
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing methods for vision-and-language learning typically require designing
task-specific architectures and objectives for each task. For example, a
multi-label answer classifier for visual question answering, a region scorer
for referring expression comprehension, and a language decoder for image
captioning. To alleviate these hassles, in this work, we propose a unified
framework that learns different tasks in a single architecture with the same
language modeling objective, i.e., multimodal conditional text generation,
where our models learn to generate labels in text based on the visual and
textual inputs. On 7 popular vision-and-language benchmarks, including visual
question answering, referring expression comprehension, and visual commonsense
reasoning, most of which have been previously modeled as discriminative tasks,
our generative approach (with a single unified architecture) reaches comparable
performance to recent task-specific state-of-the-art vision-and-language
models. Moreover, our generative approach shows better generalization ability
on answering questions that have rare answers. In addition, we show that our
framework allows multi-task learning in a single architecture with a single set
of parameters, which achieves similar performance to separately optimized
single-task models. Our code will be publicly available at:
https://github.com/j-min/VL-T5
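As a rough illustration of the "labels as text" framing described in the abstract, the sketch below prepends projected region features to the text embeddings of a pretrained T5 encoder and supervises the answer as plain text with the ordinary language-modeling loss. This is a minimal sketch, not the authors' released VL-T5 code (see the repository above); the "vqa:" task prefix, the t5-base checkpoint, and the 36 region features of dimension 2048 are assumptions for illustration.

```python
# Minimal sketch of framing VQA as multimodal conditional text generation.
# Assumptions (not the authors' implementation): t5-base backbone,
# 36 precomputed region features of dimension 2048, a "vqa:" task prefix.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Project region features into the T5 embedding space so they can be
# prepended to the question tokens as "visual tokens".
visual_proj = nn.Linear(2048, model.config.d_model)

question = "vqa: what is the man holding?"
answer = "a surfboard"  # the label is supervised as plain text

enc = tokenizer(question, return_tensors="pt")
text_embeds = model.get_input_embeddings()(enc.input_ids)   # (1, T, d_model)
region_feats = torch.randn(1, 36, 2048)                     # placeholder region features
visual_embeds = visual_proj(region_feats)                   # (1, 36, d_model)

inputs_embeds = torch.cat([visual_embeds, text_embeds], dim=1)
attention_mask = torch.cat(
    [torch.ones(1, visual_embeds.size(1), dtype=torch.long), enc.attention_mask], dim=1
)
labels = tokenizer(answer, return_tensors="pt").input_ids

# One language-modeling objective, regardless of the downstream task.
loss = model(inputs_embeds=inputs_embeds,
             attention_mask=attention_mask,
             labels=labels).loss
loss.backward()
```

Because every task shares this objective, switching from VQA to, say, captioning only changes the text prefix and the target string, which is what allows a single set of parameters to serve multiple tasks.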
Related papers
- MiniGPT-v2: large language model as a unified interface for
vision-language multi-task learning [65.60607895153692]
MiniGPT-v2 is a model that can serve as a unified interface for handling a variety of vision-language tasks.
We propose using unique identifiers for different tasks when training the model.
Our results show that MiniGPT-v2 achieves strong performance on many visual question-answering and visual grounding benchmarks.
arXiv Detail & Related papers (2023-10-14T03:22:07Z) - PaLI-X: On Scaling up a Multilingual Vision and Language Model [166.9837904115951]
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model.
Our model achieves new levels of performance on a wide range of varied and complex tasks.
We observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
arXiv Detail & Related papers (2023-05-29T18:58:38Z) - Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks [39.12025963907317]
Unified-IO is a model that performs a large variety of AI tasks, spanning classical computer vision, vision-and-language, and natural language processing tasks.
We achieve this unification by homogenizing every supported input and output into a sequence of discrete vocabulary tokens (a minimal sketch of this kind of discretization appears after this list).
Unified-IO is the first model capable of performing all 7 tasks on the GRIT benchmark.
arXiv Detail & Related papers (2022-06-17T17:53:47Z) - On Advances in Text Generation from Images Beyond Captioning: A Case
Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality (CLIP image representations and the scaling of language models) do not consistently improve multimodal self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z) - Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding [35.01174511816063]
We present a novel method, named Pseudo-Q, to automatically generate pseudo language queries for supervised training.
Our method leverages an off-the-shelf object detector to identify visual objects from unlabeled images.
We develop a visual-language model equipped with a multi-level cross-modality attention mechanism.
arXiv Detail & Related papers (2022-03-16T09:17:41Z) - Vokenization: Improving Language Understanding with Contextualized,
Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z) - Probing Contextual Language Models for Common Ground with Visual
Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are at distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly underperform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z) - All-in-One Image-Grounded Conversational Agents [31.28974522911758]
We design an architecture that combines state-of-the-art Transformer and ResNeXt modules fed into a novel attentive multimodal module.
We provide a thorough analysis of the components of the model, and transfer performance when training on one, some, or all of the tasks.
arXiv Detail & Related papers (2019-12-28T03:51:52Z)
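As a hedged aside on the Unified-IO entry above, "homogenizing inputs and outputs into discrete vocabulary tokens" can be illustrated by quantizing bounding-box coordinates into a fixed number of bins that map to extra vocabulary entries, so that detection targets become ordinary token sequences. The 1000-bin count and the <loc_i> token naming below are illustrative assumptions, not that paper's exact scheme.

```python
# Hypothetical sketch: turning a bounding box into discrete "location tokens"
# that a text decoder can emit. num_bins=1000 and the <loc_i> names are
# illustrative assumptions, not the exact Unified-IO vocabulary.
def box_to_tokens(box, image_w, image_h, num_bins=1000):
    """Map (x1, y1, x2, y2) in pixels to discrete location tokens."""
    x1, y1, x2, y2 = box
    coords = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h]
    bins = [min(int(c * num_bins), num_bins - 1) for c in coords]
    return [f"<loc_{b}>" for b in bins]

def tokens_to_box(tokens, image_w, image_h, num_bins=1000):
    """Invert the mapping, up to quantization error."""
    bins = [int(t.strip("<>").split("_")[1]) for t in tokens]
    scale = [image_w, image_h, image_w, image_h]
    return [(b + 0.5) / num_bins * s for b, s in zip(bins, scale)]

print(box_to_tokens((48, 32, 320, 240), image_w=640, image_h=480))
# ['<loc_75>', '<loc_66>', '<loc_500>', '<loc_500>']
```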