Towards Unifying Medical Vision-and-Language Pre-training via Soft
Prompts
- URL: http://arxiv.org/abs/2302.08958v1
- Date: Fri, 17 Feb 2023 15:43:42 GMT
- Title: Towards Unifying Medical Vision-and-Language Pre-training via Soft
Prompts
- Authors: Zhihong Chen, Shizhe Diao, Benyou Wang, Guanbin Li, Xiang Wan
- Abstract summary: There exist two typical types, i.e., the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used.
We propose an effective yet straightforward scheme named PTUnifier to unify the two types.
We first unify the input format by introducing visual and textual prompts, which serve as a feature bank that stores the most representative images/texts.
- Score: 63.84720380390935
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Medical vision-and-language pre-training (Med-VLP) has shown promising
improvements on many downstream medical tasks owing to its applicability to
extracting generic representations from medical images and texts. Practically,
there exist two typical types, i.e., the fusion-encoder type and the
dual-encoder type, depending on whether a heavy fusion module is used. The
former is superior at multi-modal tasks owing to the sufficient interaction
between modalities; the latter is good at uni-modal and cross-modal tasks due
to the single-modality encoding ability. To take advantage of these two types,
we propose an effective yet straightforward scheme named PTUnifier to unify the
two types. We first unify the input format by introducing visual and textual
prompts, which serve as a feature bank that stores the most representative
images/texts. By doing so, a single model could serve as a foundation model
that processes various tasks adopting different input formats (i.e.,
image-only, text-only, and image-text-pair). Furthermore, we
construct a prompt pool (instead of static ones) to improve diversity and
scalability. Experimental results show that our approach achieves
state-of-the-art results on a broad range of tasks, spanning uni-modal tasks
(i.e., image/text classification and text summarization), cross-modal tasks
(i.e., image-to-text generation and image-text/text-image retrieval), and
multi-modal tasks (i.e., visual question answering),
demonstrating the effectiveness of our approach. Note that the adoption of
prompts is orthogonal to most existing Med-VLP approaches and could be a
beneficial and complementary extension to these approaches.
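To make the prompt-based unification concrete, here is a minimal PyTorch sketch of the core idea: when one modality is missing, the present modality queries a learnable prompt pool (the "feature bank"), and the retrieved prompts stand in for the absent input, so a single encoder stack can handle image-only, text-only, and image-text inputs. This is an illustrative assumption of how such a scheme could look, not the authors' released implementation; all module names, sizes, and the top-k retrieval rule are made up for the example.

```python
# Minimal sketch (PyTorch) of prompt-based input unification, loosely following
# the PTUnifier idea: when one modality is missing, query a learnable prompt
# pool with the present modality and use the retrieved prompts as a stand-in.
# All names, dimensions, and the retrieval rule are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptPool(nn.Module):
    def __init__(self, pool_size: int = 256, num_prompts: int = 8, dim: int = 768):
        super().__init__()
        # A pool of candidate prompt vectors (the "feature bank").
        self.pool = nn.Parameter(torch.randn(pool_size, dim) * 0.02)
        self.num_prompts = num_prompts

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        # query: (batch, dim) pooled feature of the *present* modality.
        # Pick the top-k most similar pool entries as prompts for the missing one.
        sim = F.normalize(query, dim=-1) @ F.normalize(self.pool, dim=-1).t()
        topk = sim.topk(self.num_prompts, dim=-1).indices   # (batch, k)
        return self.pool[topk]                              # (batch, k, dim)


class UnifiedEncoder(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.visual_prompts = PromptPool(dim=dim)   # stands in for missing images
        self.textual_prompts = PromptPool(dim=dim)  # stands in for missing text
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, img_tokens=None, txt_tokens=None):
        # Unify the input format: always feed a [visual ; textual] token sequence.
        if img_tokens is None:                       # text-only input
            img_tokens = self.visual_prompts(txt_tokens.mean(dim=1))
        if txt_tokens is None:                       # image-only input
            txt_tokens = self.textual_prompts(img_tokens.mean(dim=1))
        return self.fusion(torch.cat([img_tokens, txt_tokens], dim=1))


# Usage: the same model handles paired, image-only, and text-only inputs.
model = UnifiedEncoder()
img = torch.randn(2, 49, 768)   # e.g. patch features from an image encoder
txt = torch.randn(2, 32, 768)   # e.g. token features from a text encoder
_ = model(img, txt)             # image-text pair
_ = model(img, None)            # image-only
_ = model(None, txt)            # text-only
```

Because the prompts are drawn from a pool rather than being a single static vector, different inputs can select different stand-ins, which is the diversity and scalability argument made in the abstract.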
Related papers
- Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
Leopard is a vision-language model for handling vision-language tasks involving multiple text-rich images.
First, we curated about one million high-quality multimodal instruction-tuning examples, tailored to text-rich, multi-image scenarios.
Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
arXiv Detail & Related papers (2024-10-02T16:55:01Z)
- UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion [36.06457895469353]
UNIMO-G is a conditional diffusion framework that operates on multimodal prompts with interleaved textual and visual inputs.
It excels in both text-to-image generation and zero-shot subject-driven synthesis.
arXiv Detail & Related papers (2024-01-24T11:36:44Z)
- De-Diffusion Makes Text a Strong Cross-Modal Interface [33.90004746543745]
We employ an autoencoder that uses a pre-trained text-to-image diffusion model for decoding.
Experiments validate the precision and comprehensiveness of De-Diffusion text in representing images.
A single De-Diffusion model can generalize to provide transferable prompts for different text-to-image tools.
arXiv Detail & Related papers (2023-11-01T16:12:40Z)
- Emu: Generative Pretraining in Multimodality [43.759593451544546]
Emu is a Transformer-based multimodal foundation model that can seamlessly generate images and texts in a multimodal context.
Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks.
Emu demonstrates superb performance compared to state-of-the-art large multimodal models.
arXiv Detail & Related papers (2023-07-11T12:45:39Z)
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
- FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning [66.38951790650887]
Multimodal tasks in the fashion domain have significant potential for e-commerce.
We propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs.
We show the triplet-based tasks are an effective addition to standard multimodal pre-training tasks.
arXiv Detail & Related papers (2022-10-26T21:01:19Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- Visual Grounding Strategies for Text-Only Natural Language Processing [1.2183405753834562]
Multimodal extensions of BERT allow joint modeling of texts and images, leading to state-of-the-art results on multimodal tasks such as Visual Question Answering.
Here, we leverage multimodal modeling for purely textual tasks with the expectation that the multimodal pretraining provides a grounding that can improve text processing accuracy.
The first type of strategy, referred to as transferred grounding, consists in applying multimodal models to text-only tasks using a placeholder to replace the image input (see the sketch after this list).
The second one, which we call associative grounding, harnesses image retrieval to match texts with related images during both
arXiv Detail & Related papers (2021-03-25T16:03:00Z)
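The "transferred grounding" strategy summarized in the last entry can be pictured with a short sketch: a multimodal encoder is applied to a text-only task by feeding a learned placeholder where image features would normally go. The code below is a hedged illustration under assumed shapes and module choices, not code from the cited paper.

```python
# Illustrative sketch of "transferred grounding": run a text+image model on a
# text-only task by feeding a learned placeholder in place of image features.
# The encoder, shapes, and placeholder scheme are assumptions for illustration.
import torch
import torch.nn as nn


class PlaceholderGroundedClassifier(nn.Module):
    def __init__(self, dim: int = 768, num_classes: int = 2, num_img_tokens: int = 36):
        super().__init__()
        # Learned stand-in for the visual features the multimodal model expects.
        self.placeholder_img = nn.Parameter(torch.zeros(num_img_tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, txt_tokens: torch.Tensor) -> torch.Tensor:
        # txt_tokens: (batch, seq, dim) text features from any text encoder.
        batch = txt_tokens.size(0)
        img = self.placeholder_img.unsqueeze(0).expand(batch, -1, -1)
        fused = self.encoder(torch.cat([img, txt_tokens], dim=1))
        return self.head(fused[:, 0])   # classify from the first fused token


clf = PlaceholderGroundedClassifier()
logits = clf(torch.randn(4, 32, 768))   # text-only batch, no images needed
```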