Making the Most of Text Semantics to Improve Biomedical Vision--Language
Processing
- URL: http://arxiv.org/abs/2204.09817v1
- Date: Thu, 21 Apr 2022 00:04:35 GMT
- Title: Making the Most of Text Semantics to Improve Biomedical Vision--Language
Processing
- Authors: Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C. Castro,
Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann,
Aditya Nori, Javier Alvarez-Valle, Hoifung Poon, and Ozan Oktay
- Abstract summary: We show that textual semantic modelling can substantially improve contrastive learning in self-supervised vision--language processing.
We propose a self-supervised joint vision--language approach with a focus on better text modelling.
- Score: 17.96645738679543
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal data, such as radiology images and reports, abounds in
biomedicine. Interpreting this data at scale is essential for improving clinical
care and accelerating clinical research. Biomedical text, with its complex
semantics, poses additional challenges for vision--language modelling compared
to the general domain, and previous work has used insufficiently adapted models
that lack domain-specific language understanding. In this paper, we show that
principled textual semantic modelling can substantially improve contrastive
learning in self-supervised vision--language processing. We release a language
model that achieves state-of-the-art results in radiology natural language
inference through its improved vocabulary and novel language pretraining
objective leveraging semantics and discourse characteristics in radiology
reports. Further, we propose a self-supervised joint vision--language approach
with a focus on better text modelling. It establishes new state-of-the-art
results on a wide range of publicly available benchmarks, in part by leveraging
our new domain-specific language model. We release a new dataset with
locally-aligned phrase grounding annotations by radiologists to facilitate the
study of complex semantic modelling in biomedical vision--language processing.
A broad evaluation, including on this new dataset, shows that our contrastive
learning approach, aided by textual-semantic modelling, outperforms prior
methods in segmentation tasks, despite only using a global-alignment objective.
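To make the global-alignment objective concrete, below is a minimal sketch of the symmetric image--text contrastive (InfoNCE) loss that CLIP-style vision--language pretraining of this kind relies on. It is an illustrative reconstruction, not the authors' released code: the function name, temperature value, and toy usage are assumptions.

```python
import torch
import torch.nn.functional as F


def global_alignment_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/report embeddings.

    Each image is pulled towards its own report and pushed away from the
    other reports in the batch (and vice versa). Supervision is purely
    global: one vector per image and one per report, no patch-level terms.
    """
    # L2-normalise so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix, scaled by temperature: shape (B, B).
    logits = image_emb @ text_emb.t() / temperature

    # The i-th image is paired with the i-th report, so the matching
    # targets lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> report
    loss_t2i = F.cross_entropy(logits.t(), targets)  # report -> image
    return 0.5 * (loss_i2t + loss_t2i)


# Toy usage: random vectors stand in for image- and text-encoder outputs.
if __name__ == "__main__":
    batch, dim = 8, 128
    loss = global_alignment_loss(torch.randn(batch, dim),
                                 torch.randn(batch, dim))
    print(f"contrastive loss: {loss.item():.4f}")
```

Because the supervision here is global (one vector per image and per report), any localisation ability, such as the segmentation results reported in the abstract, must emerge from the learned representation rather than from an explicit local-alignment term.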
Related papers
- Can a Neural Model Guide Fieldwork? A Case Study on Morphological Inflection [3.48094693551887]
Linguistic fieldwork is an important component in language documentation and preservation.
This paper presents a novel model that guides a linguist during fieldwork and accounts for the dynamics of linguist-speaker interactions.
arXiv Detail & Related papers (2024-09-22T23:40:03Z)
- Pushing the Limits of Vision-Language Models in Remote Sensing without Human Annotations [5.065947993017157]
This study introduces an approach to curating vision-language datasets by employing an image-decoding machine learning model.
We amassed approximately 9.6 million vision-language pairs over very-high-resolution (VHR) remote sensing imagery.
The resulting model outperformed counterparts that did not leverage publicly available vision-language datasets.
arXiv Detail & Related papers (2024-09-11T06:36:08Z)
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect aligned medical image-text data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- Hierarchical Text-to-Vision Self Supervised Alignment for Improved Histopathology Representation Learning [64.1316997189396]
We present a novel language-tied self-supervised learning framework, Hierarchical Language-tied Self-Supervision (HLSS) for histopathology images.
Our resulting model achieves state-of-the-art performance on two medical imaging benchmarks, the OpenSRH and TCGA datasets.
arXiv Detail & Related papers (2024-03-21T17:58:56Z)
- Semantic Textual Similarity Assessment in Chest X-ray Reports Using a Domain-Specific Cosine-Based Metric [1.7802147489386628]
We introduce a novel approach designed specifically to assess the semantic similarity between generated medical reports and the ground truth.
Our approach is validated, demonstrating its effectiveness at capturing domain-specific semantic similarity in medical contexts (a minimal sketch of such a cosine-based metric appears after this list).
arXiv Detail & Related papers (2024-02-19T07:48:25Z)
- ECAMP: Entity-centered Context-aware Medical Vision Language Pre-training [21.315060059765894]
We propose a novel framework for entity-centered medical vision-language pre-training.
We distill entity-centered context from medical reports to gain more effective supervision from the text modality.
Our proposed multi-scale context fusion design also improves the semantic integration of both coarse and fine-level image representations.
arXiv Detail & Related papers (2023-12-20T11:00:54Z)
- Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models [0.8878802873945023]
This work presents the first systematic study of transferring vision-language segmentation models (VLSMs) to 2D medical images.
Although VLSMs show competitive performance compared to image-only models for segmentation, not all VLSMs utilize the additional information from language prompts.
arXiv Detail & Related papers (2023-08-15T11:28:21Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- Morphologically Aware Word-Level Translation [82.59379608647147]
We propose a novel morphologically aware probability model for bilingual lexicon induction.
Our model exploits the basic linguistic intuition that the lexeme is the key lexical unit of meaning.
arXiv Detail & Related papers (2020-11-15T17:54:49Z)
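As flagged in the chest X-ray report entry above, here is a minimal sketch of a cosine-based semantic similarity metric between a generated report and its ground-truth reference. It assumes only a generic embedding function: the toy bag-of-words encoder, function names, and example reports are illustrative stand-ins for that paper's domain-specific encoder.

```python
import numpy as np


def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def report_similarity(generated: str, reference: str, embed) -> float:
    """Score a generated report against its ground-truth reference.

    `embed` maps a report string to a fixed-size vector; in that paper's
    setting it would be a domain-specific (radiology) text encoder.
    """
    return cosine_similarity(embed(generated), embed(reference))


if __name__ == "__main__":
    # Hashed bag-of-words embedding as a crude stand-in for a learned encoder.
    def toy_embed(text: str, dim: int = 256) -> np.ndarray:
        vec = np.zeros(dim)
        for token in text.lower().split():
            vec[hash(token) % dim] += 1.0
        return vec

    generated = "no acute cardiopulmonary abnormality"
    reference = "no acute cardiopulmonary process is identified"
    print(f"report similarity: "
          f"{report_similarity(generated, reference, toy_embed):.3f}")
```

The design choice that makes such a metric domain-specific is the encoder, not the cosine itself: swapping the toy embedding for a radiology-pretrained text model is what lets the score reflect clinical rather than surface-level overlap.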
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.