Transferable Visual Words: Exploiting the Semantics of Anatomical
Patterns for Self-supervised Learning
- URL: http://arxiv.org/abs/2102.10680v1
- Date: Sun, 21 Feb 2021 20:44:55 GMT
- Title: Transferable Visual Words: Exploiting the Semantics of Anatomical
Patterns for Self-supervised Learning
- Authors: Fatemeh Haghighi, Mohammad Reza Hosseinzadeh Taher, Zongwei Zhou,
Michael B. Gotway, Jianming Liang
- Abstract summary: "transferable visual words" (TransVW) aims to achieve annotation efficiency for deep learning in medical image analysis.
We show that these visual words can be automatically harvested according to anatomical consistency via self-discovery.
Our experiments demonstrate the annotation efficiency of TransVW by offering higher performance and faster convergence with reduced annotation cost.
- Score: 6.569456721086925
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a new concept called "transferable visual words"
(TransVW), aiming to achieve annotation efficiency for deep learning in medical
image analysis. Medical imaging--focusing on particular parts of the body for
defined clinical purposes--generates images of great similarity in anatomy
across patients and yields sophisticated anatomical patterns across images,
which are associated with rich semantics about human anatomy and which are
natural visual words. We show that these visual words can be automatically
harvested according to anatomical consistency via self-discovery, and that the
self-discovered visual words can serve as strong yet free supervision signals
for deep models to learn semantics-enriched generic image representation via
self-supervision (self-classification and self-restoration). Our extensive
experiments demonstrate the annotation efficiency of TransVW by offering higher
performance and faster convergence with reduced annotation cost in several
applications. Our TransVW has several important advantages, including (1)
TransVW is a fully autodidactic scheme, which exploits the semantics of visual
words for self-supervised learning, requiring no expert annotation; (2) visual
word learning is an add-on strategy, which complements existing self-supervised
methods, boosting their performance; and (3) the learned image representations
are semantics-enriched, which have proven to be more robust and generalizable,
saving annotation effort for a variety of applications through
transfer learning. Our code, pre-trained models, and curated visual words are
available at https://github.com/JLiangLab/TransVW.
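The two self-supervision signals described above, self-classification (predicting which self-discovered anatomical visual word a patch belongs to) and self-restoration (recovering the original patch from a perturbed version), can be combined into a single training objective. Below is a minimal sketch of such a joint loss in PyTorch; the network layout, class count, perturbation, and loss weights are illustrative assumptions, not the authors' released implementation (see the GitHub repository above for the official code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualWordNet(nn.Module):
    """Toy encoder-decoder with a classification head (illustrative only)."""
    def __init__(self, num_visual_words: int = 200):
        super().__init__()
        # Encoder: maps a (perturbed) 3D patch to a latent feature map.
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Self-classification head: which visual word (anatomical pattern) is this?
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(32, num_visual_words),
        )
        # Self-restoration head: reconstruct the original, unperturbed patch.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(32, 16, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose3d(16, 1, 2, stride=2),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.classifier(z), self.decoder(z)

def transvw_loss(model, perturbed, original, word_label, w_cls=1.0, w_rec=1.0):
    """Joint self-classification + self-restoration objective (a sketch)."""
    logits, restored = model(perturbed)
    cls_loss = F.cross_entropy(logits, word_label)    # visual-word classification
    rec_loss = F.mse_loss(restored, original)         # patch restoration
    return w_cls * cls_loss + w_rec * rec_loss

# Example usage with random tensors standing in for self-discovered patches:
model = VisualWordNet(num_visual_words=200)
original = torch.rand(4, 1, 32, 32, 32)                   # batch of 3D patches
perturbed = original + 0.1 * torch.randn_like(original)   # stand-in perturbation
word_label = torch.randint(0, 200, (4,))                  # visual-word class per patch
loss = transvw_loss(model, perturbed, original, word_label)
loss.backward()
```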
Related papers
- Decoding Visual Experience and Mapping Semantics through Whole-Brain Analysis Using fMRI Foundation Models [10.615012396285337]
We develop algorithms to enhance our understanding of visual processes by incorporating whole-brain activation maps.
We first compare our method with state-of-the-art approaches to decoding visual processing and show a 43% improvement in predictive semantic accuracy.
arXiv Detail & Related papers (2024-11-11T16:51:17Z) - Using Multimodal Deep Neural Networks to Disentangle Language from Visual Aesthetics [8.749640179057469]
We use linear decoding over the learned representations of unimodal vision, unimodal language, and multimodal deep neural network (DNN) models to predict human beauty ratings of naturalistic images.
We show that unimodal vision models (e.g. SimCLR) account for the vast majority of explainable variance in these ratings. Language-aligned vision models (e.g. SLIP) yield small gains relative to unimodal vision.
Taken together, these results suggest that whatever words we may eventually find to describe our experience of beauty, the ineffable computations of feedforward perception may provide sufficient foundation for that experience.
arXiv Detail & Related papers (2024-10-31T03:37:21Z) - Autoregressive Sequence Modeling for 3D Medical Image Representation [48.706230961589924]
We introduce a pioneering method for learning 3D medical image representations through an autoregressive sequence pre-training framework.
Our approach orders various 3D medical images based on spatial, contrast, and semantic correlations, treating them as interconnected visual tokens within a token sequence.
arXiv Detail & Related papers (2024-09-13T10:19:10Z) - Hierarchical Text-to-Vision Self Supervised Alignment for Improved Histopathology Representation Learning [64.1316997189396]
We present a novel language-tied self-supervised learning framework, Hierarchical Language-tied Self-Supervision (HLSS) for histopathology images.
Our resulting model achieves state-of-the-art performance on two medical imaging benchmarks, OpenSRH and TCGA datasets.
arXiv Detail & Related papers (2024-03-21T17:58:56Z) - MindGPT: Interpreting What You See with Non-invasive Brain Recordings [24.63828455553959]
We introduce a non-invasive neural decoder, termed as MindGPT, which interprets perceived visual stimuli into natural languages from fMRI signals.
Our experiments show that the generated word sequences truthfully represented the visual information conveyed in the seen stimuli.
arXiv Detail & Related papers (2023-09-27T15:35:20Z) - SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for
Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z) - Multi-Modal Masked Autoencoders for Medical Vision-and-Language
Pre-Training [62.215025958347105]
We propose a self-supervised learning paradigm with multi-modal masked autoencoders.
We learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts.
arXiv Detail & Related papers (2022-09-15T07:26:43Z) - K-LITE: Learning Transferable Visual Models with External Knowledge [242.3887854728843]
K-LITE (Knowledge-augmented Language-Image Training and Evaluation) is a strategy to leverage external knowledge to build transferable visual systems.
In training, it enriches entities in natural language with WordNet and Wiktionary knowledge.
In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts.
arXiv Detail & Related papers (2022-04-20T04:47:01Z) - Learning Semantics-enriched Representation via Self-discovery,
Self-classification, and Self-restoration [12.609383051645887]
We train deep models to learn semantically enriched visual representation by self-discovery, self-classification, and self-restoration of the anatomy underneath medical images.
We examine our Semantic Genesis against all publicly available pre-trained models, obtained by either self-supervision or full supervision, on six distinct target tasks.
Our experiments demonstrate that Semantic Genesis significantly exceeds all of its 3D counterparts as well as the de facto ImageNet-based transfer learning in 2D.
arXiv Detail & Related papers (2020-07-14T10:36:10Z) - Learning Representations by Predicting Bags of Visual Words [55.332200948110895]
Self-supervised representation learning aims to learn convnet-based image representations from unlabeled data.
Inspired by the success of NLP methods in this area, in this work we propose a self-supervised approach based on spatially dense image descriptions.
arXiv Detail & Related papers (2020-02-27T16:45:25Z)