Self-supervised Contrastive Video-Speech Representation Learning for
Ultrasound
- URL: http://arxiv.org/abs/2008.06607v1
- Date: Fri, 14 Aug 2020 23:58:23 GMT
- Title: Self-supervised Contrastive Video-Speech Representation Learning for
Ultrasound
- Authors: Jianbo Jiao, Yifan Cai, Mohammad Alsharid, Lior Drukker, Aris
T. Papageorghiou, and J. Alison Noble
- Abstract summary: In medical imaging, manual annotations can be expensive to acquire and sometimes infeasible to access.
We propose to address the problem of self-supervised representation learning with multi-modal ultrasound video-speech raw data.
- Score: 15.517484333872277
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In medical imaging, manual annotations can be expensive to acquire and
sometimes infeasible to access, making conventional deep learning-based models
difficult to scale. As a result, it would be beneficial if useful
representations could be derived from raw data without the need for manual
annotations. In this paper, we propose to address the problem of
self-supervised representation learning with multi-modal ultrasound
video-speech raw data. For this case, we assume that there is a high
correlation between the ultrasound video and the corresponding narrative speech
audio of the sonographer. In order to learn meaningful representations, the
model needs to identify such correlation and at the same time understand the
underlying anatomical features. We designed a framework to model the
correspondence between video and audio without any kind of human annotations.
Within this framework, we introduce cross-modal contrastive learning and an
affinity-aware self-paced learning scheme to enhance correlation modelling.
Experimental evaluations on multi-modal fetal ultrasound video and audio show
that the proposed approach is able to learn strong representations and
transfers well to downstream tasks of standard plane detection and eye-gaze
prediction.
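To make the cross-modal contrastive objective concrete, the sketch below shows one common way such a loss is implemented: temporally aligned ultrasound clips and speech segments form positive pairs, while all other pairings in the batch act as negatives. This is a minimal illustration under assumptions (a symmetric InfoNCE-style loss, a `temperature` hyperparameter, and generic encoder outputs), not the authors' exact formulation, which further adds an affinity-aware self-paced learning scheme.

```python
# Minimal sketch of a symmetric cross-modal contrastive (InfoNCE-style) loss.
# `video_emb` and `audio_emb` are assumed to come from separate video and audio
# encoders; the temperature value is an assumption, not taken from the paper.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(video_emb: torch.Tensor,
                                 audio_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    # Embeddings of temporally aligned ultrasound clips and speech segments,
    # shape (batch, dim); row i of each tensor is a matched video-audio pair.
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature                      # (batch, batch) similarities
    targets = torch.arange(v.size(0), device=v.device)    # positives on the diagonal
    # Symmetric objective: video-to-audio and audio-to-video retrieval.
    loss_v2a = F.cross_entropy(logits, targets)
    loss_a2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2a + loss_a2v)
```

In practice the two encoders (for example, a 3D CNN for video and a spectrogram CNN for speech) are trained jointly with such a loss, and the learned video encoder is then fine-tuned on downstream tasks such as standard plane detection.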
Related papers
- Sequential Contrastive Audio-Visual Learning [12.848371604063168]
We propose sequential contrastive audio-visual learning (SCAV), which contrasts examples based on their non-aggregated representation space using sequential distances.
Retrieval experiments with the VGGSound and Music datasets demonstrate the effectiveness of SCAV.
We also show that models trained with SCAV exhibit a high degree of flexibility regarding the metric employed for retrieval, allowing them to operate on a spectrum of efficiency-accuracy trade-offs.
arXiv Detail & Related papers (2024-07-08T09:45:20Z) - Show from Tell: Audio-Visual Modelling in Clinical Settings [58.88175583465277]
We consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations without human expert annotation.
A simple yet effective multi-modal self-supervised learning framework is proposed for this purpose.
The proposed approach is able to localise anatomical regions of interest during ultrasound imaging, with only speech audio as a reference.
arXiv Detail & Related papers (2023-10-25T08:55:48Z) - Mapping EEG Signals to Visual Stimuli: A Deep Learning Approach to Match
vs. Mismatch Classification [28.186129896907694]
We propose a "match-vs-mismatch" deep learning model to classify whether a video clip induces excitatory responses in recorded EEG signals.
We demonstrate that the proposed model is able to achieve the highest accuracy on unseen subjects.
These results have the potential to facilitate the development of neural recording-based video reconstruction.
arXiv Detail & Related papers (2023-09-08T06:37:25Z) - SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder.
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z) - Improving Ultrasound Tongue Image Reconstruction from Lip Images Using
Self-supervised Learning and Attention Mechanism [1.52292571922932]
Given an observable image sequence of lips, can we picture the corresponding tongue motion?
We formulate this problem as a self-supervised learning task and employ a two-stream convolutional network and a long short-term memory network, together with an attention mechanism.
The results show that our model is able to generate images close to the real ultrasound tongue images, and achieves a matching between the two imaging modalities.
arXiv Detail & Related papers (2021-06-20T10:51:23Z) - Deep Co-Attention Network for Multi-View Subspace Learning [73.3450258002607]
We propose a deep co-attention network for multi-view subspace learning.
It aims to extract both the common information and the complementary information in an adversarial setting.
In particular, it uses a novel cross reconstruction loss and leverages the label information to guide the construction of the latent representation.
arXiv Detail & Related papers (2021-02-15T18:46:44Z) - Automatic Curation of Large-Scale Datasets for Audio-Visual
Representation Learning [62.47593143542552]
We describe a subset optimization approach for automatic dataset curation.
We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our data, despite being automatically constructed, achieve similar downstream performances to existing video datasets with similar scales.
arXiv Detail & Related papers (2021-01-26T14:27:47Z) - Neuro-Symbolic Representations for Video Captioning: A Case for
Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z) - Speech Prediction in Silent Videos using Variational Autoencoders [29.423462898526605]
We present a model for generating speech in a silent video.
The proposed model combines recurrent neural networks and variational deep generative models to learn the conditional distribution of the auditory features.
We demonstrate the performance of our model on the GRID dataset based on standard benchmarks.
arXiv Detail & Related papers (2020-11-14T17:09:03Z) - Towards Unsupervised Learning for Instrument Segmentation in Robotic
Surgery with Cycle-Consistent Adversarial Networks [54.00217496410142]
We propose an unpaired image-to-image translation approach whose goal is to learn the mapping between an input endoscopic image and a corresponding annotation.
Our approach allows image segmentation models to be trained without acquiring expensive annotations.
We test our proposed method on the Endovis 2017 challenge dataset and show that it is competitive with supervised segmentation methods.
arXiv Detail & Related papers (2020-07-09T01:39:39Z) - Self-supervised Representation Learning for Ultrasound Video [18.515314344284445]
We propose a self-supervised learning approach to learn meaningful and transferable representations from medical imaging video.
We force the model to address anatomy-aware tasks with free supervision from the data itself.
Experiments on fetal ultrasound video show that the proposed approach can effectively learn meaningful and strong representations.
arXiv Detail & Related papers (2020-02-28T23:00:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.