Show from Tell: Audio-Visual Modelling in Clinical Settings
- URL: http://arxiv.org/abs/2310.16477v1
- Date: Wed, 25 Oct 2023 08:55:48 GMT
- Title: Show from Tell: Audio-Visual Modelling in Clinical Settings
- Authors: Jianbo Jiao, Mohammad Alsharid, Lior Drukker, Aris T. Papageorghiou,
Andrew Zisserman, J. Alison Noble
- Abstract summary: We consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations without human expert annotation.
A simple yet effective multi-modal self-supervised learning framework is proposed for this purpose.
The proposed approach is able to localise anatomical regions of interest during ultrasound imaging, with only speech audio as a reference.
- Score: 58.88175583465277
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Auditory and visual signals are usually present together and correlate with each
other, not only in natural environments but also in clinical settings. However,
the audio-visual modelling in the latter case can be more challenging, due to
the different sources of audio/video signals and the noise (both signal-level
and semantic-level) in auditory signals -- usually speech. In this paper, we
consider audio-visual modelling in a clinical setting, providing a solution to
learn medical representations that benefit various clinical tasks, without
human expert annotation. A simple yet effective multi-modal self-supervised
learning framework is proposed for this purpose. The proposed approach is able
to localise anatomical regions of interest during ultrasound imaging, with only
speech audio as a reference. Experimental evaluations on a large-scale clinical
multi-modal ultrasound video dataset show that the proposed self-supervised
method learns good transferable anatomical representations that boost the
performance of automated downstream clinical tasks, even outperforming
fully-supervised solutions.
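To make the abstract's core idea concrete, the sketch below shows one common way to implement multi-modal self-supervised learning of the kind described: co-occurring ultrasound frames and speech segments are encoded separately and aligned with a symmetric InfoNCE (CLIP-style) contrastive loss, so that no expert annotation is required. This is only an illustrative sketch under assumed encoder architectures, embedding dimensions, and loss; it is not the authors' published framework.

```python
# Illustrative sketch of contrastive audio-visual self-supervision (InfoNCE / CLIP-style).
# NOT the paper's exact method; encoders, dimensions, and the symmetric loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameEncoder(nn.Module):
    """Toy CNN mapping an ultrasound frame (1x224x224) to a unit-norm embedding."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
        )
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

class SpeechEncoder(nn.Module):
    """Toy CNN mapping a log-mel spectrogram (1x64xT) to a unit-norm embedding."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
        )
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def contrastive_loss(v, a, temperature=0.07):
    """Symmetric InfoNCE: co-occurring frame/speech pairs are positives,
    all other pairings in the batch serve as negatives."""
    logits = v @ a.t() / temperature               # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    frames = torch.randn(8, 1, 224, 224)           # batch of ultrasound frames
    speech = torch.randn(8, 1, 64, 200)            # co-occurring speech spectrograms
    loss = contrastive_loss(FrameEncoder()(frames), SpeechEncoder()(speech))
    loss.backward()
    print(f"contrastive loss: {loss.item():.3f}")
```

Once trained, a shared embedding space of this kind can in principle be probed for speech-referenced localisation, e.g. by correlating the speech embedding with positions of a spatial visual feature map, which mirrors the paper's stated ability to localise anatomical regions of interest from speech audio alone.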
Related papers
- EchoApex: A General-Purpose Vision Foundation Model for Echocardiography [9.202542805578432]
We introduce EchoApex, the first general-purpose vision foundation model for echocardiography, with applications across a variety of clinical practices.
Leveraging self-supervised learning, EchoApex is pretrained on over 20 million echo images from 11 clinical centres.
Compared to state-of-the-art task-specific models, EchoApex attains improved performance with a unified image encoding architecture.
arXiv Detail & Related papers (2024-10-14T21:10:56Z)
- Unveiling and Mitigating Bias in Audio Visual Segmentation [9.427676046134374]
Community researchers have developed a range of advanced audio-visual segmentation models to improve the quality of sounding objects' masks.
While masks created by these models may initially appear plausible, they occasionally exhibit anomalies with incorrect grounding logic.
We attribute this to inherent real-world preferences and distributions serving as a simpler learning signal than the complex audio-visual grounding.
arXiv Detail & Related papers (2024-07-23T16:55:04Z)
- Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding [53.629132242389716]
Vision-Language Models (VLM) can support clinicians by analyzing medical images and engaging in natural language interactions.
VLMs often exhibit "hallucinogenic" behavior, generating textual outputs not grounded in contextual multimodal information.
We propose a new alignment algorithm that uses symbolic representations of clinical reasoning to ground VLMs in medical knowledge.
arXiv Detail & Related papers (2024-05-29T23:19:28Z)
- Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z)
- Robust Medical Image Classification from Noisy Labeled Data with Global and Local Representation Guided Co-training [73.60883490436956]
We propose a novel collaborative training paradigm with global and local representation learning for robust medical image classification.
We employ the self-ensemble model with a noisy label filter to efficiently select the clean and noisy samples.
We also design a novel global and local representation learning scheme to implicitly regularize the networks to utilize noisy samples.
arXiv Detail & Related papers (2022-05-10T07:50:08Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
- Comparison of Speaker Role Recognition and Speaker Enrollment Protocol for conversational Clinical Interviews [9.728371067160941]
We train end-to-end neural network architectures to adapt to each task and evaluate each approach under the same metric.
Results do not depend on the demographics of the interviewee, highlighting the clinical relevance of our methods.
arXiv Detail & Related papers (2020-10-30T09:07:37Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
More recently, deep learning has been exploited to achieve strong performance on both tasks.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- Self-supervised Contrastive Video-Speech Representation Learning for Ultrasound [15.517484333872277]
In medical imaging, manual annotations can be expensive to acquire and sometimes infeasible to access.
We propose to address the problem of self-supervised representation learning with multi-modal ultrasound video-speech raw data.
arXiv Detail & Related papers (2020-08-14T23:58:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.