Evaluating Picture Description Speech for Dementia Detection using
Image-text Alignment
- URL: http://arxiv.org/abs/2308.07933v1
- Date: Fri, 11 Aug 2023 08:42:37 GMT
- Title: Evaluating Picture Description Speech for Dementia Detection using
Image-text Alignment
- Authors: Youxiang Zhu, Nana Lin, Xiaohui Liang, John A. Batsis, Robert M. Roth,
Brian MacWhinney
- Abstract summary: We propose the first dementia detection models that take both the picture and the description texts as inputs.
We observe the difference between dementia and healthy samples in terms of the text's relevance to the picture and the focused area of the picture.
We propose three advanced models that pre-process the samples based on their relevance to the picture, sub-images, and focused areas.
- Score: 10.008388878255538
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Using picture description speech for dementia detection has been studied for
30 years. Despite the long history, previous models focus on identifying the
differences in speech patterns between healthy subjects and patients with
dementia but do not utilize the picture information directly. In this paper, we
propose the first dementia detection models that take both the picture and the
description texts as inputs and incorporate knowledge from large pre-trained
image-text alignment models. We observe the difference between dementia and
healthy samples in terms of the text's relevance to the picture and the focused
area of the picture. We thus consider that such a difference could be used to
enhance dementia detection accuracy. Specifically, we use the text's relevance
to the picture to rank and filter the sentences of the samples. We also
identify focused areas of the picture as topics and categorize the sentences
according to these areas. We propose three advanced models that
pre-process the samples based on their relevance to the picture, sub-images,
and focused areas. The evaluation results show that our advanced models, with
knowledge of the picture and large image-text alignment models, achieve
state-of-the-art performance, with a best detection accuracy of 83.44%,
higher than the text-only baseline model at 79.91%. Lastly, we visualize the
sample and picture results to explain the advantages of our models.
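
As a rough illustration of the relevance-ranking step described in the abstract, the sketch below scores each transcript sentence against the stimulus picture with CLIP and keeps only the highest-scoring sentences. CLIP, the checkpoint name, the picture path, the example sentences, and the top-k cutoff are illustrative assumptions, not necessarily the paper's exact alignment model or settings.

```python
# Minimal sketch: rank picture-description sentences by image-text relevance,
# using CLIP as a stand-in for a large pre-trained image-text alignment model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

picture = Image.open("cookie_theft.png")       # stimulus picture; path is illustrative
sentences = [
    "The boy is taking cookies from the jar.", # example transcript sentences
    "The water is overflowing in the sink.",
    "I had coffee this morning.",              # low relevance to the picture
]

inputs = processor(text=sentences, images=picture, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (1, num_sentences): image-text similarity scores.
scores = outputs.logits_per_image.squeeze(0)

# Keep the top-k most picture-relevant sentences as the filtered sample.
top_k = 2                                      # cutoff is an assumption, not the paper's value
keep = torch.topk(scores, k=top_k).indices.tolist()
filtered = [sentences[i] for i in sorted(keep)]
print(filtered)
```

The filtered sentences would then be fed to the downstream dementia classifier in place of the full transcript; the actual ranking criterion and threshold used by the authors may differ.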
Related papers
- Words Worth a Thousand Pictures: Measuring and Understanding Perceptual Variability in Text-to-Image Generation [58.77994391566484]
Diffusion models are the state of the art in text-to-image generation, but their perceptual variability remains understudied.
We propose W1KP, a human-calibrated measure of variability in a set of images, bootstrapped from existing image-pair perceptual distances.
Our best perceptual distance outperforms nine baselines by up to 18 points in accuracy, and our calibration matches human judgements 78% of the time.
arXiv Detail & Related papers (2024-06-12T17:59:27Z)
- Information Theoretic Text-to-Image Alignment [49.396917351264655]
We present a novel method that relies on an information-theoretic alignment measure to steer image generation.
Our method is on par with or superior to the state of the art, yet requires nothing but a pre-trained denoising network to estimate mutual information (MI).
arXiv Detail & Related papers (2024-05-31T12:20:02Z)
- Learning from Models and Data for Visual Grounding [55.21937116752679]
We introduce SynGround, a framework that combines data-driven learning and knowledge transfer from various large-scale pretrained models.
We finetune a pretrained vision-and-language model on the resulting synthetic dataset by optimizing a mask-attention objective.
The resulting model improves the grounding capabilities of an off-the-shelf vision-and-language model.
arXiv Detail & Related papers (2024-03-20T17:59:43Z)
- A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation [9.552642210681489]
We show that by relabeling the corpus with a specialized automatic captioning model and training a text-to-image model on the recaptioned dataset, the model benefits substantially across the board.
We analyze various ways to relabel the corpus and provide evidence that this technique, which we call RECAP, both reduces the train-inference discrepancy and provides the model with more information per example.
arXiv Detail & Related papers (2023-10-25T14:10:08Z)
- Text-guided Foundation Model Adaptation for Pathological Image Classification [40.45252665455015]
We propose to Connect Image and Text Embeddings (CITE) to enhance pathological image classification.
CITE injects text insights gained from language models pre-trained on a broad range of biomedical texts, adapting foundation models to pathological image understanding.
arXiv Detail & Related papers (2023-07-27T14:44:56Z)
- Unified Medical Image-Text-Label Contrastive Learning With Continuous Prompt [3.218449686637963]
We propose a unified Image-Text-Label contrastive learning framework based on continuous prompts.
We demonstrate through extensive experiments that the Unified Medical Contrastive Learning framework exhibits excellent performance on several downstream tasks.
arXiv Detail & Related papers (2023-07-12T05:19:10Z)
- Composition and Deformance: Measuring Imageability with a Text-to-Image Model [8.008504325316327]
We propose methods that use generated images to measure the imageability of single English words and connected text.
We find high correlation between the proposed computational measures of imageability and human judgments of individual words.
We discuss possible effects of model training and implications for the study of compositionality in text-to-image models.
arXiv Detail & Related papers (2023-06-05T18:22:23Z)
- Simple Token-Level Confidence Improves Caption Correctness [117.33497608933169]
Token-Level Confidence, or TLC, is a simple yet surprisingly effective method to assess caption correctness.
We fine-tune a vision-language model on image captioning, input an image and a proposed caption to the model, and aggregate token confidences over words or sequences to estimate image-caption consistency (a minimal aggregation sketch appears after this list).
arXiv Detail & Related papers (2023-05-11T17:58:17Z)
- Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that VPD enables faster adaptation to downstream visual perception tasks.
arXiv Detail & Related papers (2023-03-03T18:59:47Z)
- Reading and Writing: Discriminative and Generative Modeling for Self-Supervised Text Recognition [101.60244147302197]
We introduce contrastive learning and masked image modeling to learn discrimination and generation of text images.
Our method outperforms previous self-supervised text recognition methods by 10.2%-20.2% on irregular scene text recognition datasets.
Our proposed text recognizer exceeds previous state-of-the-art text recognition methods by an average of 5.3% on 11 benchmarks, with similar model size.
arXiv Detail & Related papers (2022-07-01T03:50:26Z)
- Using Human Psychophysics to Evaluate Generalization in Scene Text Recognition Models [7.294729862905325]
We characterize two important scene text recognition models by measuring their domains.
These domains specify the ability of readers to generalize to different word lengths, fonts, and amounts of occlusion.
arXiv Detail & Related papers (2020-06-30T19:51:26Z)
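
As a rough sketch of the token-level confidence idea from the "Simple Token-Level Confidence Improves Caption Correctness" entry above, the function below turns per-token probabilities of a proposed caption into a single image-caption consistency score. It assumes the decoder logits and caption token ids have already been obtained from a fine-tuned captioning model under teacher forcing; the mean/min reductions follow the general idea rather than that paper's exact recipe.

```python
# Sketch: aggregate per-token confidences into an image-caption consistency score.
# Assumes `logits` (1, seq_len, vocab) and `caption_ids` (1, seq_len) come from
# scoring the proposed caption with a fine-tuned captioning model (teacher forcing).
import torch
import torch.nn.functional as F

def caption_confidence(logits: torch.Tensor, caption_ids: torch.Tensor,
                       reduce: str = "mean") -> float:
    # Standard causal shift: the logits at position t predict token t+1.
    probs = F.softmax(logits[:, :-1, :], dim=-1)
    targets = caption_ids[:, 1:]
    # Probability the model assigns to each actual caption token.
    token_conf = probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (1, seq_len - 1)
    if reduce == "mean":
        return token_conf.mean().item()
    if reduce == "min":  # a single low-confidence token can flag an inconsistent caption
        return token_conf.min().item()
    raise ValueError(f"unknown reduction: {reduce}")

# Usage (shapes only; values are random placeholders, not real model outputs):
logits = torch.randn(1, 8, 30522)
caption_ids = torch.randint(0, 30522, (1, 8))
print(caption_confidence(logits, caption_ids),
      caption_confidence(logits, caption_ids, reduce="min"))
```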