Evaluating Picture Description Speech for Dementia Detection using
Image-text Alignment
- URL: http://arxiv.org/abs/2308.07933v1
- Date: Fri, 11 Aug 2023 08:42:37 GMT
- Title: Evaluating Picture Description Speech for Dementia Detection using
Image-text Alignment
- Authors: Youxiang Zhu, Nana Lin, Xiaohui Liang, John A. Batsis, Robert M. Roth,
Brian MacWhinney
- Abstract summary: We propose the first dementia detection models that take both the picture and the description texts as inputs.
We observe the difference between dementia and healthy samples in terms of the text's relevance to the picture and the focused area of the picture.
We propose three advanced models that pre-process the samples based on their relevance to the picture, sub-image, and focused areas.
- Score: 10.008388878255538
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Using picture description speech for dementia detection has been studied for
30 years. Despite the long history, previous models focus on identifying the
differences in speech patterns between healthy subjects and patients with
dementia but do not utilize the picture information directly. In this paper, we
propose the first dementia detection models that take both the picture and the
description texts as inputs and incorporate knowledge from large pre-trained
image-text alignment models. We observe the difference between dementia and
healthy samples in terms of the text's relevance to the picture and the focused
area of the picture. We thus consider that such a difference could be used to
enhance dementia detection accuracy. Specifically, we use the text's relevance
to the picture to rank and filter the sentences of the samples. We also
identify focused areas of the picture as topics and categorize the sentences
according to these areas. We propose three advanced models that pre-process
the samples based on their relevance to the picture, sub-image, and focused
areas. The evaluation results show that our advanced models, with knowledge of
the picture and large image-text alignment models, achieve state-of-the-art
performance with the best detection accuracy of 83.44%, higher than the
text-only baseline model at 79.91%. Lastly, we visualize the sample and
picture results to explain the advantages of our models.
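The abstract only outlines the mechanism, so here is a minimal, hypothetical sketch of its core step: scoring each sentence of a description against the stimulus picture with a pre-trained image-text alignment model (CLIP is assumed here) and keeping the most relevant sentences. The model name, file path, and number of kept sentences are illustrative assumptions, not details from the paper.

```python
# Minimal sketch (not the authors' code): rank the sentences of a picture
# description by their relevance to the picture using a pre-trained
# image-text alignment model, then keep the top-scoring sentences.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_sentences(picture: Image.Image, sentences: list[str], keep: int = 5):
    """Return the `keep` sentences most relevant to the picture, with scores."""
    inputs = processor(text=sentences, images=picture, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    scores = out.logits_per_image.squeeze(0)   # one relevance score per sentence
    order = scores.argsort(descending=True)
    return [sentences[int(i)] for i in order[:keep]], scores[order[:keep]]

# Example: a description sample split into sentences (the picture path is a
# placeholder, e.g. the Cookie Theft stimulus commonly used in this task).
picture = Image.open("picture_stimulus.png")
sentences = [
    "The boy is standing on a stool reaching for the cookie jar.",
    "It looks like a nice day outside.",
    "The sink is overflowing while the mother dries a dish.",
]
kept, kept_scores = rank_sentences(picture, sentences, keep=2)
```

Filtering by such scores corresponds to the relevance-based pre-processing described above; the sub-image and focused-area variants would additionally score sentences against regions of the picture rather than the whole image.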
Related papers
- Bridging the Gap: Aligning Text-to-Image Diffusion Models with Specific Feedback [5.415802995586328]
Learning from feedback has been shown to enhance the alignment between text prompts and images in text-to-image diffusion models.
We propose an efficient fine-tuning method with specific reward objectives, consisting of three stages.
Experimental results on this benchmark show that our model outperforms other SOTA methods in both alignment and fidelity.
arXiv Detail & Related papers (2024-11-28T09:56:28Z)
- Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models [16.00576040281808]
We propose a novel framework called Image2Text2Image to evaluate image captioning models.
A high similarity score suggests that the model has produced a faithful textual description, while a low score highlights discrepancies.
Our framework does not rely on human-annotated reference captions, making it a valuable tool for assessing image captioning models.
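As a rough illustration only, below is one plausible instantiation of that image-to-text-to-image loop: caption the source image, regenerate an image from the caption with a text-to-image diffusion model, and score the similarity between the two images. The specific models and the CLIP-space similarity are assumptions, not details from the paper.

```python
# Sketch of an Image2Text2Image-style, label-free evaluation loop (assumed details):
# image -> caption -> regenerated image -> image-image similarity score.
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline
from transformers import (BlipForConditionalGeneration, BlipProcessor,
                          CLIPModel, CLIPProcessor)

blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
sd = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

def image2text2image_score(image: Image.Image) -> float:
    # Image -> text: caption the source image with the captioning model under test.
    cap_inputs = blip_proc(images=image, return_tensors="pt")
    caption = blip_proc.decode(blip.generate(**cap_inputs, max_new_tokens=30)[0],
                               skip_special_tokens=True)
    # Text -> image: regenerate an image from that caption.
    regenerated = sd(caption).images[0]
    # Similarity of source and regenerated images in CLIP embedding space.
    feats = [clip.get_image_features(**clip_proc(images=im, return_tensors="pt"))
             for im in (image, regenerated)]
    return torch.nn.functional.cosine_similarity(feats[0], feats[1]).item()
```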
arXiv Detail & Related papers (2024-11-08T17:07:01Z)
- Words Worth a Thousand Pictures: Measuring and Understanding Perceptual Variability in Text-to-Image Generation [58.77994391566484]
We propose W1KP, a human-calibrated measure of variability in a set of images.
Our best perceptual distance outperforms nine baselines by up to 18 points in accuracy.
We analyze 56 linguistic features of real prompts, finding that the prompt's length, CLIP embedding norm, concreteness, and word senses influence variability most.
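W1KP's human-calibrated measure is not reproduced here; the sketch below only shows the general shape of such a metric, quantifying variability as the mean pairwise perceptual distance over images generated from one prompt, with LPIPS as a stand-in distance.

```python
# Sketch: variability of a prompt measured as the mean pairwise perceptual
# distance across images generated from it (LPIPS is a stand-in distance,
# not W1KP's calibrated measure).
import itertools
import lpips
import torch

dist_fn = lpips.LPIPS(net="alex")

def prompt_variability(images: torch.Tensor) -> float:
    """images: (N, 3, H, W) tensor in [-1, 1], all generated from the same prompt."""
    pairs = list(itertools.combinations(range(images.shape[0]), 2))
    with torch.no_grad():
        dists = [dist_fn(images[i:i + 1], images[j:j + 1]).item() for i, j in pairs]
    return sum(dists) / len(dists)
```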
arXiv Detail & Related papers (2024-06-12T17:59:27Z)
- Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers [120.49126407479717]
This paper explores text-to-image diffusion models for Zero-Shot Sketch-based Image Retrieval (ZS-SBIR).
We highlight a pivotal discovery: the capacity of text-to-image diffusion models to seamlessly bridge the gap between sketches and photos.
arXiv Detail & Related papers (2024-03-12T00:02:03Z)
- A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation [9.552642210681489]
We show that by relabeling the corpus with a specialized automatic captioning model and training a text-to-image model on the recaptioned dataset, the model benefits substantially across the board.
We analyze various ways to relabel the corpus and provide evidence that this technique, which we call RECAP, both reduces the train-inference discrepancy and provides the model with more information per example.
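The recaptioning step itself is mechanically simple; below is a hypothetical sketch of relabeling an image-text corpus with an automatic captioning model before training the text-to-image model. The captioner and record layout are assumptions, not the paper's exact setup.

```python
# Sketch of RECAP-style corpus relabeling (assumed details): replace each
# record's original caption with one produced by an automatic captioning model.
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base")

def recaption(records: list[dict]) -> list[dict]:
    """records: [{'image_path': ..., 'caption': ...}, ...] -> recaptioned copies."""
    relabeled = []
    for rec in records:
        image = Image.open(rec["image_path"]).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        ids = captioner.generate(**inputs, max_new_tokens=40)
        relabeled.append({**rec,
                          "caption": processor.decode(ids[0], skip_special_tokens=True)})
    return relabeled
```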
arXiv Detail & Related papers (2023-10-25T14:10:08Z)
- Unified Medical Image-Text-Label Contrastive Learning With Continuous Prompt [3.218449686637963]
We propose a unified Image-Text-Label contrastive learning framework based on continuous prompts.
We demonstrate through extensive experiments that the Unified Medical Contrastive Learning framework exhibits excellent performance on several downstream tasks.
arXiv Detail & Related papers (2023-07-12T05:19:10Z)
- Simple Token-Level Confidence Improves Caption Correctness [117.33497608933169]
Token-Level Confidence, or TLC, is a simple yet surprisingly effective method to assess caption correctness.
We fine-tune a vision-language model on image captioning, input an image and proposed caption to the model, and aggregate token confidences over words or sequences to estimate image-caption consistency.
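That recipe maps naturally onto a short scoring routine; the sketch below assumes BLIP as the captioning model and mean token probability as one of the aggregation choices the blurb mentions. Neither the model nor the aggregation is necessarily what the paper uses.

```python
# Sketch of token-level confidence (TLC) scoring: feed an image and a proposed
# caption to a captioning model and aggregate the per-token probabilities it
# assigns to that caption as an image-caption consistency score.
import torch
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def tlc_score(image: Image.Image, caption: str) -> float:
    inputs = processor(images=image, text=caption, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits            # (1, T, vocab)
    probs = logits.softmax(dim=-1)
    # Probability assigned to each caption token, given the preceding tokens.
    token_probs = probs[0, :-1].gather(1, inputs.input_ids[0, 1:, None]).squeeze(1)
    return token_probs.mean().item()               # mean token confidence
```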
arXiv Detail & Related papers (2023-05-11T17:58:17Z)
- Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that models can be adapted faster to downstream visual perception tasks using the proposed VPD.
arXiv Detail & Related papers (2023-03-03T18:59:47Z)
- Reading and Writing: Discriminative and Generative Modeling for Self-Supervised Text Recognition [101.60244147302197]
We introduce contrastive learning and masked image modeling to learn discrimination and generation of text images.
Our method outperforms previous self-supervised text recognition methods by 10.2%-20.2% on irregular scene text recognition datasets.
Our proposed text recognizer exceeds previous state-of-the-art text recognition methods by an average of 5.3% on 11 benchmarks, with similar model size.
arXiv Detail & Related papers (2022-07-01T03:50:26Z)
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z)
- Using Human Psychophysics to Evaluate Generalization in Scene Text Recognition Models [7.294729862905325]
We characterize two important scene text recognition models by measuring their domains.
The domains specify the ability of readers to generalize to different word lengths, fonts, and amounts of occlusion.
arXiv Detail & Related papers (2020-06-30T19:51:26Z)