Do LLMs Provide Consistent Answers to Health-Related Questions across Languages?
- URL: http://arxiv.org/abs/2501.14719v1
- Date: Fri, 24 Jan 2025 18:51:26 GMT
- Title: Do LLMs Provide Consistent Answers to Health-Related Questions across Languages?
- Authors: Ipek Baris Schlicht, Zhixue Zhao, Burcu Sayin, Lucie Flek, Paolo Rosso
- Abstract summary: We examine the consistency of responses provided by Large Language Models (LLMs) to health-related questions across English, German, Turkish, and Chinese.
We reveal significant inconsistencies in responses that could spread healthcare misinformation.
Our findings emphasize the need for improved cross-lingual alignment to ensure accurate and equitable healthcare information.
- Abstract: Equitable access to reliable health information is vital for public health, but the quality of online health resources varies by language, raising concerns about inconsistencies in Large Language Models (LLMs) for healthcare. In this study, we examine the consistency of responses provided by LLMs to health-related questions across English, German, Turkish, and Chinese. We largely expand the HealthFC dataset by categorizing health-related questions by disease type and broadening its multilingual scope with Turkish and Chinese translations. We reveal significant inconsistencies in responses that could spread healthcare misinformation. Our main contributions are 1) a multilingual health-related inquiry dataset with meta-information on disease categories, and 2) a novel prompt-based evaluation workflow that enables sub-dimensional comparisons between two languages through parsing. Our findings highlight key challenges in deploying LLM-based tools in multilingual contexts and emphasize the need for improved cross-lingual alignment to ensure accurate and equitable healthcare information.
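The abstract describes the prompt-based evaluation workflow only at a high level. As a rough, assumption-laden illustration (not the authors' actual pipeline), the Python sketch below poses the same health question in two languages, parses each answer into a coarse verdict, and flags disagreement; `ask_llm`, the one-word-verdict prompt, and the keyword parser are all hypothetical placeholders.

```python
# Minimal sketch of a prompt-based cross-lingual consistency check.
# Hypothetical names throughout; replace ask_llm with a real chat-completion call.
import re

VERDICTS = {"yes", "no", "uncertain"}

def ask_llm(prompt: str) -> str:
    """Placeholder for an LLM call; wire this to your provider's client."""
    raise NotImplementedError("replace with an actual chat-completion API call")

def parse_verdict(answer: str) -> str:
    """Map a free-text answer onto a coarse verdict via keyword parsing."""
    first_line = answer.strip().lower().splitlines()[0] if answer.strip() else ""
    match = re.search(r"\b(yes|no|uncertain)\b", first_line)
    return match.group(1) if match else "unparsed"

def consistency(question_by_lang: dict) -> dict:
    """Ask the same question in each language and compare parsed verdicts."""
    template = ("Answer the following health question. Start your reply with "
                "exactly one word: yes, no, or uncertain.\n\nQuestion: {q}")
    verdicts = {lang: parse_verdict(ask_llm(template.format(q=q)))
                for lang, q in question_by_lang.items()}
    # Unparsed answers are ignored here; a stricter check could count them as conflicts.
    agree = len({v for v in verdicts.values() if v in VERDICTS}) <= 1
    return {"verdicts": verdicts, "consistent": agree}

# Example with hypothetical translations of the same HealthFC-style question:
# consistency({"en": "Does vitamin C prevent the common cold?",
#              "de": "Beugt Vitamin C einer Erkältung vor?"})
```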
Related papers
- Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment [108.55277188617035]
Large language models (LLMs) have been widely adopted in various downstream task domains, but their ability to directly recall and apply factual medical knowledge remains under-explored.
Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities.
We introduce the Medical Knowledge Judgment, a dataset specifically designed to measure LLMs' one-hop factual medical knowledge.
arXiv Detail & Related papers (2025-02-20T05:27:51Z) - A Survey of Medical Vision-and-Language Applications and Their Techniques [48.268198631277315]
Medical vision-and-language models (MVLMs) have attracted substantial interest due to their capability to offer a natural language interface for interpreting complex medical data.
Here, we provide a comprehensive overview of MVLMs and the various medical tasks to which they have been applied.
We also examine the datasets used for these tasks and compare the performance of different models based on standardized evaluation metrics.
arXiv Detail & Related papers (2024-11-19T03:27:05Z) - HealthQ: Unveiling Questioning Capabilities of LLM Chains in Healthcare Conversations [23.09755446991835]
In digital healthcare, large language models (LLMs) have primarily been utilized to enhance question-answering capabilities.
This paper presents HealthQ, a novel framework designed to evaluate the questioning capabilities of LLM healthcare chains.
arXiv Detail & Related papers (2024-09-28T23:59:46Z) - Severity Prediction in Mental Health: LLM-based Creation, Analysis, Evaluation of a Novel Multilingual Dataset [3.4146360486107987]
Large Language Models (LLMs) are increasingly integrated into various medical fields, including mental health support systems.
We present a novel multilingual adaptation of widely-used mental health datasets, translated from English into six languages.
This dataset enables a comprehensive evaluation of LLM performance in detecting mental health conditions and assessing their severity across multiple languages.
arXiv Detail & Related papers (2024-09-25T22:14:34Z) - CHBench: A Chinese Dataset for Evaluating Health in Large Language Models [19.209493319541693]
We present CHBench, the first comprehensive Chinese Health-related Benchmark.
CHBench includes 6,493 entries related to mental health and 2,999 entries focused on physical health.
This dataset serves as a foundation for evaluating Chinese LLMs' capacity to comprehend and generate accurate health-related information.
arXiv Detail & Related papers (2024-09-24T05:44:46Z) - A Survey on Large Language Models with Multilingualism: Recent Advances and New Frontiers [51.8203871494146]
The rapid development of Large Language Models (LLMs) has brought remarkable multilingual capabilities to natural language processing.
Despite these breakthroughs, investigation of multilingual scenarios remains insufficient.
This survey aims to help the research community address multilingual problems and provide a comprehensive understanding of the core concepts, key techniques, and latest developments in multilingual natural language processing based on LLMs.
arXiv Detail & Related papers (2024-05-17T17:47:39Z) - Google Translate Error Analysis for Mental Healthcare Information: Evaluating Accuracy, Comprehensibility, and Implications for Multilingual Healthcare Communication [8.178490288773013]
This study explores the use of Google Translate (GT) for translating mental healthcare (MHealth) information from English to Persian, Arabic, Turkish, Romanian, and Spanish.
Native speakers of the target languages manually assessed the GT translations, focusing on medical terminology accuracy, comprehensibility, and critical syntactic/semantic errors.
GT output analysis revealed challenges in accurately translating medical terminology, particularly in Arabic, Romanian, and Persian.
arXiv Detail & Related papers (2024-02-06T14:16:32Z) - Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries [31.82249599013959]
Large language models (LLMs) are transforming the ways the general public accesses and consumes information.
LLMs demonstrate impressive language understanding and generation proficiencies, but concerns regarding their safety remain paramount.
It remains unclear how these LLMs perform in the context of non-English languages.
arXiv Detail & Related papers (2023-10-19T20:02:40Z) - AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - Enhancing Answer Boundary Detection for Multilingual Machine Reading Comprehension [86.1617182312817]
We propose two auxiliary tasks in the fine-tuning stage to create additional phrase boundary supervision.
One is a mixed machine reading comprehension task that translates the question or passage into other languages to build cross-lingual question-passage pairs.
The other is a language-agnostic knowledge masking task that leverages knowledge phrases mined from the web.
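As an illustration only, not the paper's implementation, one way to construct such cross-lingual question-passage pairs is to translate either the question or the passage into a randomly chosen target language; `translate` below is a hypothetical machine-translation hook, and re-locating the answer span after translating the passage is deliberately left as a caveat.

```python
# Hedged sketch of cross-lingual question-passage pair construction.
import random

def translate(text: str, target_lang: str) -> str:
    """Hypothetical machine-translation hook; plug in any MT model or service."""
    raise NotImplementedError("replace with a real translation call")

def build_cross_lingual_pair(question: str, passage: str, langs: list) -> dict:
    """Translate either the question or the passage into a random target
    language so the resulting pair mixes two languages."""
    target = random.choice(langs)
    if random.random() < 0.5:
        question = translate(question, target)  # passage and answer span stay untouched
    else:
        # Translating the passage would also require re-locating the answer span,
        # which is exactly the boundary-supervision problem the paper targets.
        passage = translate(passage, target)
    return {"question": question, "passage": passage, "target_lang": target}
```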
arXiv Detail & Related papers (2020-04-29T10:44:00Z) - Self-Attention with Cross-Lingual Position Representation [112.05807284056337]
Position encoding (PE) is used to preserve word-order information in natural language processing tasks by generating fixed position indices for input sequences.
Because word order diverges across languages, modeling cross-lingual positional relationships may help self-attention networks (SANs) handle this divergence.
We augment SANs with cross-lingual position representations to model the bilingually aware latent structure of the input sentence.
arXiv Detail & Related papers (2020-04-28T05:23:43Z)