Related papers: Multi-OphthaLingua: A Multilingual Benchmark for Assessing and Debiasing LLM Ophthalmological QA in LMICs

Multi-OphthaLingua: A Multilingual Benchmark for Assessing and Debiasing LLM Ophthalmological QA in LMICs

URL: http://arxiv.org/abs/2412.14304v1
Date: Wed, 18 Dec 2024 20:18:03 GMT
Title: Multi-OphthaLingua: A Multilingual Benchmark for Assessing and Debiasing LLM Ophthalmological QA in LMICs
Authors: David Restrepo, Chenwei Wu, Zhengxu Tang, Zitao Shuai, Thao Nguyen Minh Phan, Jun-En Ding, Cong-Tinh Dao, Jack Gallifant, Robyn Gayle Dychiao, Jose Carlo Artiaga, André Hiroshi Bando, Carolina Pelegrini Barbosa Gracitelli, Vincenz Ferrer, Leo Anthony Celi, Danielle Bitterman, Michael G Morley, Luis Filipe Nakayama,
Abstract summary: Large language models (LLMs) present a promising solution to automate various ophthalmology procedures.<n>LLMs have demonstrated significantly varied performance across different languages in natural language question-answering tasks.<n>This study introduces the first multilingual ophthalmological question-answering benchmark with manually curated questions parallel across languages.
Score: 3.1894617416005855
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Current ophthalmology clinical workflows are plagued by over-referrals, long waits, and complex and heterogeneous medical records. Large language models (LLMs) present a promising solution to automate various procedures such as triaging, preliminary tests like visual acuity assessment, and report summaries. However, LLMs have demonstrated significantly varied performance across different languages in natural language question-answering tasks, potentially exacerbating healthcare disparities in Low and Middle-Income Countries (LMICs). This study introduces the first multilingual ophthalmological question-answering benchmark with manually curated questions parallel across languages, allowing for direct cross-lingual comparisons. Our evaluation of 6 popular LLMs across 7 different languages reveals substantial bias across different languages, highlighting risks for clinical deployment of LLMs in LMICs. Existing debiasing methods such as Translation Chain-of-Thought or Retrieval-augmented generation (RAG) by themselves fall short of closing this performance gap, often failing to improve performance across all languages and lacking specificity for the medical domain. To address this issue, We propose CLARA (Cross-Lingual Reflective Agentic system), a novel inference time de-biasing method leveraging retrieval augmented generation and self-verification. Our approach not only improves performance across all languages but also significantly reduces the multilingual bias gap, facilitating equitable LLM application across the globe.

Related papers

Cross-Lingual Empirical Evaluation of Large Language Models for Arabic Medical Tasks [12.886024273517556]
Large Language Models (LLMs) have become widely used in medical applications, such as clinical decision support, medical education, and medical question answering.<n>Yet, these models are often English-centric, limiting their robustness and reliability for linguistically diverse communities.<n>Recent work has highlighted discrepancies in performance in low-resource languages for various medical tasks, but the underlying causes remain poorly understood.
arXiv Detail & Related papers (2026-02-05T06:52:46Z)
The Emergence of Abstract Thought in Large Language Models Beyond Any Language [95.50197866832772]
Large language models (LLMs) function effectively across a diverse range of languages.<n>Preliminary studies observe that the hidden activations of LLMs often resemble English, even when responding to non-English prompts.<n>Recent results show strong multilingual performance, even surpassing English performance on specific tasks in other languages.
arXiv Detail & Related papers (2025-06-11T16:00:54Z)
Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models [55.14276067678253]
This paper introduces a novel methodology for efficiently identifying inherent cross-lingual weaknesses in Large Language Models (LLMs)<n>We construct a new dataset of over 6,000 bilingual pairs across 16 languages using this methodology, demonstrating its effectiveness in revealing weaknesses even in state-of-the-art models.<n>Further experiments investigate the relationship between linguistic similarity and cross-lingual weaknesses, revealing that linguistically related languages share similar performance patterns.
arXiv Detail & Related papers (2025-05-24T12:31:27Z)
Exploring the Multilingual NLG Evaluation Abilities of LLM-Based Evaluators [38.681443695708786]
This study provides a comprehensive analysis of the multilingual evaluation performance of 10 recent LLMs. We found that excluding the reference answer from the prompt leads to better performance across various languages. Most LLM-based evaluators show a higher correlation with human judgments in high-resource languages than in low-resource languages.
arXiv Detail & Related papers (2025-03-06T12:04:29Z)
Bridging Language Barriers in Healthcare: A Study on Arabic LLMs [1.2006896500048552]
This paper investigates the challenges of developing large language models proficient in both multilingual understanding and medical knowledge. We find that larger models with carefully calibrated language ratios achieve superior performance on native-language clinical tasks.
arXiv Detail & Related papers (2025-01-16T20:24:56Z)
Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization [108.6908427615402]
Cross-lingual summarization ( CLS) aims to generate a summary for the source text in a different target language. Currently, instruction-tuned large language models (LLMs) excel at various English tasks. Recent studies have shown that LLMs' performance on CLS tasks remains unsatisfactory even with few-shot settings.
arXiv Detail & Related papers (2024-10-26T00:39:44Z)
Severity Prediction in Mental Health: LLM-based Creation, Analysis, Evaluation of a Novel Multilingual Dataset [3.4146360486107987]
Large Language Models (LLMs) are increasingly integrated into various medical fields, including mental health support systems. We present a novel multilingual adaptation of widely-used mental health datasets, translated from English into six languages. This dataset enables a comprehensive evaluation of LLM performance in detecting mental health conditions and assessing their severity across multiple languages.
arXiv Detail & Related papers (2024-09-25T22:14:34Z)
Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora. But can these models relate corresponding concepts across languages, effectively being crosslingual? This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
Getting More from Less: Large Language Models are Good Spontaneous Multilingual Learners [67.85635044939836]
Large Language Models (LLMs) have shown impressive language capabilities. In this work, we investigate the spontaneous multilingual alignment improvement of LLMs. We find that LLMs instruction-tuned on the question translation data (i.e. without annotated answers) are able to encourage the alignment between English and a wide range of languages.
arXiv Detail & Related papers (2024-05-22T16:46:19Z)
How do Large Language Models Handle Multilingualism? [81.15060972112563]
This study explores how large language models (LLMs) handle multilingualism. LLMs initially understand the query, converting multilingual inputs into English for task-solving. In the intermediate layers, they employ English for thinking and incorporate multilingual knowledge with self-attention and feed-forward structures.
arXiv Detail & Related papers (2024-02-29T02:55:26Z)
Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models [117.20416338476856]
Large language models (LLMs) demonstrate remarkable multilingual capabilities without being pre-trained on specially curated multilingual parallel corpora. We propose a novel detection method, language activation probability entropy (LAPE), to identify language-specific neurons within LLMs. Our findings indicate that LLMs' proficiency in processing a particular language is predominantly due to a small subset of neurons.
arXiv Detail & Related papers (2024-02-26T09:36:05Z)
Zero-Shot Cross-Lingual Reranking with Large Language Models for Low-Resource Languages [51.301942056881146]
We investigate how large language models (LLMs) function as rerankers in cross-lingual information retrieval systems for African languages. Our implementation covers English and four African languages (Hausa, Somali, Swahili, and Yoruba) We examine cross-lingual reranking with queries in English and passages in the African languages.
arXiv Detail & Related papers (2023-12-26T18:38:54Z)
PromptCBLUE: A Chinese Prompt Tuning Benchmark for the Medical Domain [24.411904114158673]
We re-build the Chinese Biomedical Language Understanding Evaluation (CBlue) benchmark into a large scale prompt-tuning benchmark, PromptCBlue. Our benchmark is a suitable test-bed and an online platform for evaluating Chinese LLMs' multi-task capabilities on a wide range bio-medical tasks.
arXiv Detail & Related papers (2023-10-22T02:20:38Z)
Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries [31.82249599013959]
Large language models (LLMs) are transforming the ways the general public accesses and consumes information. LLMs demonstrate impressive language understanding and generation proficiencies, but concerns regarding their safety remain paramount. It remains unclear how these LLMs perform in the context of non-English languages.
arXiv Detail & Related papers (2023-10-19T20:02:40Z)
Don't Trust ChatGPT when Your Question is not in English: A Study of Multilingual Abilities and Types of LLMs [16.770697902481107]
Large Language Models (LLMs) have demonstrated exceptional natural language understanding abilities. We propose a systematic way of qualifying the performance disparities of LLMs under multilingual settings. The results show that GPT exhibits highly translating-like behaviour in multilingual settings.
arXiv Detail & Related papers (2023-05-24T02:05:03Z)
Not All Languages Are Created Equal in LLMs: Improving Multilingual Capability by Cross-Lingual-Thought Prompting [123.16452714740106]
Large language models (LLMs) demonstrate impressive multilingual capability, but their performance varies substantially across different languages. We introduce a simple yet effective method, called cross-lingual-thought prompting (XLT) XLT is a generic template prompt that stimulates cross-lingual and logical reasoning skills to enhance task performance across languages.
arXiv Detail & Related papers (2023-05-11T17:44:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.