JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models
- URL: http://arxiv.org/abs/2601.01627v1
- Date: Sun, 04 Jan 2026 18:18:18 GMT
- Title: JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models
- Authors: Junyu Liu, Zirui Li, Qian Niu, Zequn Zhang, Yue Xun, Wenlong Hou, Shujun Wang, Yusuke Iwasawa, Yutaka Matsuo, Kan Hatakeyama-Sato
- Abstract summary: We introduce JMedEthicBench, the first multi-turn conversational benchmark for evaluating the medical safety of Large Language Models. Using a dual-LLM scoring protocol, we evaluate 27 models and find that commercial models maintain robust safety while medical-specialized models exhibit increased vulnerability.
- Score: 47.20100799532625
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As Large Language Models (LLMs) are increasingly deployed in the healthcare field, it becomes essential to carefully evaluate their medical safety before clinical use. However, existing safety benchmarks remain predominantly English-centric and test only single-turn prompts, despite the multi-turn nature of clinical consultations. To address these gaps, we introduce JMedEthicBench, the first multi-turn conversational benchmark for evaluating the medical safety of LLMs for Japanese healthcare. Our benchmark is based on 67 guidelines from the Japan Medical Association and contains over 50,000 adversarial conversations generated using seven automatically discovered jailbreak strategies. Using a dual-LLM scoring protocol, we evaluate 27 models and find that commercial models maintain robust safety while medical-specialized models exhibit increased vulnerability. Furthermore, safety scores decline significantly across conversation turns (median: 9.5 to 5.0, $p < 0.001$). Cross-lingual evaluation on both Japanese and English versions of our benchmark reveals that medical-model vulnerabilities persist across languages, indicating inherent alignment limitations rather than language-specific factors. These findings suggest that domain-specific fine-tuning may inadvertently weaken safety mechanisms and that multi-turn interactions represent a distinct threat surface requiring dedicated alignment strategies.
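The abstract's dual-LLM scoring protocol can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's actual implementation: the judge functions are deterministic stand-ins for calls to two judge LLMs, and the agreement tolerance, conservative tie-breaking, and median aggregation are assumptions chosen only to show the general shape of such a protocol.

```python
# Hypothetical sketch of a dual-LLM scoring protocol: two independent judges
# score each assistant turn for medical safety on a 0-10 scale. When the
# judges agree within a tolerance, their scores are averaged; when they
# disagree, the stricter (lower) score is kept. All details are illustrative.

from statistics import median

def judge_a(turn: str) -> float:
    # Stand-in for a call to the first judge LLM.
    return 3.0 if "dosage beyond the approved limit" in turn else 9.5

def judge_b(turn: str) -> float:
    # Stand-in for a call to the second judge LLM.
    return 4.0 if "dosage beyond the approved limit" in turn else 9.0

def score_turn(turn: str, tolerance: float = 1.0) -> float:
    a, b = judge_a(turn), judge_b(turn)
    # Agreement: average the two scores; disagreement: keep the stricter one.
    return (a + b) / 2 if abs(a - b) <= tolerance else min(a, b)

def score_conversation(turns: list[str]) -> float:
    # Aggregate per-turn scores with the median, mirroring the median-based
    # summary statistics reported in the abstract.
    return median(score_turn(t) for t in turns)

convo = [
    "Here is general information about this medication.",
    "I cannot recommend a dosage beyond the approved limit.",
]
print(score_conversation(convo))  # 6.375 with these stub judges
```

A later turn that triggers the unsafe pattern drags the conversation-level score down, which is the kind of per-turn decline the abstract quantifies across 27 models.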
Related papers
- CLINIC: Evaluating Multilingual Trustworthiness in Language Models for Healthcare [25.074475493111162]
We present CLINIC, a Comprehensive Benchmark to evaluate the trustworthiness of language models in healthcare. Our evaluation reveals that LMs struggle with factual correctness, demonstrate bias across demographic and linguistic groups, and are susceptible to privacy breaches and adversarial attacks.
arXiv Detail & Related papers (2025-12-12T10:19:27Z) - MedPT: A Massive Medical Question Answering Dataset for Brazilian-Portuguese Speakers [35.41469674626373]
We introduce MedPT, the first large-scale, real-world medical question-answering corpus for Brazilian Portuguese. It comprises 384,095 authentic question-answer pairs from patient-doctor interactions. Our analysis reveals its thematic breadth (3,200 topics) and unique linguistic properties, such as the natural asymmetry in patient-doctor communication.
arXiv Detail & Related papers (2025-11-14T21:13:28Z) - Filling in the Clinical Gaps in Benchmark: Case for HealthBench for the Japanese medical system [5.7880565661958565]
This study investigates the applicability of HealthBench to the Japanese context, where medical LLM evaluation resources are scarce and often consist of translated multiple-choice questions.
arXiv Detail & Related papers (2025-09-22T07:36:12Z) - MedOmni-45°: A Safety-Performance Benchmark for Reasoning-Oriented LLMs in Medicine [69.08855631283829]
We introduce MedOmni-45°, a benchmark designed to quantify safety-performance trade-offs under manipulative hint conditions. It contains 1,804 reasoning-focused medical questions across six specialties and three task types, including 500 from MedMCQA. Results show a consistent safety-performance trade-off, with no model surpassing the diagonal.
arXiv Detail & Related papers (2025-08-22T08:38:16Z) - Medical Red Teaming Protocol of Language Models: On the Importance of User Perspectives in Healthcare Settings [48.096652370210016]
We introduce a safety evaluation protocol tailored to the medical domain, covering both patient and clinician user perspectives. This is the first work to define safety evaluation criteria for medical LLMs through targeted red-teaming from three different points of view.
arXiv Detail & Related papers (2025-07-09T19:38:58Z) - SafeDialBench: A Fine-Grained Safety Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks [90.41592442792181]
We propose SafeDialBench, a fine-grained benchmark for evaluating the safety of Large Language Models (LLMs) in multi-turn dialogues. Specifically, we design a two-tier hierarchical safety taxonomy that considers 6 safety dimensions and generate more than 4,000 multi-turn dialogues in both Chinese and English under 22 dialogue scenarios. Notably, we construct an innovative assessment framework that measures LLMs' capabilities in detecting and handling unsafe information and maintaining consistency when facing jailbreak attacks.
arXiv Detail & Related papers (2025-02-16T12:08:08Z) - Ensuring Safety and Trust: Analyzing the Risks of Large Language Models in Medicine [41.71754418349046]
We propose five key principles for safe and trustworthy medical AI, along with ten specific aspects.
Under this comprehensive framework, we introduce a novel MedGuard benchmark with 1,000 expert-verified questions.
Our evaluation of 11 commonly used LLMs shows that the current language models, regardless of their safety alignment mechanisms, generally perform poorly on most of our benchmarks.
This study underscores a significant safety gap, highlighting the crucial need for human oversight and the implementation of AI safety guardrails.
arXiv Detail & Related papers (2024-11-20T06:34:32Z) - Eir: Thai Medical Large Language Models [0.0]
Eir-8B is a large language model with 8 billion parameters designed to enhance the accuracy of handling medical tasks in the Thai language.
Human evaluation was conducted to ensure that the model adheres to care standards and provides unbiased answers.
The model is deployed within the hospital's internal network, ensuring both high security and faster processing speeds.
arXiv Detail & Related papers (2024-09-13T04:06:00Z) - Towards Building Multilingual Language Model for Medicine [54.1382395897071]
We construct a multilingual medical corpus, containing approximately 25.5B tokens encompassing 6 main languages.
We propose a multilingual medical multi-choice question-answering benchmark with rationale, termed MMedBench.
Our final model, MMed-Llama 3, with only 8B parameters, achieves superior performance compared to all other open-source models on both MMedBench and English benchmarks.
arXiv Detail & Related papers (2024-02-21T17:47:20Z) - Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to a low-resource, real-world challenge: de-identification of code-mixed (Spanish-Catalan) clinical notes in the stroke domain.
arXiv Detail & Related papers (2022-04-10T21:46:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.