Related papers: Performance Evaluation of Large Language Models in Bangla Consumer Health Query Summarization

Performance Evaluation of Large Language Models in Bangla Consumer Health Query Summarization

URL: http://arxiv.org/abs/2505.05070v1
Date: Thu, 08 May 2025 09:06:28 GMT
Title: Performance Evaluation of Large Language Models in Bangla Consumer Health Query Summarization
Authors: Ajwad Abrar, Farzana Tabassum, Sabbir Ahmed,
Abstract summary: This study investigates the zero-shot performance of nine advanced large language models (LLMs)<n>We benchmarked these LLMs using ROUGE metrics against Bangla T5, a fine-tuned state-of-the-art model.<n>The results demonstrate that zero-shot LLMs can rival fine-tuned models, achieving high-quality summaries even without task-specific training.
Score: 1.2289361708127877
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Consumer Health Queries (CHQs) in Bengali (Bangla), a low-resource language, often contain extraneous details, complicating efficient medical responses. This study investigates the zero-shot performance of nine advanced large language models (LLMs): GPT-3.5-Turbo, GPT-4, Claude-3.5-Sonnet, Llama3-70b-Instruct, Mixtral-8x22b-Instruct, Gemini-1.5-Pro, Qwen2-72b-Instruct, Gemma-2-27b, and Athene-70B, in summarizing Bangla CHQs. Using the BanglaCHQ-Summ dataset comprising 2,350 annotated query-summary pairs, we benchmarked these LLMs using ROUGE metrics against Bangla T5, a fine-tuned state-of-the-art model. Mixtral-8x22b-Instruct emerged as the top performing model in ROUGE-1 and ROUGE-L, while Bangla T5 excelled in ROUGE-2. The results demonstrate that zero-shot LLMs can rival fine-tuned models, achieving high-quality summaries even without task-specific training. This work underscores the potential of LLMs in addressing challenges in low-resource languages, providing scalable solutions for healthcare query summarization.

Related papers

An Empirical Study of Many-to-Many Summarization with Large Language Models [82.10000188179168]
Large language models (LLMs) have shown strong multi-lingual abilities, giving them the potential to perform Many-to-many summarization (M2MS) in real applications.<n>This work presents a systematic empirical study on LLMs' M2MS ability.
arXiv Detail & Related papers (2025-05-19T11:18:54Z)
Empowering Smaller Models: Tuning LLaMA and Gemma with Chain-of-Thought for Ukrainian Exam Tasks [0.0]
Small or compact models, though more efficient, often lack sufficient support for underrepresented languages.<n>This work explores the potential of parameter-efficient fine-tuning of compact open-weight language models to handle reasoning-intensive tasks.<n> tuning method with joint task topic and step-by-step solution generation outperforms standard chain-of-thought tuning in matching tasks.
arXiv Detail & Related papers (2025-03-18T07:44:49Z)
TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking [6.070192392563392]
We present TituLLMs, the first large pretrained Bangla LLMs, available in 1b and 3b parameter sizes.<n>To train TituLLMs, we collected a pretraining dataset of approximately 37 billion tokens.<n>We extended the Llama-3.2 tokenizer to incorporate language- and culture-specific knowledge.
arXiv Detail & Related papers (2025-02-16T16:22:23Z)
Generalists vs. Specialists: Evaluating Large Language Models for Urdu [4.8539869147159616]
We compare general-purpose models, GPT-4-Turbo and Llama-3-8b, with special-purpose models--XLM-Roberta-large, mT5-large, and Llama-3-8b--that have been fine-tuned on specific tasks. We focus on seven classification and seven generation tasks to evaluate the performance of these models on Urdu language.
arXiv Detail & Related papers (2024-07-05T12:09:40Z)
DataComp-LM: In search of the next generation of training sets for language models [200.5293181577585]
DataComp for Language Models (DCLM) is a testbed for controlled dataset experiments with the goal of improving language models.<n>We provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations.<n>Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters.
arXiv Detail & Related papers (2024-06-17T17:42:57Z)
Benchmarking the Performance of Pre-trained LLMs across Urdu NLP Tasks [0.9786690381850356]
This study presents in-depth examination of 7 prominent Large Language Models (LLMs) across 17 tasks using 22 datasets, 13.8 hours of speech, in a zero-shot setting, and their performance against state-of-the-art (SOTA) models.<n>Our results emphasize that models with fewer parameters but richer language-specific data, like Llama 3.1-8B, often outperform larger models with lower language diversity, such as GPT-3.5, in several tasks.
arXiv Detail & Related papers (2024-05-24T11:30:37Z)
TriSum: Learning Summarization Ability from Large Language Models with Structured Rationale [66.01943465390548]
We introduce TriSum, a framework for distilling large language models' text summarization abilities into a compact, local model. Our method enhances local model performance on various benchmarks. It also improves interpretability by providing insights into the summarization rationale.
arXiv Detail & Related papers (2024-03-15T14:36:38Z)
YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters. YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline. The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation [136.7876524839751]
Large Language Models (LLMs) are frequently used for multi-faceted language generation and evaluation tasks. We propose Branch-Merge (BSM), a Large Language Model program (Schlag et al., 2023) for tackling such challenging natural language tasks. BSM improves the evaluation correctness and consistency for each LLM by enhancing human-LLM agreement by up to 26%.
arXiv Detail & Related papers (2023-10-23T17:29:48Z)
Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting [65.00288634420812]
Pairwise Ranking Prompting (PRP) is a technique to significantly reduce the burden on Large Language Models (LLMs) Our results are the first in the literature to achieve state-of-the-art ranking performance on standard benchmarks using moderate-sized open-sourced LLMs.
arXiv Detail & Related papers (2023-06-30T11:32:25Z)
BanglaNLG: Benchmarks and Resources for Evaluating Low-Resource Natural Language Generation in Bangla [21.47743471497797]
This work presents a benchmark for evaluating natural language generation models in Bangla. We aggregate three challenging conditional text generation tasks under the BanglaNLG benchmark. Using a clean corpus of 27.5 GB of Bangla data, we pretrain BanglaT5, a sequence-to-sequence Transformer model for Bangla. BanglaT5 achieves state-of-the-art performance in all of these tasks, outperforming mT5 (base) by up to 5.4%.
arXiv Detail & Related papers (2022-05-23T06:54:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.