Performance Evaluation of Large Language Models in Bangla Consumer Health Query Summarization
- URL: http://arxiv.org/abs/2505.05070v1
- Date: Thu, 08 May 2025 09:06:28 GMT
- Title: Performance Evaluation of Large Language Models in Bangla Consumer Health Query Summarization
- Authors: Ajwad Abrar, Farzana Tabassum, Sabbir Ahmed,
- Abstract summary: This study investigates the zero-shot performance of nine advanced large language models (LLMs)<n>We benchmarked these LLMs using ROUGE metrics against Bangla T5, a fine-tuned state-of-the-art model.<n>The results demonstrate that zero-shot LLMs can rival fine-tuned models, achieving high-quality summaries even without task-specific training.
- Score: 1.2289361708127877
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Consumer Health Queries (CHQs) in Bengali (Bangla), a low-resource language, often contain extraneous details, complicating efficient medical responses. This study investigates the zero-shot performance of nine advanced large language models (LLMs): GPT-3.5-Turbo, GPT-4, Claude-3.5-Sonnet, Llama3-70b-Instruct, Mixtral-8x22b-Instruct, Gemini-1.5-Pro, Qwen2-72b-Instruct, Gemma-2-27b, and Athene-70B, in summarizing Bangla CHQs. Using the BanglaCHQ-Summ dataset comprising 2,350 annotated query-summary pairs, we benchmarked these LLMs using ROUGE metrics against Bangla T5, a fine-tuned state-of-the-art model. Mixtral-8x22b-Instruct emerged as the top performing model in ROUGE-1 and ROUGE-L, while Bangla T5 excelled in ROUGE-2. The results demonstrate that zero-shot LLMs can rival fine-tuned models, achieving high-quality summaries even without task-specific training. This work underscores the potential of LLMs in addressing challenges in low-resource languages, providing scalable solutions for healthcare query summarization.
Related papers
- An Empirical Study of Many-to-Many Summarization with Large Language Models [82.10000188179168]
Large language models (LLMs) have shown strong multi-lingual abilities, giving them the potential to perform Many-to-many summarization (M2MS) in real applications.<n>This work presents a systematic empirical study on LLMs' M2MS ability.
arXiv Detail & Related papers (2025-05-19T11:18:54Z) - Empowering Smaller Models: Tuning LLaMA and Gemma with Chain-of-Thought for Ukrainian Exam Tasks [0.0]
Small or compact models, though more efficient, often lack sufficient support for underrepresented languages.<n>This work explores the potential of parameter-efficient fine-tuning of compact open-weight language models to handle reasoning-intensive tasks.<n> tuning method with joint task topic and step-by-step solution generation outperforms standard chain-of-thought tuning in matching tasks.
arXiv Detail & Related papers (2025-03-18T07:44:49Z) - TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking [6.070192392563392]
We present TituLLMs, the first large pretrained Bangla LLMs, available in 1b and 3b parameter sizes.<n>To train TituLLMs, we collected a pretraining dataset of approximately 37 billion tokens.<n>We extended the Llama-3.2 tokenizer to incorporate language- and culture-specific knowledge.
arXiv Detail & Related papers (2025-02-16T16:22:23Z) - Generalists vs. Specialists: Evaluating Large Language Models for Urdu [4.8539869147159616]
We compare general-purpose models, GPT-4-Turbo and Llama-3-8b, with special-purpose models--XLM-Roberta-large, mT5-large, and Llama-3-8b--that have been fine-tuned on specific tasks.
We focus on seven classification and seven generation tasks to evaluate the performance of these models on Urdu language.
arXiv Detail & Related papers (2024-07-05T12:09:40Z) - DataComp-LM: In search of the next generation of training sets for language models [200.5293181577585]
DataComp for Language Models (DCLM) is a testbed for controlled dataset experiments with the goal of improving language models.<n>We provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations.<n>Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters.
arXiv Detail & Related papers (2024-06-17T17:42:57Z) - Benchmarking the Performance of Pre-trained LLMs across Urdu NLP Tasks [0.9786690381850356]
This study presents in-depth examination of 7 prominent Large Language Models (LLMs) across 17 tasks using 22 datasets, 13.8 hours of speech, in a zero-shot setting, and their performance against state-of-the-art (SOTA) models.<n>Our results emphasize that models with fewer parameters but richer language-specific data, like Llama 3.1-8B, often outperform larger models with lower language diversity, such as GPT-3.5, in several tasks.
arXiv Detail & Related papers (2024-05-24T11:30:37Z) - TriSum: Learning Summarization Ability from Large Language Models with Structured Rationale [66.01943465390548]
We introduce TriSum, a framework for distilling large language models' text summarization abilities into a compact, local model.
Our method enhances local model performance on various benchmarks.
It also improves interpretability by providing insights into the summarization rationale.
arXiv Detail & Related papers (2024-03-15T14:36:38Z) - YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z) - Branch-Solve-Merge Improves Large Language Model Evaluation and Generation [136.7876524839751]
Large Language Models (LLMs) are frequently used for multi-faceted language generation and evaluation tasks.
We propose Branch-Merge (BSM), a Large Language Model program (Schlag et al., 2023) for tackling such challenging natural language tasks.
BSM improves the evaluation correctness and consistency for each LLM by enhancing human-LLM agreement by up to 26%.
arXiv Detail & Related papers (2023-10-23T17:29:48Z) - Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting [65.00288634420812]
Pairwise Ranking Prompting (PRP) is a technique to significantly reduce the burden on Large Language Models (LLMs)
Our results are the first in the literature to achieve state-of-the-art ranking performance on standard benchmarks using moderate-sized open-sourced LLMs.
arXiv Detail & Related papers (2023-06-30T11:32:25Z) - BanglaNLG: Benchmarks and Resources for Evaluating Low-Resource Natural
Language Generation in Bangla [21.47743471497797]
This work presents a benchmark for evaluating natural language generation models in Bangla.
We aggregate three challenging conditional text generation tasks under the BanglaNLG benchmark.
Using a clean corpus of 27.5 GB of Bangla data, we pretrain BanglaT5, a sequence-to-sequence Transformer model for Bangla.
BanglaT5 achieves state-of-the-art performance in all of these tasks, outperforming mT5 (base) by up to 5.4%.
arXiv Detail & Related papers (2022-05-23T06:54:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.