UM_FHS at TREC 2024 PLABA: Exploration of Fine-tuning and AI agent approach for plain language adaptations of biomedical text
- URL: http://arxiv.org/abs/2502.14144v1
- Date: Wed, 19 Feb 2025 23:07:16 GMT
- Title: UM_FHS at TREC 2024 PLABA: Exploration of Fine-tuning and AI agent approach for plain language adaptations of biomedical text
- Authors: Primoz Kocbek, Leon Kopitar, Zhihong Zhang, Emirhan Aydin, Maxim Topaz, Gregor Stiglic
- Abstract summary: This paper describes our submissions to the TREC 2024 PLABA track, with the aim of simplifying biomedical abstracts for a K8-level audience (13- to 14-year-old students).
We tested three approaches using OpenAI's gpt-4o and gpt-4o-mini models: baseline prompt engineering, a two-AI agent approach, and fine-tuning.
Results indicated that the two-agent approach and baseline prompt engineering with gpt-4o-mini models showed superior qualitative performance, while fine-tuned models excelled in accuracy and completeness but were less simple.
- Score: 3.223303935767146
- License:
- Abstract: This paper describes our submissions to the TREC 2024 PLABA track, with the aim of simplifying biomedical abstracts for a K8-level audience (13- to 14-year-old students). We tested three approaches using OpenAI's gpt-4o and gpt-4o-mini models: baseline prompt engineering, a two-AI agent approach, and fine-tuning. Adaptations were evaluated using qualitative metrics (5-point Likert scales for simplicity, accuracy, completeness, and brevity) and quantitative readability scores (Flesch-Kincaid grade level, SMOG Index). Results indicated that the two-agent approach and baseline prompt engineering with gpt-4o-mini models showed superior qualitative performance, while fine-tuned models excelled in accuracy and completeness but were less simple. The evaluation results demonstrated that prompt engineering with gpt-4o-mini outperforms iterative improvement strategies via a two-agent approach as well as fine-tuning with gpt-4o. We intend to expand our investigation of the results and explore advanced evaluations.
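The two readability scores named in the abstract can be illustrated with a minimal Python sketch. It uses the standard published formulas for the Flesch-Kincaid grade level and the SMOG Index, but the sentence splitter, word tokenizer, and syllable counter are rough heuristics assumed here for illustration; the function names are hypothetical and this is not the authors' evaluation code.

```python
import math
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def split_sentences(text: str) -> list[str]:
    """Naive sentence split on ., ! and ? (an assumption, not a real tokenizer)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def tokenize_words(text: str) -> list[str]:
    """Alphabetic tokens, allowing one internal apostrophe."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = split_sentences(text)
    words = tokenize_words(text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def smog_index(text: str) -> float:
    """SMOG Index: 1.0430*sqrt(polysyllables * 30 / sentences) + 3.1291."""
    sentences = split_sentences(text)
    if not sentences:
        raise ValueError("text must contain at least one sentence")
    polysyllables = sum(1 for w in tokenize_words(text) if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291


if __name__ == "__main__":
    original = "Hypertension significantly increases cardiovascular morbidity."
    adapted = "High blood pressure makes heart problems more likely."
    for label, text in (("original", original), ("adapted", adapted)):
        print(label, round(flesch_kincaid_grade(text), 2), round(smog_index(text), 2))
```

Lower values on both scores indicate text that should be readable at a lower school grade, so a plain-language adaptation would be expected to score below its source abstract. The SMOG Index was designed for samples of about 30 sentences, so values for very short texts are only indicative.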
Related papers
- Language Models are Few-Shot Graders [0.12289361708127876]
We present an ASAG pipeline leveraging state-of-the-art LLMs.
We compare the grading performance of three OpenAI models: GPT-4, GPT-4o, and o1-preview.
Our findings indicate that providing graded examples enhances grading accuracy, with RAG-based selection outperforming random selection.
arXiv Detail & Related papers (2025-02-18T23:38:21Z)
- Empirical evaluation of LLMs in predicting fixes of Configuration bugs in Smart Home System [0.0]
This study evaluates the effectiveness of Large Language Models (LLMs) in predicting fixes for configuration bugs in smart home systems.
The research analyzes three prominent LLMs - GPT-4, GPT-4o (GPT-4 Turbo), and Claude 3.5 Sonnet.
arXiv Detail & Related papers (2025-02-16T02:11:36Z)
- Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark [62.58869921806019]
We propose a task decomposition evaluation framework based on GPT-4o to automatically construct a new training dataset.
We design innovative training strategies to effectively distill GPT-4o's evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6.
Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-base baseline.
arXiv Detail & Related papers (2024-11-23T08:06:06Z)
- Exploring Foundation Models Fine-Tuning for Cytology Classification [0.10555513406636088]
We show how existing foundation models can be applied to cytological classification.
We evaluate five foundation models across four cytological classification datasets.
Our results demonstrate that fine-tuning the pre-trained backbones with LoRA significantly improves model performance.
arXiv Detail & Related papers (2024-11-22T14:34:04Z)
- Evaluating GPT-4 at Grading Handwritten Solutions in Math Exams [48.99818550820575]
We leverage state-of-the-art multi-modal AI models, in particular GPT-4o, to automatically grade handwritten responses to college-level math exams.
Using real student responses to questions in a probability theory exam, we evaluate GPT-4o's alignment with ground-truth scores from human graders using various prompting techniques.
arXiv Detail & Related papers (2024-11-07T22:51:47Z)
- Edinburgh Clinical NLP at SemEval-2024 Task 2: Fine-tune your model unless you have access to GPT-4 [10.01547158445743]
We evaluate various Large Language Models (LLMs) with multiple strategies, including Chain-of-Thought, In-Context Learning, and Efficient Fine-Tuning (PEFT).
We found that the two PEFT adapters improve the F1 score (+0.0346) and consistency (+0.152) of the LLMs.
Averaging the three metrics, GPT-4 ranks joint-first in the competition with 0.8328.
arXiv Detail & Related papers (2024-03-30T22:27:21Z)
- Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases [98.35348038111508]
This paper presents an in-depth comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision).
The core of our analysis delves into the distinct visual comprehension abilities of each model.
Our findings illuminate the unique strengths and niches of both models.
arXiv Detail & Related papers (2023-12-22T18:59:58Z)
- DiffNAS: Bootstrapping Diffusion Models by Prompting for Better Architectures [63.12993314908957]
We propose a base model search approach, denoted "DiffNAS".
We leverage GPT-4 as a supernet to expedite the search, supplemented with a search memory to enhance the results.
Rigorous experimentation corroborates that our algorithm can augment the search efficiency by 2 times under GPT-based scenarios.
arXiv Detail & Related papers (2023-10-07T09:10:28Z)
- The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) [121.42924593374127]
We analyze the latest model, GPT-4V, to deepen the understanding of LMMs.
GPT-4V's unprecedented ability in processing arbitrarily interleaved multimodal inputs makes it a powerful multimodal generalist system.
GPT-4V's unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods.
arXiv Detail & Related papers (2023-09-29T17:34:51Z)
- Adversarial Fine-Tuning of Language Models: An Iterative Optimisation Approach for the Generation and Detection of Problematic Content [0.0]
We tackle the emerging challenge of unintended harmful content generation in Large Language Models (LLMs).
Our two-pronged approach employs an adversarial model, fine-tuned to generate potentially harmful prompts, and a judge model, iteratively optimised to discern these prompts.
We show that a rudimentary model, ada, can achieve 13% higher accuracy on the hold-out test set than GPT-4 after only a few rounds of this process.
arXiv Detail & Related papers (2023-08-26T05:20:58Z)
- GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.