UM_FHS at TREC 2024 PLABA: Exploration of Fine-tuning and AI agent approach for plain language adaptations of biomedical text
- URL: http://arxiv.org/abs/2502.14144v1
- Date: Wed, 19 Feb 2025 23:07:16 GMT
- Title: UM_FHS at TREC 2024 PLABA: Exploration of Fine-tuning and AI agent approach for plain language adaptations of biomedical text
- Authors: Primoz Kocbek, Leon Kopitar, Zhihong Zhang, Emirhan Aydin, Maxim Topaz, Gregor Stiglic
- Abstract summary: This paper describes our submissions to the TREC 2024 PLABA track with the aim of simplifying biomedical abstracts for a K8-level audience (13-14 year old students). We tested three approaches using OpenAI's gpt-4o and gpt-4o-mini models: baseline prompt engineering, a two-AI agent approach, and fine-tuning. Results indicated that the two-agent approach and baseline prompt engineering with gpt-4o-mini models show superior qualitative performance, while fine-tuned models excelled in accuracy and completeness but were less simple.
- Score: 3.223303935767146
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper describes our submissions to the TREC 2024 PLABA track, with the aim of simplifying biomedical abstracts for a K8-level audience (13-14 year old students). We tested three approaches using OpenAI's gpt-4o and gpt-4o-mini models: baseline prompt engineering, a two-AI agent approach, and fine-tuning. Adaptations were evaluated using qualitative metrics (5-point Likert scales for simplicity, accuracy, completeness, and brevity) and quantitative readability scores (Flesch-Kincaid grade level, SMOG Index). Results indicated that the two-agent approach and baseline prompt engineering with gpt-4o-mini show superior qualitative performance, while fine-tuned models excelled in accuracy and completeness but were less simple. The evaluation results demonstrated that prompt engineering with gpt-4o-mini outperforms both the iterative improvement strategy of the two-agent approach and fine-tuning with gpt-4o. We intend to expand our investigation of the results and explore advanced evaluations.
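The abstract describes the pipeline only at a high level. As an orientation aid, the sketch below shows one way the two-AI agent loop and the reported readability checks could look in Python. It is a hypothetical reconstruction, not the authors' code: the prompts, the fixed two-turn critique loop, and the naive syllable counter are assumptions, while the Flesch-Kincaid and SMOG formulas are the standard published ones.

```python
# Hypothetical sketch of a two-agent adaptation loop of the kind the abstract
# describes; prompts, loop length, and stopping rule are assumptions.
import math
import re

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def chat(model: str, system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content


def two_agent_adapt(abstract: str, model: str = "gpt-4o-mini", turns: int = 2) -> str:
    # Agent 1 drafts a K8-level adaptation; agent 2 critiques; agent 1 revises.
    draft = chat(model, "Rewrite biomedical text for 13-14 year old readers.", abstract)
    for _ in range(turns):
        critique = chat(model, "You review plain-language adaptations. "
                               "List remaining jargon, omissions, and inaccuracies.", draft)
        draft = chat(model, "Revise the adaptation using the critique. "
                            "Keep it simple, accurate, complete, and brief.",
                     f"Text:\n{draft}\n\nCritique:\n{critique}")
    return draft


def syllables(word: str) -> int:
    # Crude approximation: count runs of vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fk_grade(text: str) -> float:
    # Standard Flesch-Kincaid grade level.
    sents = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syls = sum(syllables(w) for w in words)
    return 0.39 * len(words) / sents + 11.8 * syls / len(words) - 15.59


def smog(text: str) -> float:
    # Standard SMOG index; polysyllables are words with 3+ syllables.
    sents = max(1, len(re.findall(r"[.!?]+", text)))
    poly = sum(1 for w in re.findall(r"[A-Za-z]+", text) if syllables(w) >= 3)
    return 1.0430 * math.sqrt(poly * 30 / sents) + 3.1291
```

In practice a threshold such as fk_grade(draft) <= 8.0 could replace the fixed turn count as a stopping rule for the K8 target.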
Related papers
- Weak-for-Strong: Training Weak Meta-Agent to Harness Strong Executors [104.5401871607713]
This paper proposes Weak-for-Strong Harnessing (W4S), a novel framework that customizes smaller, cost-efficient language models to design and optimize workflows for harnessing stronger models.
W4S formulates workflow design as a multi-turn Markov decision process and introduces reinforcement learning for agentic workflow optimization.
Empirical results demonstrate the superiority of W4S: a 7B meta-agent, trained with just one GPU hour, outperforms the strongest baseline by 2.9%-24.6% across eleven benchmarks.
arXiv Detail & Related papers (2025-04-07T07:27:31Z) - Language Models are Few-Shot Graders [0.12289361708127876]
We present an automated short-answer grading (ASAG) pipeline leveraging state-of-the-art LLMs. We compare the grading performance of three OpenAI models: GPT-4, GPT-4o, and o1-preview. Our findings indicate that providing graded examples enhances grading accuracy, with RAG-based selection outperforming random selection; a generic version of this selection step is sketched below.
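The RAG-based example selection credited here can be illustrated generically: embed the bank of graded answers once, retrieve the nearest ones for each new answer, and build a few-shot grading prompt from them. The snippet below is an illustrative sketch under assumed choices (embedding model, prompt layout, top-k retrieval), not the paper's code.

```python
# Illustrative RAG-style selection of graded examples for few-shot grading;
# the embedding model and prompt layout are assumptions, not the paper's code.
import numpy as np
from openai import OpenAI

client = OpenAI()


def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])


def build_grading_prompt(answer: str, graded: list[tuple[str, str]], k: int = 3) -> str:
    # Rank graded (answer, grade) pairs by cosine similarity to the new answer.
    bank = embed([a for a, _ in graded])
    q = embed([answer])[0]
    sims = bank @ q / (np.linalg.norm(bank, axis=1) * np.linalg.norm(q))
    shots = [graded[i] for i in np.argsort(sims)[-k:]]
    demos = "\n\n".join(f"Answer: {a}\nGrade: {g}" for a, g in shots)
    return f"{demos}\n\nAnswer: {answer}\nGrade:"
```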
arXiv Detail & Related papers (2025-02-18T23:38:21Z) - Empirical evaluation of LLMs in predicting fixes of Configuration bugs in Smart Home System [0.0]
This study evaluates the effectiveness of Large Language Models (LLMs) in predicting fixes for configuration bugs in smart home systems.
The research analyzes three prominent LLMs: GPT-4, GPT-4o, and Claude 3.5 Sonnet.
arXiv Detail & Related papers (2025-02-16T02:11:36Z) - Large language models streamline automated systematic review: A preliminary study [12.976248955642037]
Large Language Models (LLMs) have shown promise in natural language processing tasks, with the potential to automate systematic reviews.
This study evaluates the performance of three state-of-the-art LLMs in conducting systematic review tasks.
arXiv Detail & Related papers (2025-01-09T01:59:35Z) - Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark [62.58869921806019]
We propose a task decomposition evaluation framework based on GPT-4o to automatically construct a new training dataset.
We design innovative training strategies to effectively distill GPT-4o's evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6.
Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-base baseline.
arXiv Detail & Related papers (2024-11-23T08:06:06Z) - ExACT: Teaching AI Agents to Explore with Reflective-MCTS and Exploratory Learning [78.42927884000673]
ExACT is an approach that combines test-time search and self-learning to build o1-like models for agentic applications. We first introduce Reflective Monte Carlo Tree Search (R-MCTS), a novel test-time algorithm designed to enhance AI agents' ability to explore the decision space on the fly. Next, we introduce Exploratory Learning, a novel learning strategy to teach agents to search at inference time without relying on any external search algorithms.
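R-MCTS itself is specific to this paper, but the Monte Carlo Tree Search skeleton it extends is standard. The sketch below shows only the classic UCT selection and backpropagation steps as orientation; the reflective components of R-MCTS are not reproduced here.

```python
# Minimal UCT selection and backpropagation from classic MCTS; R-MCTS layers
# reflection on top of this skeleton, which is not shown here.
import math


class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children: list["Node"] = []
        self.visits, self.value = 0, 0.0


def uct_select(node: Node, c: float = 1.4) -> Node:
    # Pick the child maximizing exploitation plus an exploration bonus.
    return max(
        node.children,
        key=lambda ch: ch.value / (ch.visits + 1e-9)
        + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)),
    )


def backpropagate(node: Node, reward: float) -> None:
    # Propagate the rollout reward up to the root.
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent
```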
arXiv Detail & Related papers (2024-10-02T21:42:35Z) - Edinburgh Clinical NLP at SemEval-2024 Task 2: Fine-tune your model unless you have access to GPT-4 [10.01547158445743]
We evaluate various Large Language Models (LLMs) with multiple strategies, including Chain-of-Thought, In-Context Learning, and Parameter-Efficient Fine-Tuning (PEFT).
We found that the two PEFT adapters improve the F1 score (+0.0346) and consistency (+0.152) of the LLMs; a generic adapter setup of this kind is sketched below.
Averaging the three metrics, GPT-4 ranks joint-first in the competition with 0.8328.
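PEFT adapters of the kind this entry reports are commonly built with the Hugging Face peft library. The configuration below is a generic LoRA sketch under assumed hyperparameters and an assumed base model; it is not the paper's setup.

```python
# Generic LoRA adapter setup with the Hugging Face peft library; the base
# model id and all hyperparameters are assumptions, not the paper's settings.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
config = LoraConfig(
    r=8,                                  # low-rank dimension
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

Only the small adapter matrices are updated during fine-tuning, which is what makes the approach attractive when full fine-tuning of the base model is too costly.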
arXiv Detail & Related papers (2024-03-30T22:27:21Z) - Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases [98.35348038111508]
This paper presents an in-depth comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision).
The core of our analysis delves into the distinct visual comprehension abilities of each model.
Our findings illuminate the unique strengths and niches of both models.
arXiv Detail & Related papers (2023-12-22T18:59:58Z) - Prompt Engineering or Fine-Tuning: An Empirical Assessment of LLMs for Code [7.760653867600283]
We evaluate GPT-4 using three prompt engineering strategies: basic prompting, in-context learning, and task-specific prompting. We compare it against 17 fine-tuned models across three code-related tasks: code summarization, generation, and translation.
arXiv Detail & Related papers (2023-10-11T00:21:00Z) - DiffNAS: Bootstrapping Diffusion Models by Prompting for Better Architectures [63.12993314908957]
We propose a base model search approach, denoted "DiffNAS".
We leverage GPT-4 as a supernet to expedite the search, supplemented with a search memory to enhance the results.
Rigorous experimentation corroborates that our algorithm can double search efficiency in GPT-based scenarios.
arXiv Detail & Related papers (2023-10-07T09:10:28Z) - The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) [121.42924593374127]
We analyze the latest model, GPT-4V, to deepen the understanding of LMMs.
GPT-4V's unprecedented ability in processing arbitrarily interleaved multimodal inputs makes it a powerful multimodal generalist system.
GPT-4V's unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods.
arXiv Detail & Related papers (2023-09-29T17:34:51Z) - Adversarial Fine-Tuning of Language Models: An Iterative Optimisation Approach for the Generation and Detection of Problematic Content [0.0]
We tackle the emerging challenge of unintended harmful content generation in Large Language Models (LLMs).
Our two-pronged approach employs an adversarial model, fine-tuned to generate potentially harmful prompts, and a judge model, iteratively optimised to discern these prompts.
We show that a rudimentary model, ada, can achieve 13% higher accuracy on the hold-out test set than GPT-4 after only a few rounds of this process.
arXiv Detail & Related papers (2023-08-26T05:20:58Z) - GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z) - LENS: A Learnable Evaluation Metric for Text Simplification [17.48383068498169]
We present LENS, a learnable evaluation metric for text simplification.
We also introduce Rank and Rate, a human evaluation framework that rates simplifications from several models in a list-wise manner.
arXiv Detail & Related papers (2022-12-19T18:56:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.