Bactrainus: Optimizing Large Language Models for Multi-hop Complex Question Answering Tasks
- URL: http://arxiv.org/abs/2501.06286v1
- Date: Fri, 10 Jan 2025 18:44:06 GMT
- Title: Bactrainus: Optimizing Large Language Models for Multi-hop Complex Question Answering Tasks
- Authors: Iman Barati, Arash Ghafouri, Behrouz Minaei-Bidgoli
- Abstract summary: We evaluate the ability of large language models to perform domain-specific tasks using the HotpotQA dataset.
This task serves as a challenging benchmark for assessing the language comprehension capabilities of these models.
The results of the study show that integrating large language models with these techniques can yield up to a 4% improvement in answer F1 score.
- Score: 5.439505575097552
- Abstract: In recent years, the use of large language models (LLMs) has increased significantly, and these models have demonstrated remarkable performance on a variety of general language tasks. However, the evaluation of their performance on domain-specific tasks, particularly those requiring deep natural language understanding, has received less attention. In this research, we evaluate the ability of large language models to perform domain-specific tasks, focusing on the multi-hop question answering (MHQA) problem using the HotpotQA dataset. Because it requires reasoning over and combining information from multiple textual sources, this task serves as a challenging benchmark for assessing the language comprehension capabilities of these models. To tackle this problem, we designed a two-stage selector-reader architecture in which each stage utilizes an independent LLM. In addition, methods such as Chain of Thought (CoT) and question decomposition were employed to investigate their impact on the model's performance. The results of the study show that integrating large language models with these techniques can yield up to a 4% improvement in answer F1 score, providing evidence of the models' ability to handle domain-specific tasks and of their understanding of complex language.
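As a concrete illustration of the control flow described above, here is a minimal sketch of a two-stage selector-reader pipeline with optional question decomposition and CoT prompting. The `complete` function is a hypothetical stand-in for whatever LLM API is used; the paper does not specify models, prompts, or APIs, so all wording and helper names below are assumptions.

```python
# Minimal sketch of a two-stage selector-reader pipeline for multi-hop QA.
# `complete` is a hypothetical stand-in for any LLM chat/completion call;
# the paper does not specify a particular API, model, or prompt wording.

def complete(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM API call here")

def select_paragraphs(question: str, paragraphs: list[str], k: int = 2) -> list[str]:
    """Stage 1 (selector): ask an LLM which paragraphs are relevant."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(paragraphs))
    prompt = (
        f"Question: {question}\n\nParagraphs:\n{numbered}\n\n"
        f"List the indices of the {k} paragraphs needed to answer the question, comma-separated:"
    )
    reply = complete(prompt)
    indices = [int(tok) for tok in reply.replace(",", " ").split() if tok.isdigit()]
    return [paragraphs[i] for i in indices[:k] if i < len(paragraphs)]

def decompose(question: str) -> list[str]:
    """Optional question decomposition into single-hop sub-questions."""
    reply = complete(f"Break this multi-hop question into simpler sub-questions, one per line:\n{question}")
    return [line.strip() for line in reply.splitlines() if line.strip()]

def read_answer(question: str, context: list[str]) -> str:
    """Stage 2 (reader): answer from the selected context with a CoT-style prompt."""
    prompt = (
        "Context:\n" + "\n".join(context) +
        f"\n\nQuestion: {question}\nThink step by step, then give the final answer on the last line."
    )
    return complete(prompt).strip().splitlines()[-1]

def answer(question: str, paragraphs: list[str]) -> str:
    context = select_paragraphs(question, paragraphs)
    for sub in decompose(question):          # each sub-answer becomes extra context
        context.append(f"{sub} -> {read_answer(sub, context)}")
    return read_answer(question, context)
```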
Related papers
- Demystifying Multilingual Chain-of-Thought in Process Reward Modeling [71.12193680015622]
We tackle the challenge of extending process reward models (PRMs) to multilingual settings.
We train multilingual PRMs on a dataset spanning seven languages, which is translated from English.
Our results highlight the sensitivity of multilingual PRMs to both the number of training languages and the volume of English data.
arXiv Detail & Related papers (2025-02-18T09:11:44Z)
- Multilingual State Space Models for Structured Question Answering in Indic Languages [2.591667713953504]
This paper explores the application of State Space Models (SSMs) to build efficient and contextually aware QA systems tailored for Indic languages.
SSMs are particularly suited for this task due to their ability to model long-term and short-term dependencies in sequential data.
Our results demonstrate that these models effectively capture linguistic subtleties, leading to significant improvements in question interpretation, context alignment, and answer generation.
arXiv Detail & Related papers (2025-02-01T19:53:02Z)
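As background for the claim above about long- and short-term dependencies: a discretized linear state space layer carries a hidden state forward with x_t = A x_{t-1} + B u_t and reads out y_t = C x_t, so information can persist across arbitrary distances. The sketch below illustrates that scan for SSMs in general; the matrices are random placeholders, not this paper's parameterization.

```python
import numpy as np

# Minimal linear state-space scan: x_t = A x_{t-1} + B u_t, y_t = C x_t.
# Illustrates how SSMs propagate information over a sequence; the matrices
# here are random placeholders, not the parameterization used in the paper.

def ssm_scan(u: np.ndarray, A: np.ndarray, B: np.ndarray, C: np.ndarray) -> np.ndarray:
    """u: (seq_len, d_in) inputs; returns (seq_len, d_out) outputs."""
    state = np.zeros(A.shape[0])
    outputs = []
    for u_t in u:
        state = A @ state + B @ u_t   # recurrent state update (long-range memory)
        outputs.append(C @ state)     # per-step readout
    return np.stack(outputs)

rng = np.random.default_rng(0)
d_state, d_in, d_out, seq_len = 8, 4, 4, 16
A = 0.9 * np.eye(d_state)                  # stable dynamics retain old information
B = rng.normal(size=(d_state, d_in)) * 0.1
C = rng.normal(size=(d_out, d_state))
y = ssm_scan(rng.normal(size=(seq_len, d_in)), A, B, C)
print(y.shape)  # (16, 4)
```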
- P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning.
Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks.
We present a pipeline for selecting available and reasonable benchmarks from massive ones, addressing the oversight in previous work regarding the utility of these benchmarks.
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
- SandboxAQ's submission to MRL 2024 Shared Task on Multi-lingual Multi-task Information Retrieval [1.2629889435114405]
This paper explores the problems of Question Answering (QA) and Named Entity Recognition (NER) in five diverse languages.
We tested five Large Language Models with various prompting methods, including zero-shot, chain-of-thought reasoning, and translation techniques.
Our results show that while some models consistently outperform others, their effectiveness varies significantly across tasks and languages.
arXiv Detail & Related papers (2024-10-28T20:15:45Z)
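The three prompting styles named in the entry above differ only in how the input is framed. The templates below are plausible stand-ins to show that difference; the paper's exact wording is not reproduced here.

```python
# Illustrative prompt templates for zero-shot, chain-of-thought, and
# translation prompting. These are assumed wordings, not the paper's prompts.

def zero_shot(question: str, context: str) -> str:
    return f"Context: {context}\nQuestion: {question}\nAnswer:"

def chain_of_thought(question: str, context: str) -> str:
    return (f"Context: {context}\nQuestion: {question}\n"
            "Let's think step by step before giving the final answer.")

def translate_then_answer(question: str, context: str, src_lang: str) -> list[str]:
    """Translation prompting: translate to English first, then answer."""
    return [
        f"Translate the following {src_lang} text to English:\n{context}\n{question}",
        "Using the translated context, answer the translated question concisely.",
    ]

print(zero_shot("Who wrote the report?", "The report was written by Dr. Okafor."))
```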
- Sign of the Times: Evaluating the use of Large Language Models for Idiomaticity Detection [2.2724928083094196]
This work looks at the performance of a range of LLMs on three idiomaticity datasets: SemEval 2022 Task 2a, FLUTE, and MAGPIE.
We find that whilst these models do give competitive performance, they do not match the results of fine-tuned task-specific models, even at the largest scales.
arXiv Detail & Related papers (2024-05-15T11:55:14Z)
- The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance.
Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes.
We analyze representation space, generated response and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z)
- Mixture-of-Instructions: Aligning Large Language Models via Mixture Prompting [7.103987978402038]
We introduce a novel technique termed Mixture-of-Instructions (MoI).
MoI employs a strategy of instruction packing combined with diverse system prompts to boost the alignment efficiency of language models.
Our methodology was applied to the open-source Qwen-7B-chat model, culminating in the development of Qwen-SFT-MoI.
arXiv Detail & Related papers (2024-04-29T03:58:12Z)
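"Instruction packing combined with diverse system prompts," as described in the entry above, suggests concatenating several instruction-response pairs, each under its own system prompt, into one training sequence. The sketch below shows that packing step under stated assumptions: the separators, field names, and system-prompt pool are illustrative, not the paper's training format.

```python
import random

# Sketch of instruction packing with diverse system prompts, in the spirit of
# Mixture-of-Instructions. Separators, field names, and the system-prompt pool
# are assumptions for illustration, not the paper's exact training format.

SYSTEM_PROMPTS = [
    "You are a helpful assistant.",
    "You are a careful coding assistant.",
    "You are a concise math tutor.",
]

def pack_examples(examples: list[dict], max_chars: int = 2048) -> list[str]:
    """Greedily pack (instruction, response) pairs into training sequences."""
    sequences, current = [], ""
    for ex in examples:
        system = random.choice(SYSTEM_PROMPTS)  # diversify the system prompt per example
        chunk = f"<|system|>{system}<|user|>{ex['instruction']}<|assistant|>{ex['response']}"
        if current and len(current) + len(chunk) > max_chars:
            sequences.append(current)           # flush a full sequence
            current = ""
        current += chunk
    if current:
        sequences.append(current)
    return sequences

data = [{"instruction": f"Task {i}", "response": f"Answer {i}"} for i in range(10)]
print(len(pack_examples(data, max_chars=200)))
```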
- L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
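Confidence calibration, mentioned in the L2CEval entry above, is commonly reported as expected calibration error (ECE): bucket predictions by stated confidence and compare each bucket's average confidence to its accuracy. The sketch below computes the standard equal-width-bin variant; the paper's exact protocol may differ.

```python
import numpy as np

# Expected calibration error (ECE): how well a model's stated confidence
# matches its actual accuracy. Equal-width binning is the common variant;
# the paper's exact calibration protocol may differ.

def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray,
                               n_bins: int = 10) -> float:
    """confidences in [0, 1]; correct is a 0/1 array of the same length."""
    ece = 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap   # weight each bin by its share of samples
    return ece

conf = np.array([0.9, 0.8, 0.95, 0.6, 0.55])
hit = np.array([1, 1, 0, 1, 0])
print(round(expected_calibration_error(conf, hit), 3))
```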
- Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
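In one common formulation, Feature Density is the ratio of unique features to all feature occurrences in a dataset, used as a cheap proxy for dataset complexity. The sketch below follows that assumed formulation with naive whitespace tokenization; the paper's exact definition and linguistic preprocessing may differ.

```python
from collections import Counter

# Feature Density (FD) sketch: unique features / total feature occurrences.
# One common formulation used as a dataset-complexity proxy; the paper's
# exact definition and preprocessing pipelines may differ.

def feature_density(documents: list[str], ngram: int = 1) -> float:
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()                 # naive whitespace tokenization
        counts.update(zip(*(tokens[i:] for i in range(ngram))))
    total = sum(counts.values())
    return len(counts) / total if total else 0.0

corpus = ["you are terrible", "you are great", "have a great day"]
print(round(feature_density(corpus), 3))             # unigram FD
print(round(feature_density(corpus, ngram=2), 3))    # bigram FD
```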
- Specializing Multilingual Language Models: An Empirical Study [50.7526245872855]
Contextualized word representations from pretrained multilingual language models have become the de facto standard for addressing natural language tasks.
For languages rarely or never seen by these models, directly using such models often results in suboptimal representation or use of data.
arXiv Detail & Related papers (2021-06-16T18:13:55Z)