OccuQuest: Mitigating Occupational Bias for Inclusive Large Language Models
- URL: http://arxiv.org/abs/2310.16517v1
- Date: Wed, 25 Oct 2023 10:06:17 GMT
- Title: OccuQuest: Mitigating Occupational Bias for Inclusive Large Language Models
- Authors: Mingfeng Xue, Dayiheng Liu, Kexin Yang, Guanting Dong, Wenqiang Lei,
Zheng Yuan, Chang Zhou, Jingren Zhou
- Abstract summary: We create an instruction-tuning dataset named OccuQuest, which contains 110,000+ prompt-completion pairs and 30,000+ dialogues covering over 1,000 occupations in 26 occupational categories.
We then fine-tune LLaMA on OccuQuest to obtain OccuLLaMA, which significantly outperforms state-of-the-art LLaMA variants on professional questions in GPT-4 and human evaluations.
- Score: 73.49209444768057
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The emergence of large language models (LLMs) has revolutionized natural
language processing tasks. However, existing instruction-tuning datasets suffer
from occupational bias: the majority of data relates to only a few occupations,
which hampers the ability of instruction-tuned LLMs to generate helpful responses to professional queries from practitioners in specific fields. To mitigate this
issue and promote occupation-inclusive LLMs, we create an instruction-tuning
dataset named OccuQuest, which contains 110,000+ prompt-completion pairs
and 30,000+ dialogues covering over 1,000 occupations in 26 occupational
categories. We systematically query ChatGPT, organizing prompts hierarchically by Occupation, Responsibility, Topic, and Question, to ensure comprehensive coverage of occupational specialty inquiries.
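As an illustration, the hierarchical query construction might look like the minimal Python sketch below. The Occupation, Responsibility, Topic, and Question levels come from the abstract; the sample entries and the prompt template (which phrases the Question level from each Topic) are hypothetical, not the authors' released pipeline.

```python
# Minimal sketch of OccuQuest-style hierarchical query generation.
# The Occupation -> Responsibility -> Topic hierarchy follows the abstract;
# the sample entries and the prompt template are hypothetical.

OCCUPATION_TREE = {
    "Real Estate Agent": {                  # Occupation
        "Property valuation": [             # Responsibility
            "Comparative market analysis",  # Topic
            "Appraisal disputes",
        ],
    },
}

def build_prompts(tree: dict) -> list[str]:
    """Expand the hierarchy into concrete queries to send to ChatGPT."""
    prompts = []
    for occupation, responsibilities in tree.items():
        for responsibility, topics in responsibilities.items():
            for topic in topics:
                prompts.append(
                    f"As a {occupation} responsible for {responsibility.lower()}, "
                    f"answer a practitioner's question about {topic.lower()}."
                )
    return prompts

for prompt in build_prompts(OCCUPATION_TREE):
    print(prompt)  # each query would be sent to ChatGPT for a completion
```

Each generated query would then be paired with the model's completion to form the prompt-completion pairs and dialogues described above.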
Compared with three commonly used datasets (Dolly, ShareGPT, and WizardLM), OccuQuest exhibits a more balanced distribution across occupations. Furthermore, we assemble three test sets for comprehensive evaluation: an occu-test set covering 25 occupational categories, an estate set
focusing on real estate, and an occu-quora set containing real-world questions
from Quora. We then fine-tune LLaMA on OccuQuest to obtain OccuLLaMA, which
significantly outperforms state-of-the-art LLaMA variants (Vicuna, Tulu, and
WizardLM) on professional questions in GPT-4 and human evaluations. Notably, on
the occu-quora set, OccuLLaMA reaches a high win rate of 86.4% against
WizardLM.
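For context, the reported win rate is a pairwise-comparison metric: a judge (GPT-4 or a human annotator) sees two answers to the same question and picks the better one. Below is a minimal sketch of how such a win rate could be computed; the label scheme and the convention of excluding ties are assumptions, not the paper's exact protocol.

```python
from collections import Counter

def win_rate(judgments: list[str]) -> float:
    """Fraction of decided pairwise comparisons won by the candidate model."""
    counts = Counter(judgments)
    decided = counts["win"] + counts["loss"]  # ties excluded (an assumption)
    return counts["win"] / decided if decided else 0.0

# Hypothetical judgments for OccuLLaMA vs. WizardLM on occu-quora questions:
# 19 wins and 3 losses among decided comparisons give 19/22, roughly 86.4%.
labels = ["win"] * 19 + ["loss"] * 3 + ["tie"] * 3
print(f"win rate: {win_rate(labels):.1%}")
```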
Related papers
- QA-TOOLBOX: Conversational Question-Answering for process task guidance in manufacturing [6.377282332225302]
The dataset consists of representative samples of interactions with technicians working in an advanced manufacturing setting: 200,000+ question/answer pairs that refer to the spec document and are grounded in narrations and/or video demonstrations.
arXiv Detail & Related papers (2024-12-03T18:10:31Z)
- CLR-Bench: Evaluating Large Language Models in College-level Reasoning [17.081788240112417]
Large language models (LLMs) have demonstrated their remarkable performance across various language understanding tasks.
We present CLR-Bench to comprehensively evaluate the LLMs in complex college-level reasoning.
arXiv Detail & Related papers (2024-10-23T04:55:08Z)
- MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains [54.117238759317004]
The Massive Multitask Agent Understanding (MMAU) benchmark features comprehensive offline tasks that eliminate the need for complex environment setups.
It evaluates models across five domains: Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming, and Mathematics.
With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents.
arXiv Detail & Related papers (2024-07-18T00:58:41Z)
- MACAROON: Training Vision-Language Models To Be Your Engaged Partners [95.32771929749514]
Large vision-language models (LVLMs) generate detailed responses even when questions are ambiguous or unlabeled.
In this study, we aim to shift LVLMs from passive answer providers to proactive engaged partners.
We introduce MACAROON, self-iMaginAtion for ContrAstive pReference OptimizatiON, which instructs LVLMs to autonomously generate contrastive response pairs for unlabeled questions.
arXiv Detail & Related papers (2024-06-20T09:27:33Z)
- Analyzing the Role of Semantic Representations in the Era of Large Language Models [104.18157036880287]
We investigate the role of semantic representations in the era of large language models (LLMs).
We propose an AMR-driven chain-of-thought prompting method, which we call AMRCoT.
We find it difficult to predict on which input examples AMR may help or hurt, but errors tend to arise with multi-word expressions.
arXiv Detail & Related papers (2024-05-02T17:32:59Z)
- UnibucLLM: Harnessing LLMs for Automated Prediction of Item Difficulty and Response Time for Multiple-Choice Questions [25.877058354902953]
This work explores a novel data augmentation method based on Large Language Models (LLMs) for predicting item difficulty and response time of retired USMLE Multiple-Choice Questions (MCQs) in the BEA 2024 Shared Task.
Our approach is based on augmenting the dataset with answers from zero-shot LLMs and employing transformer-based models based on six alternative feature combinations.
arXiv Detail & Related papers (2024-04-20T10:41:02Z)
- Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks.
LLMs acquire their general-purpose language understanding and generation abilities by training billions of model parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z)
- The Eval4NLP 2023 Shared Task on Prompting Large Language Models as Explainable Metrics [36.52897053496835]
Generative large language models (LLMs) have shown remarkable capabilities in solving tasks with minimal or no task-related examples.
We introduce the Eval4NLP 2023 shared task that asks participants to explore prompting and score extraction for machine translation (MT) and summarization evaluation.
We present an overview of participants' approaches and evaluate them on a new reference-free test set spanning three language pairs for MT and a summarization dataset.
arXiv Detail & Related papers (2023-10-30T17:55:08Z)
- DORIS-MAE: Scientific Document Retrieval using Multi-level Aspect-based Queries [2.4816250611120547]
We propose a novel task, Scientific DOcument Retrieval using Multi-level Aspect-based quEries (DORIS-MAE).
For each complex query, we assembled a collection of 100 relevant documents and produced annotated relevance scores for ranking them.
Anno-GPT is a framework for validating the performance of Large Language Models (LLMs) on expert-level dataset annotation tasks.
arXiv Detail & Related papers (2023-10-07T03:25:06Z)
- Multi-Task Instruction Tuning of LLaMa for Specific Scenarios: A Preliminary Study on Writing Assistance [60.40541387785977]
Small foundational models can display remarkable proficiency in tackling diverse tasks when fine-tuned using instruction-driven data.
In this work, we investigate a practical problem setting where the primary focus is on one or a few particular tasks rather than general-purpose instruction following.
Experimental results show that fine-tuning LLaMA on writing instruction data significantly improves its ability on writing tasks.
arXiv Detail & Related papers (2023-05-22T16:56:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.