Related papers: MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues

MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues

URL: http://arxiv.org/abs/2402.14762v3
Date: Tue, 05 Nov 2024 16:40:21 GMT
Title: MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues
Authors: Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, Wanli Ouyang,
Abstract summary: MT-Bench-101 is designed to evaluate the fine-grained abilities of Large Language Models (LLMs) in multi-turn dialogues. We construct a three-tier hierarchical ability taxonomy comprising 4208 turns across 1388 multi-turn dialogues in 13 distinct tasks. We then evaluate 21 popular LLMs based on MT-Bench-101, conducting comprehensive analyses from both ability and task perspectives.
Score: 58.33076950775072
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The advent of Large Language Models (LLMs) has drastically enhanced dialogue systems. However, comprehensively evaluating the dialogue abilities of LLMs remains a challenge. Previous benchmarks have primarily focused on single-turn dialogues or provided coarse-grained and incomplete assessments of multi-turn dialogues, overlooking the complexity and fine-grained nuances of real-life dialogues. To address this issue, we introduce MT-Bench-101, specifically designed to evaluate the fine-grained abilities of LLMs in multi-turn dialogues. By conducting a detailed analysis of real multi-turn dialogue data, we construct a three-tier hierarchical ability taxonomy comprising 4208 turns across 1388 multi-turn dialogues in 13 distinct tasks. We then evaluate 21 popular LLMs based on MT-Bench-101, conducting comprehensive analyses from both ability and task perspectives and observing differing trends in LLMs performance across dialogue turns within various tasks. Further analysis indicates that neither utilizing common alignment techniques nor chat-specific designs has led to obvious enhancements in the multi-turn abilities of LLMs. Extensive case studies suggest that our designed tasks accurately assess the corresponding multi-turn abilities. The data and code are available at \url{https://github.com/mtbench101/mt-bench-101}.

Related papers

D-SMART: Enhancing LLM Dialogue Consistency via Dynamic Structured Memory And Reasoning Tree [22.420810089099614]
Large Language Models (LLMs) often exhibit factual inconsistencies and logical decay in extended, multi-turn dialogues.<n>We propose D--101, a model-agnostic framework designed to maintain multi-turn dialogue consistency.<n>We introduce new NLI-based metrics to better measure multiturn dialogue consistency.
arXiv Detail & Related papers (2025-10-15T09:53:11Z)
KnowMT-Bench: Benchmarking Knowledge-Intensive Long-Form Question Answering in Multi-Turn Dialogues [58.305425399644086]
Multi-Turn Long-Form Question Answering (MT-LFQA) is a key application paradigm of Large Language Models (LLMs) in knowledge-intensive domains.<n>We introduce textbfKnowMT-Bench, the textitfirst-ever benchmark designed to systematically evaluate MT-LFQA for LLMs across knowledge-intensive fields.
arXiv Detail & Related papers (2025-09-26T04:32:29Z)
MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation [49.12071445991853]
Large Language Models (textbfLLMs) have been widely adopted in real-world dialogue applications.<n>MARS-Bench is constructed from play-by-play text commentary so to feature realistic dialogues.<n>Experiments on MARS-Bench also reveal that closed-source LLMs significantly outperform open-source alternatives.
arXiv Detail & Related papers (2025-05-27T10:28:04Z)
An LLM Benchmark for Addressee Recognition in Multi-modal Multi-party Dialogue [21.938414385824903]
This paper focuses on the task of addressee recognition, identifying who is being addressed to take the next turn. A subset of the corpus was annotated with addressee information, revealing that explicit addressees are indicated in approximately 20% of conversational turns.
arXiv Detail & Related papers (2025-01-28T02:27:55Z)
Can xLLMs Understand the Structure of Dialog? Exploring Multilingual Response Generation in Complex Scenarios [8.131774353504472]
We introduce XMP, a high-quality parallel Multilingual dataset sourced from Multi-party Podcast dialogues. Each sample in the dataset features at least three participants discussing a wide range of topics, including society, culture, politics, and entertainment. We uncover significant limitations in previously recognized multilingual capabilities of LLMs when applied to such complex dialogue scenarios.
arXiv Detail & Related papers (2025-01-20T04:33:03Z)
FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs [8.37667737406383]
We propose a fairness benchmark for large language model (LLM)-based chatbots in multi-turn dialogue scenarios, textbfFairMT-Bench. To ensure coverage of diverse bias types and attributes, we employ our template to construct a multi-turn dialogue dataset, textttFairMT-10K. Experiments and analyses on textttFairMT-10K reveal that in multi-turn dialogue scenarios, current LLMs are more likely to generate biased responses, and there is significant variation in performance across different tasks and models.
arXiv Detail & Related papers (2024-10-25T06:06:31Z)
RAD-Bench: Evaluating Large Language Models Capabilities in Retrieval Augmented Dialogues [8.036117602566074]
RAD-Bench is a benchmark designed to evaluate Large Language Models' capabilities in multi-turn dialogues following retrievals. Our evaluation results on commonly used LLMs reveal that model performance deteriorates as additional layers of conditions or constraints are applied.
arXiv Detail & Related papers (2024-09-19T08:26:45Z)
DivTOD: Unleashing the Power of LLMs for Diversifying Task-Oriented Dialogue Representations [21.814490079113323]
Language models pre-trained on general text have achieved impressive results in diverse fields. Yet, the distinct linguistic characteristics of task-oriented dialogues (TOD) compared to general text limit the practical utility of existing language models. We propose a novel dialogue pre-training model called DivTOD, which collaborates with LLMs to learn diverse task-oriented dialogue representations.
arXiv Detail & Related papers (2024-03-31T04:36:57Z)
Exploring the Factual Consistency in Dialogue Comprehension of Large Language Models [51.75805497456226]
This work focuses on the factual consistency issue with the help of the dialogue summarization task. Our evaluation shows that, on average, 26.8% of the summaries generated by LLMs contain factual inconsistency. To stimulate and enhance the dialogue comprehension ability of LLMs, we propose a fine-tuning paradigm with auto-constructed multi-task data.
arXiv Detail & Related papers (2023-11-13T09:32:12Z)
BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues [72.65163468440434]
This report provides a preliminary evaluation of existing large language models for human-style multi-turn chatting. We prompt large language models (LLMs) to generate a full multi-turn dialogue based on the ChatSEED, utterance by utterance. We find GPT-4 can generate human-style multi-turn dialogues with impressive quality, significantly outperforms its counterparts.
arXiv Detail & Related papers (2023-10-20T16:53:51Z)
Self-Explanation Prompting Improves Dialogue Understanding in Large Language Models [52.24756457516834]
We propose a novel "Self-Explanation" prompting strategy to enhance the comprehension abilities of Large Language Models (LLMs) This task-agnostic approach requires the model to analyze each dialogue utterance before task execution, thereby improving performance across various dialogue-centric tasks. Experimental results from six benchmark datasets confirm that our method consistently outperforms other zero-shot prompts and matches or exceeds the efficacy of few-shot prompts.
arXiv Detail & Related papers (2023-09-22T15:41:34Z)
Prompting and Evaluating Large Language Models for Proactive Dialogues: Clarification, Target-guided, and Non-collaboration [72.04629217161656]
This work focuses on three aspects of proactive dialogue systems: clarification, target-guided, and non-collaborative dialogues. To trigger the proactivity of LLMs, we propose the Proactive Chain-of-Thought prompting scheme.
arXiv Detail & Related papers (2023-05-23T02:49:35Z)
Cue-CoT: Chain-of-thought Prompting for Responding to In-depth Dialogue Questions with LLMs [59.74002011562726]
We propose a novel linguistic cue-based chain-of-thoughts (textitCue-CoT) to provide a more personalized and engaging response. We build a benchmark with in-depth dialogue questions, consisting of 6 datasets in both Chinese and English. Empirical results demonstrate our proposed textitCue-CoT method outperforms standard prompting methods in terms of both textithelpfulness and textitacceptability on all datasets.
arXiv Detail & Related papers (2023-05-19T16:27:43Z)
A Mixture-of-Expert Approach to RL-based Dialogue Management [56.08449336469477]
We use reinforcement learning to develop a dialogue agent that avoids being short-sighted (outputting generic utterances) and maximizes overall user satisfaction. Most existing RL approaches to DM train the agent at the word-level, and thus, have to deal with aly complex action space even for a medium-size vocabulary. We develop a RL-based DM using a novel mixture of expert language model (MoE-LM) that consists of (i) a LM capable of learning diverse semantics for conversation histories, (ii) a number of specialized LMs (or experts) capable of generating utterances corresponding to a
arXiv Detail & Related papers (2022-05-31T19:00:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.