MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large
Language Models
- URL: http://arxiv.org/abs/2401.16745v1
- Date: Tue, 30 Jan 2024 04:50:28 GMT
- Title: MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large
Language Models
- Authors: Wai-Chung Kwan, Xingshan Zeng, Yuxin Jiang, Yufei Wang, Liangyou Li,
Lifeng Shang, Xin Jiang, Qun Liu, Kam-Fai Wong
- Abstract summary: We introduce MT-Eval, a comprehensive benchmark designed to evaluate multi-turn conversational abilities.
By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up.
Our evaluation of 11 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks.
- Score: 70.92847554971065
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are increasingly relied upon for complex
multi-turn conversations across diverse real-world applications. However,
existing benchmarks predominantly focus on single-turn evaluations, overlooking
the models' capabilities in multi-turn interactions. To address this gap, we
introduce MT-Eval, a comprehensive benchmark designed to evaluate multi-turn
conversational abilities. By analyzing human-LLM conversations, we categorize
interaction patterns into four types: recollection, expansion, refinement, and
follow-up. We construct multi-turn queries for each category either by
augmenting existing datasets or by creating new examples with GPT-4 to avoid
data leakage. To study the factors impacting multi-turn abilities, we create
single-turn versions of the 1170 multi-turn queries and compare performance.
Our evaluation of 11 well-known LLMs shows that while closed-source models
generally surpass open-source ones, certain open-source models exceed
GPT-3.5-Turbo in specific tasks. We observe significant performance degradation
in multi-turn settings compared to single-turn settings in most models, which
is not correlated with the models' fundamental capabilities. Moreover, we
identify the distance to relevant content and susceptibility to error
propagation as the key factors influencing multi-turn performance. MT-Eval is
released publicly to encourage future research towards more robust
conversational models.
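To make the evaluation protocol concrete, the sketch below contrasts the multi-turn setting (each query is issued with the full dialogue history) with its single-turn counterpart, for queries drawn from the four interaction types named in the abstract. The `generate` callable, the message format, and the example queries are illustrative assumptions, not the released MT-Eval code.

```python
from typing import Callable, Dict, List

# Four interaction patterns identified in the MT-Eval abstract.
INTERACTION_TYPES = ["recollection", "expansion", "refinement", "follow-up"]

# A chat model is modelled here as any callable from a message list to a reply.
ChatModel = Callable[[List[Dict[str, str]]], str]

def run_multi_turn(generate: ChatModel, turns: List[str]) -> List[str]:
    """Multi-turn setting: each query is issued with the full dialogue history."""
    history: List[Dict[str, str]] = []
    replies: List[str] = []
    for query in turns:
        history.append({"role": "user", "content": query})
        reply = generate(history)
        history.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies

def run_single_turn(generate: ChatModel, turns: List[str]) -> List[str]:
    """Single-turn counterpart: each query is issued in isolation for comparison."""
    return [generate([{"role": "user", "content": q}]) for q in turns]

if __name__ == "__main__":
    # Placeholder model that just reports how much context it received.
    dummy: ChatModel = lambda messages: f"reply after seeing {len(messages)} message(s)"
    turns = ["Summarise the document.", "Now compress your summary to one sentence."]
    print(run_multi_turn(dummy, turns))   # sees growing history
    print(run_single_turn(dummy, turns))  # sees each query alone
```

Comparing the two runs over the same queries isolates the effect of dialogue history, which is the comparison the abstract describes between the 1170 multi-turn queries and their single-turn versions.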
Related papers
- P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning.
Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks.
We present a pipeline for selecting available and reasonable benchmarks from the large pool of existing ones, addressing the oversight of benchmark utility in previous work.
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
- Evaluating Cost-Accuracy Trade-offs in Multimodal Search Relevance Judgements [1.6637373649145606]
Large Language Models (LLMs) have demonstrated potential as effective search relevance evaluators.
There is a lack of comprehensive guidance on which models consistently perform optimally across various contexts or within specific use cases.
Our analysis investigates the trade-offs between cost and accuracy, highlighting that model performance varies significantly depending on the context.
arXiv Detail & Related papers (2024-10-25T21:29:04Z)
- What Matters for Model Merging at Scale? [94.26607564817786]
Model merging aims to combine multiple expert models into a more capable single model.
Previous studies have primarily focused on merging a few small models.
This study systematically evaluates the utility of model merging at scale.
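For readers unfamiliar with the technique, the simplest merging baseline is an element-wise average of the experts' parameters. The snippet below sketches only that generic baseline; the function name and toy checkpoints are illustrative, and this is not the specific merging recipe examined in the paper.

```python
import torch

def average_merge(state_dicts, weights=None):
    """Element-wise (weighted) average of expert checkpoints sharing one architecture."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name].float() for w, sd in zip(weights, state_dicts))
    return merged

# Toy usage: merge two randomly initialised "experts" with identical layer shapes.
expert_a = torch.nn.Linear(4, 2).state_dict()
expert_b = torch.nn.Linear(4, 2).state_dict()
merged = average_merge([expert_a, expert_b])
print({k: v.shape for k, v in merged.items()})
```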
arXiv Detail & Related papers (2024-10-04T17:17:19Z)
- MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs [38.93090238335506]
Spurious bias, the tendency to rely on spurious correlations between non-essential input attributes and target variables for prediction, is a severe pitfall in deep learning models trained on single-modality data.
We introduce MM-SpuBench, a comprehensive visual question-answering (VQA) benchmark designed to evaluate MLLMs' reliance on nine distinct categories of spurious correlations.
Our findings show that these models continue to rely on spurious correlations and underscore the need for new methodologies to mitigate spurious biases.
arXiv Detail & Related papers (2024-06-24T20:29:16Z)
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: an Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
- Corpus Considerations for Annotator Modeling and Scaling [9.263562546969695]
We show that the commonly used user token model consistently outperforms more complex models.
Our findings shed light on the relationship between corpus statistics and annotator modeling performance.
arXiv Detail & Related papers (2024-04-02T22:27:24Z)
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
- A Comparative Study of Transformer-Based Language Models on Extractive Question Answering [0.5079811885340514]
We train various pre-trained language models and fine-tune them on multiple question answering datasets.
Using the F1-score as our metric, we find that the RoBERTa and BART pre-trained models perform the best across all datasets.
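For reference, the F1-score in extractive QA is conventionally a token-overlap measure between the predicted and gold answer spans, as in SQuAD-style evaluation. The minimal sketch below shows that standard computation only; it is not the authors' evaluation script and omits SQuAD's article and punctuation normalization.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style token-overlap F1 between a predicted and a gold answer span."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the Eiffel Tower", "Eiffel Tower"))  # 0.8
```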
arXiv Detail & Related papers (2021-10-07T02:23:19Z)
- Abstractive Sentence Summarization with Guidance of Selective Multimodal Reference [3.505062507621494]
We propose a Multimodal Hierarchical Selective Transformer (mhsf) model that considers reciprocal relationships among modalities.
We evaluate the generalizability of the proposed mhsf model under both pre-training-plus-fine-tuning and training-from-scratch strategies.
arXiv Detail & Related papers (2021-08-11T09:59:34Z)
- Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models [86.9292779620645]
We develop a contrastive framework for generative model learning, allowing us to train the model not just by the commonality between modalities, but by the distinction between "related" and "unrelated" multimodal data.
Under our proposed framework, the generative model can accurately distinguish related samples from unrelated ones, enabling the use of plentiful unlabeled, unpaired multimodal data.
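As a rough illustration of training on the distinction between related and unrelated multimodal pairs, the sketch below uses a generic symmetric InfoNCE objective in which matched image/text rows in a batch are treated as related and every other pairing as an unrelated negative. This is an assumed, simplified stand-in, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def paired_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: matched image/text rows count as related pairs,
    every other pairing in the batch acts as an unrelated negative."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature   # pairwise cosine similarities
    targets = torch.arange(img.size(0))    # diagonal entries are the related pairs
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random embeddings standing in for encoder outputs.
loss = paired_contrastive_loss(torch.randn(8, 32), torch.randn(8, 32))
print(loss.item())
```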
arXiv Detail & Related papers (2020-07-02T15:08:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.