MPCEval: A Benchmark for Multi-Party Conversation Generation
- URL: http://arxiv.org/abs/2603.04969v1
- Date: Thu, 05 Mar 2026 09:07:00 GMT
- Title: MPCEval: A Benchmark for Multi-Party Conversation Generation
- Authors: Minxing Zhang, Yi Yang, Zhuofan Jia, Xuan Yang, Jian Pei, Yuchen Zang, Xingwang Deng, Xianglong Chen,
- Abstract summary: We introduce MPCEval, a task-aware evaluation and benchmarking suite for multi-party conversation generation. MPCEval decomposes generation quality into speaker modeling, content quality, and speaker–content consistency. We apply MPCEval to diverse public and real-world datasets and evaluate modern generation methods alongside human-authored conversations.
- Score: 23.227067535888768
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-party conversation generation, which underpins applications such as smart reply and collaborative assistants, is an increasingly important capability of generative AI, yet its evaluation remains a critical bottleneck. Compared to two-party dialogue, multi-party settings introduce distinct challenges, including complex turn-taking, role-dependent speaker behavior, long-range conversational structure, and multiple equally valid continuations. Accordingly, we introduce MPCEval, a task-aware evaluation and benchmarking suite for multi-party conversation generation. MPCEval decomposes generation quality into speaker modeling, content quality, and speaker–content consistency, and explicitly distinguishes local next-turn prediction from global full-conversation generation. It provides novel, quantitative, reference-free, and reproducible metrics that scale across datasets and models. We apply MPCEval to diverse public and real-world datasets and evaluate modern generation methods alongside human-authored conversations. The results reveal systematic, dimension-specific model characteristics in participation balance, content progression and novelty, and speaker–content consistency, demonstrating that evaluation objectives critically shape model assessment and that single-score evaluation obscures fundamental differences in multi-party conversational behavior. The implementation of MPCEval and the associated evaluation code are publicly available at https://github.com/Owen-Yang-18/MPCEval.
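To make the flavor of such a reference-free metric concrete, here is a minimal Python sketch of a participation-balance score, computed as the normalized entropy of the speaker turn distribution. This is a hypothetical stand-in written for illustration; the function name, the entropy formulation, and the toy data are assumptions, not MPCEval's published definition.

```python
import math
from collections import Counter

def participation_balance(speakers: list[str]) -> float:
    """Normalized entropy of the speaker turn distribution.

    Returns 1.0 when all speakers take equally many turns and tends
    toward 0.0 as one speaker dominates. Hypothetical stand-in for a
    participation-balance metric; not MPCEval's published definition.
    """
    counts = Counter(speakers)
    if len(counts) < 2:
        return 0.0  # one (or zero) active speakers: maximally unbalanced
    total = len(speakers)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))  # divide by max possible entropy

# Speaker labels for the turns of one generated conversation.
turns = ["alice", "bob", "carol", "alice", "bob", "alice"]
print(f"balance = {participation_balance(turns):.3f}")  # -> balance = 0.921
```

A score near 1.0 means turns are spread evenly across speakers, while a score near 0.0 means one speaker dominates; a dimension-specific signal like this is exactly the kind of information the abstract argues a single aggregate score would obscure.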
Related papers
- On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation [88.77441715819366]
Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content. We propose a variety of likelihood- and generative-based evaluation methods that serve in place of naive global token perplexity.
arXiv Detail & Related papers (2026-01-09T22:01:56Z)
- EvalTalker: Learning to Evaluate Real-Portrait-Driven Multi-Subject Talking Humans [86.21111833841684]
We present THQA-MT, the first large-scale Multi-Talker-generated Talking Human Quality Assessment dataset. We analyze perceptual discrepancies among different Multi-Talkers and identify 12 common types of distortion. We introduce EvalTalker, a novel TH quality assessment framework.
arXiv Detail & Related papers (2025-12-01T06:56:40Z)
- GrandJury: A Collaborative Machine Learning Model Evaluation Protocol for Dynamic Quality Rubrics [0.0]
Generative Machine Learning models have become central to modern systems, powering applications in creative writing, summarization, multi-hop reasoning, and context-aware dialogue. Standard evaluation regimes still rely on static, benchmark-style tests, incentivizing optimization toward leaderboard scores rather than alignment with dynamic user needs or evolving realities. GrandJury introduces a formal evaluation protocol combining time-decayed aggregation (sketched below), complete traceability, dynamic and transparent task attribution, and human judgment.
arXiv Detail & Related papers (2025-08-04T22:00:44Z)
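The time-decayed aggregation mentioned in the GrandJury summary lends itself to a simple formulation: weight each verdict by an exponential decay in its age, then take the weighted mean. The sketch below illustrates that reading; the verdict format, the 30-day half-life, and the aggregation rule are all assumptions rather than GrandJury's actual protocol.

```python
import math
import time

def time_decayed_score(verdicts: list[tuple[float, float]],
                       half_life_days: float = 30.0,
                       now: float | None = None) -> float:
    """Aggregate (timestamp, score) verdicts, weighting newer ones higher.

    A verdict's weight halves every `half_life_days`; the rule and the
    default half-life are assumptions, not GrandJury's specification.
    """
    now = time.time() if now is None else now
    rate = math.log(2) / (half_life_days * 86400)  # decay per second
    weights = [math.exp(-rate * (now - ts)) for ts, _ in verdicts]
    return sum(w * score for w, (_, score) in zip(weights, verdicts)) / sum(weights)

now, day = time.time(), 86400.0
verdicts = [(now - 90 * day, 0.2), (now - 10 * day, 0.8), (now - 1 * day, 0.9)]
print(f"aggregate = {time_decayed_score(verdicts, now=now):.3f}")  # ~0.812
```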
- A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations [112.81207927088117]
PersonaConvBench is a benchmark for evaluating personalized reasoning and generation in multi-turn conversations with large language models (LLMs). We benchmark several commercial and open-source LLMs under a unified prompting setup and observe that incorporating personalized history yields substantial performance improvements.
arXiv Detail & Related papers (2025-05-20T09:13:22Z)
- UniMS-RAG: A Unified Multi-source Retrieval-Augmented Generation for Personalized Dialogue Systems [43.266153244137215]
Large Language Models (LLMs) have shown exceptional capabilities in many natural language understanding and generation tasks.
We decompose the use of multiple sources in generating personalized responses into three sub-tasks: Knowledge Source Selection, Knowledge Retrieval, and Response Generation (see the sketch after this entry).
We propose a novel Unified Multi-Source Retrieval-Augmented Generation system (UniMS-RAG).
arXiv Detail & Related papers (2024-01-24T06:50:20Z)
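The three sub-tasks named in the UniMS-RAG summary suggest a natural pipeline shape: pick a source, retrieve from it, then generate. The sketch below wires those stages together; only the decomposition itself comes from the abstract, while the word-overlap scoring, the source layout, the stubbed generator, and the example data are illustrative assumptions.

```python
def _overlap(query: str, text: str) -> int:
    """Crude relevance proxy: count of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def select_source(query: str, sources: dict[str, list[str]]) -> str:
    """Sub-task 1 - Knowledge Source Selection: pick the best-matching source."""
    return max(sources, key=lambda name: sum(_overlap(query, d) for d in sources[name]))

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Sub-task 2 - Knowledge Retrieval within the chosen source."""
    return sorted(docs, key=lambda d: _overlap(query, d), reverse=True)[:k]

def generate(query: str, evidence: list[str]) -> str:
    """Sub-task 3 - Response Generation (an LLM call in the real system)."""
    return f"[response to {query!r} grounded in {len(evidence)} passage(s)]"

sources = {
    "persona": ["I love hiking in the mountains", "I have two cats"],
    "wiki": ["Hiking is a long vigorous walk on trails or footpaths"],
}
query = "any vigorous hiking trails nearby"
chosen = select_source(query, sources)  # -> "wiki" (highest total overlap)
print(generate(query, retrieve(query, sources[chosen])))
```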
- Harmonizing Code-mixed Conversations: Personality-assisted Code-mixed Response Generation in Dialogues [28.49660948650183]
We introduce a novel approach centered on harnessing the Big Five personality traits, acquired in an unsupervised manner from the conversations, to bolster the performance of response generation.
The benefit is evident in the increase in ROUGE and BLEU scores for the response generation task when the identified personality is integrated into the dialogue context.
arXiv Detail & Related papers (2024-01-18T15:21:16Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU and ROUGE, may not be able to adequately capture these dimensions.
We propose a new LLM-based framework that provides a comprehensive evaluation by comparing generated text and reference text from both objective and subjective aspects (a toy sketch follows this entry).
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
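One plausible way to realize the role-player idea from the summary above is to prompt a judge model once per criterion, each time under a different evaluator persona, and then average the per-criterion scores. The snippet below sketches that reading; the personas, prompt wording, 1-5 scale, and the `judge` callable are assumptions, not the paper's actual prompts or aggregation.

```python
from statistics import mean
from typing import Callable

# Illustrative evaluator personas; the paper's actual roles and prompts differ.
CRITERIA = {
    "grammar": "a strict copy editor judging fluency and correctness",
    "informativeness": "a domain expert checking coverage of key facts",
    "succinctness": "an impatient reader penalizing redundancy",
}

def evaluate_summary(document: str, summary: str,
                     judge: Callable[[str], float]) -> dict[str, float]:
    """Prompt the judge once per role and average the per-criterion scores."""
    scores: dict[str, float] = {}
    for criterion, role in CRITERIA.items():
        prompt = (
            f"You are {role}. Rate the summary's {criterion} from 1 to 5.\n"
            f"Document:\n{document}\n\nSummary:\n{summary}\n\nScore:"
        )
        scores[criterion] = judge(prompt)
    scores["overall"] = mean(scores.values())
    return scores

# Stub judge so the sketch runs offline; replace with a real model call.
print(evaluate_summary("full document text", "candidate summary",
                       judge=lambda prompt: 4.0))
```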
- HeterMPC: A Heterogeneous Graph Neural Network for Response Generation in Multi-Party Conversations [76.64792382097724]
We present HeterMPC, a graph-based neural network for response generation in multi-party conversations (MPCs).
HeterMPC models the semantics of utterances and interlocutors simultaneously with two types of nodes in a graph.
Through multi-hop updating, HeterMPC can adequately utilize the structural knowledge of conversations for response generation (a toy graph-construction sketch follows this entry).
arXiv Detail & Related papers (2022-03-16T09:50:32Z)
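A minimal way to picture HeterMPC's two-node-type graph is to materialize utterance and interlocutor nodes joined by typed edges. The sketch below builds such a structure from a toy conversation; the relation names and the reply-annotation input format are simplifications assumed for illustration, not the paper's exact edge taxonomy.

```python
from collections import defaultdict

def build_mpc_graph(turns: list[tuple[str, str, int | None]]) -> dict:
    """turns: (speaker, utterance_text, index of the turn replied to, or None).

    Returns {(source_node, relation): [target_nodes]} with two node types:
    "utt<i>" for utterances and "spk:<name>" for interlocutors.
    """
    edges: dict[tuple[str, str], list[str]] = defaultdict(list)
    for i, (speaker, _text, reply_to) in enumerate(turns):
        utt, spk = f"utt{i}", f"spk:{speaker}"
        edges[(spk, "speaks")].append(utt)     # interlocutor -> utterance
        edges[(utt, "spoken-by")].append(spk)  # utterance -> interlocutor
        if reply_to is not None:
            edges[(utt, "replies-to")].append(f"utt{reply_to}")
            edges[(f"utt{reply_to}", "replied-by")].append(utt)
    return dict(edges)

turns = [
    ("alice", "Anyone tried the new build?", None),
    ("bob", "Yes, it breaks on startup.", 0),
    ("carol", "Works for me after a clean install.", 0),
]
for (src, rel), dsts in build_mpc_graph(turns).items():
    print(f"{src} -[{rel}]-> {dsts}")
```

A heterogeneous GNN would then run several rounds of message passing over these typed edges, which is the multi-hop updating the summary refers to.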
- Multi-View Sequence-to-Sequence Models with Conversational Structure for Abstractive Dialogue Summarization [72.54873655114844]
Text summarization is one of the most challenging and interesting problems in NLP.
This work proposes a multi-view sequence-to-sequence model by first extracting conversational structures of unstructured daily chats from different views to represent conversations.
Experiments on a large-scale dialogue summarization corpus demonstrated that our methods significantly outperformed previous state-of-the-art models via both automatic evaluations and human judgment.
arXiv Detail & Related papers (2020-10-04T20:12:44Z)