Amulet: Putting Complex Multi-Turn Conversations on the Stand with LLM Juries
- URL: http://arxiv.org/abs/2505.20451v1
- Date: Mon, 26 May 2025 18:46:38 GMT
- Title: Amulet: Putting Complex Multi-Turn Conversations on the Stand with LLM Juries
- Authors: Sahana Ramnath, Anurag Mudgil, Brihi Joshi, Skyler Hallinan, Xiang Ren
- Abstract summary: Amulet is a framework that leverages pertinent linguistic concepts of dialog-acts and maxims to improve the accuracy of LLM-judges. Amulet can be used either as a judge by applying the framework to a single LLM, or integrated into a jury with different LLM judges.
- Score: 30.095571420819912
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Today, large language models are widely used as judges to evaluate responses from other language models. Hence, it is imperative to benchmark and improve these LLM-judges on real-world language model usage: a typical human-assistant conversation is lengthy, and shows significant diversity in topics, intents, and requirements across turns, e.g. social interactions, task requests, feedback. We present Amulet, a framework that leverages pertinent linguistic concepts of dialog-acts and maxims to improve the accuracy of LLM-judges on preference data with complex, multi-turn conversational context. Amulet presents valuable insights about (a) the communicative structures and intents present in the conversation (dialog acts), and (b) the satisfaction of conversational principles (maxims) by the preference responses, and uses them to make judgments. On four challenging datasets, Amulet shows that (a) humans frequently (60 to 70 percent of the time) change their intents from one turn of the conversation to the next, and (b) in 75 percent of instances, the preference responses can be differentiated via dialog acts and/or maxims, reiterating the latter's significance in judging such data. Amulet can be used either as a judge by applying the framework to a single LLM, or integrated into a jury with different LLM judges; our judges and juries show strong improvements on relevant baselines for all four datasets.
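The abstract describes two usage modes: applying the framework to a single LLM judge, or combining several such judges into a jury. As a rough, hypothetical sketch of the jury idea only (not the paper's implementation), the Python snippet below shows how per-judge preference verdicts over a pair of responses might be aggregated by majority vote; the `JUDGE_PROMPT` template, the `judge` callables, and the dialog-act/maxim wording in the prompt are illustrative placeholders, not the authors' interface.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical prompt skeleton: a real Amulet-style judge would additionally be
# given dialog-act annotations for each turn and maxim-satisfaction notes for
# the two candidate responses.
JUDGE_PROMPT = (
    "Conversation:\n{conversation}\n\n"
    "Response A:\n{response_a}\n\nResponse B:\n{response_b}\n\n"
    "Considering the dialog acts of each turn and whether each response "
    "satisfies conversational maxims, answer with 'A' or 'B': which response "
    "is the better reply to the final turn?"
)

def jury_verdict(
    judges: List[Callable[[str], str]],
    conversation: str,
    response_a: str,
    response_b: str,
) -> str:
    """Ask each judge for a preference and return the majority vote.

    Each `judge` is any callable mapping a prompt string to a raw text answer
    (e.g. a wrapper around some LLM API); this is a stand-in, not the paper's
    actual judging pipeline.
    """
    prompt = JUDGE_PROMPT.format(
        conversation=conversation, response_a=response_a, response_b=response_b
    )
    votes = []
    for judge in judges:
        answer = judge(prompt).strip().upper()
        if answer.startswith("A") or answer.startswith("B"):
            votes.append(answer[0])
    if not votes:
        raise ValueError("No judge returned a parseable verdict.")
    # Majority vote over the collected 'A'/'B' labels.
    return Counter(votes).most_common(1)[0][0]

if __name__ == "__main__":
    # Toy judges that always answer 'A' or 'B', just to exercise the voting logic.
    always_a = lambda prompt: "A"
    always_b = lambda prompt: "B"
    print(jury_verdict([always_a, always_a, always_b], "User: Hi!", "Hello!", "Go away."))  # -> "A"
```

A real jury would also need an explicit tie-breaking rule and a more robust parser for judge outputs; `Counter.most_common` alone simply returns the first of any tied labels.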
Related papers
- When Large Language Models are Reliable for Judging Empathic Communication [41.01696584595341]
Large language models (LLMs) excel at generating empathic responses in text-based conversations. How reliably do they judge the nuances of empathic communication? We compare how experts, crowdworkers, and LLMs annotate empathic communication across four evaluative frameworks.
arXiv Detail & Related papers (2025-06-11T20:10:23Z)
- Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation [17.330188045948663]
We introduce Debate Speech Evaluation as a novel and challenging benchmark for assessing LLM judges. This task involves a unique set of cognitive abilities that have previously received limited attention in systematic benchmarking. We leverage a dataset of over 600 meticulously annotated debate speeches and present the first in-depth analysis of how state-of-the-art LLMs compare to human judges on this task.
arXiv Detail & Related papers (2025-06-05T14:06:51Z)
- MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Chatbots and Dialogue Evaluators [8.672875654352689]
This paper introduces MEDAL, an automated multi-agent framework for generating, evaluating, and curating dialogue evaluation benchmarks. We generate multilingual user-chatbot dialogues conditioned on varied seed contexts. A strong LLM is then used for a multidimensional analysis of the chatbots' performance, uncovering noticeable cross-lingual performance differences. The resulting benchmark is then used to assess the ability of several reasoning and non-reasoning LLMs to act as evaluators of open-domain dialogues.
arXiv Detail & Related papers (2025-05-28T18:45:42Z)
- Arbiters of Ambivalence: Challenges of Using LLMs in No-Consensus Tasks [52.098988739649705]
This study examines the biases and limitations of LLMs in three roles: answer generator, judge, and debater. We develop a "no-consensus" benchmark by curating examples that encompass a variety of a priori ambivalent scenarios. Our results show that while LLMs can provide nuanced assessments when generating open-ended answers, they tend to take a stance on no-consensus topics when employed as judges or debaters.
arXiv Detail & Related papers (2025-05-28T01:31:54Z)
- Multimodal Conversation Structure Understanding [12.29827265137757]
Large language models' ability to understand fine-grained conversational structure remains underexplored. We present a human-annotated dataset with 4,398 annotations for speaker and reply-to relations, 5,755 addressee annotations, and 3,142 side-participant annotations. We evaluate popular audio-visual LLMs and vision-language models on our dataset, and our experimental results suggest that multimodal conversational structure understanding remains challenging.
arXiv Detail & Related papers (2025-05-23T06:41:54Z)
- Human Preferences for Constructive Interactions in Language Model Alignment [0.0]
We examined how linguistic attributes linked to constructive interactions are reflected in human preference data used for training AI. We found that users consistently preferred well-reasoned and nuanced responses while rejecting those high in personal storytelling.
arXiv Detail & Related papers (2025-03-05T15:08:41Z)
- REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation [51.97224538045096]
We introduce REALTALK, a 21-day corpus of authentic messaging-app dialogues. We compare emotional intelligence (EI) attributes and persona consistency to understand the challenges posed by real-world dialogues. Our findings reveal that models struggle to simulate a user solely from dialogue history, while fine-tuning on specific user chats improves persona emulation.
arXiv Detail & Related papers (2025-02-18T20:29:01Z)
- MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation [52.35744453954844]
This paper introduces MMRC, a benchmark for evaluating six core open-ended abilities of MLLMs. Evaluations of 20 MLLMs on MMRC indicate an accuracy drop during open-ended interactions. We propose a simple yet effective NOTE-TAKING strategy, which can record key information from the conversation and remind the model during its responses.
arXiv Detail & Related papers (2025-02-17T15:24:49Z)
- RAD-Bench: Evaluating Large Language Models Capabilities in Retrieval Augmented Dialogues [8.036117602566074]
External retrieval mechanisms are often employed to enhance the quality of augmented generations in dialogues. Existing benchmarks either assess LLMs' chat abilities in multi-turn dialogues or their use of retrieval for augmented responses in single-turn settings. We introduce RAD-Bench, a benchmark designed to evaluate LLMs' capabilities in multi-turn dialogues following retrievals.
arXiv Detail & Related papers (2024-09-19T08:26:45Z)
- Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training [33.57497419019826]
Action-Based Contrastive Self-Training (ACT) allows for sample-efficient dialogue policy learning in multi-turn conversation.
ACT demonstrates substantial conversation modeling improvements over standard approaches to supervised fine-tuning and DPO.
arXiv Detail & Related papers (2024-05-31T22:44:48Z)
- Exploring the Factual Consistency in Dialogue Comprehension of Large Language Models [51.75805497456226]
This work focuses on the factual consistency issue with the help of the dialogue summarization task.
Our evaluation shows that, on average, 26.8% of the summaries generated by LLMs contain factual inconsistency.
To stimulate and enhance the dialogue comprehension ability of LLMs, we propose a fine-tuning paradigm with auto-constructed multi-task data.
arXiv Detail & Related papers (2023-11-13T09:32:12Z)
- BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues [72.65163468440434]
This report provides a preliminary evaluation of existing large language models for human-style multi-turn chatting.
We prompt large language models (LLMs) to generate a full multi-turn dialogue based on the ChatSEED, utterance by utterance.
We find that GPT-4 can generate human-style multi-turn dialogues with impressive quality, significantly outperforming its counterparts.
arXiv Detail & Related papers (2023-10-20T16:53:51Z)
- Cue-CoT: Chain-of-thought Prompting for Responding to In-depth Dialogue Questions with LLMs [59.74002011562726]
We propose a novel linguistic cue-based chain-of-thoughts (Cue-CoT) to provide a more personalized and engaging response.
We build a benchmark with in-depth dialogue questions, consisting of 6 datasets in both Chinese and English.
Empirical results demonstrate that our proposed Cue-CoT method outperforms standard prompting methods in terms of both helpfulness and acceptability on all datasets.
arXiv Detail & Related papers (2023-05-19T16:27:43Z)