Related papers: Interpretable User Satisfaction Estimation for Conversational Systems with Large Language Models

Interpretable User Satisfaction Estimation for Conversational Systems with Large Language Models

URL: http://arxiv.org/abs/2403.12388v2
Date: Sun, 9 Jun 2024 00:58:25 GMT
Title: Interpretable User Satisfaction Estimation for Conversational Systems with Large Language Models
Authors: Ying-Chun Lin, Jennifer Neville, Jack W. Stokes, Longqi Yang, Tara Safavi, Mengting Wan, Scott Counts, Siddharth Suri, Reid Andersen, Xiaofeng Xu, Deepak Gupta, Sujay Kumar Jauhar, Xia Song, Georg Buscher, Saurabh Tiwary, Brent Hecht, Jaime Teevan,
Abstract summary: Existing approaches based on featurized ML models or text embeddings fall short in extracting generalizable patterns. We show that LLMs can extract interpretable signals of user satisfaction from their natural language utterances more effectively than embedding-based approaches.
Score: 35.95405294377247
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Accurate and interpretable user satisfaction estimation (USE) is critical for understanding, evaluating, and continuously improving conversational systems. Users express their satisfaction or dissatisfaction with diverse conversational patterns in both general-purpose (ChatGPT and Bing Copilot) and task-oriented (customer service chatbot) conversational systems. Existing approaches based on featurized ML models or text embeddings fall short in extracting generalizable patterns and are hard to interpret. In this work, we show that LLMs can extract interpretable signals of user satisfaction from their natural language utterances more effectively than embedding-based approaches. Moreover, an LLM can be tailored for USE via an iterative prompting framework using supervision from labeled examples. The resulting method, Supervised Prompting for User satisfaction Rubrics (SPUR), not only has higher accuracy but is more interpretable as it scores user satisfaction via learned rubrics with a detailed breakdown.

Related papers

Say It Another Way: Auditing LLMs with a User-Grounded Automated Paraphrasing Framework [9.162876771766513]
We introduce AUGMENT, a framework for generating controlled, realistic prompt paraphrases based on linguistic structure and user demographics.<n>AUGMENT ensures paraphrase quality through a combination of semantic, stylistic, and instruction-following criteria.<n>Our findings highlight the need for more representative and structured approaches to prompt variation in large language models.
arXiv Detail & Related papers (2025-05-06T14:17:30Z)
CLEAR-KGQA: Clarification-Enhanced Ambiguity Resolution for Knowledge Graph Question Answering [13.624962763072899]
KGQA systems typically assume user queries are unambiguous, which is an assumption that rarely holds in real-world applications. We propose a novel framework that dynamically handles both entity ambiguity (e.g., distinguishing between entities with similar names) and intent ambiguity (e.g., clarifying different interpretations of user queries) through interactive clarification.
arXiv Detail & Related papers (2025-04-13T17:34:35Z)
LLM-Augmented Graph Neural Recommenders: Integrating User Reviews [2.087411180679868]
We propose a framework that employs a Graph Neural Network (GNN)-based model and an large language model (LLMs) to produce review-aware representations. Our approach balances user-item interactions against text-derived features, ensuring that user's both behavioral and linguistic signals are effectively captured.
arXiv Detail & Related papers (2025-04-03T00:40:09Z)
LLM-guided Plan and Retrieval: A Strategic Alignment for Interpretable User Satisfaction Estimation in Dialogue [5.070104802923903]
PRAISE is an interpretable framework for effective user satisfaction prediction. It operates through three key modules. It achieves state-of-the-art performance on three benchmarks for the User Satisfaction Estimation task.
arXiv Detail & Related papers (2025-03-06T18:12:33Z)
UQABench: Evaluating User Embedding for Prompting LLMs in Personalized Question Answering [39.79275025010785]
name is a benchmark designed to evaluate the effectiveness of user embeddings in prompting large language models for personalization. We conduct extensive experiments on various state-of-the-art methods for modeling user embeddings.
arXiv Detail & Related papers (2025-02-26T14:34:00Z)
Unveiling User Preferences: A Knowledge Graph and LLM-Driven Approach for Conversational Recommendation [55.5687800992432]
We propose a plug-and-play framework that synergizes Large Language Models (LLMs) and Knowledge Graphs (KGs) to unveil user preferences. This enables the LLM to transform KG entities into concise natural language descriptions, allowing them to comprehend domain-specific knowledge.
arXiv Detail & Related papers (2024-11-16T11:47:21Z)
Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text [12.879551933541345]
Large Language Models (LLMs) are capable of generating human-like conversations. Conventional metrics like BLEU and ROUGE are inadequate for capturing the subtle semantics and contextual richness of such generative outputs. We propose a reference-guided verdict method that automates the evaluation process by leveraging multiple LLMs-as-judges.
arXiv Detail & Related papers (2024-08-17T16:01:45Z)
Large Language Models are Interpretable Learners [53.56735770834617]
In this paper, we show a combination of Large Language Models (LLMs) and symbolic programs can bridge the gap between expressiveness and interpretability. The pretrained LLM with natural language prompts provides a massive set of interpretable modules that can transform raw input into natural language concepts. As the knowledge learned by LSP is a combination of natural language descriptions and symbolic rules, it is easily transferable to humans (interpretable) and other LLMs.
arXiv Detail & Related papers (2024-06-25T02:18:15Z)
Interactive Analysis of LLMs using Meaningful Counterfactuals [22.755345889167934]
Counterfactual examples are useful for exploring the decision boundaries of machine learning models. How can we apply counterfactual-based methods to analyze and explain LLMs? We propose a novel algorithm for generating batches of complete and meaningful textual counterfactuals. In our experiments, 97.2% of the counterfactuals are grammatically correct.
arXiv Detail & Related papers (2024-04-23T19:57:03Z)
CAUSE: Counterfactual Assessment of User Satisfaction Estimation in Task-Oriented Dialogue Systems [60.27663010453209]
We leverage large language models (LLMs) to generate satisfaction-aware counterfactual dialogues. We gather human annotations to ensure the reliability of the generated samples. Our results shed light on the need for data augmentation approaches for user satisfaction estimation in TOD systems.
arXiv Detail & Related papers (2024-03-27T23:45:31Z)
RecExplainer: Aligning Large Language Models for Explaining Recommendation Models [50.74181089742969]
Large language models (LLMs) have demonstrated remarkable intelligence in understanding, reasoning, and instruction following. This paper presents the initial exploration of using LLMs as surrogate models to explain black-box recommender models. To facilitate an effective alignment, we introduce three methods: behavior alignment, intention alignment, and hybrid alignment.
arXiv Detail & Related papers (2023-11-18T03:05:43Z)
Unlocking the Potential of User Feedback: Leveraging Large Language Model as User Simulator to Enhance Dialogue System [65.93577256431125]
We propose an alternative approach called User-Guided Response Optimization (UGRO) to combine it with a smaller task-oriented dialogue model. This approach uses LLM as annotation-free user simulator to assess dialogue responses, combining them with smaller fine-tuned end-to-end TOD models. Our approach outperforms previous state-of-the-art (SOTA) results.
arXiv Detail & Related papers (2023-06-16T13:04:56Z)
Rethinking the Evaluation for Conversational Recommendation in the Era of Large Language Models [115.7508325840751]
The recent success of large language models (LLMs) has shown great potential to develop more powerful conversational recommender systems (CRSs) In this paper, we embark on an investigation into the utilization of ChatGPT for conversational recommendation, revealing the inadequacy of the existing evaluation protocol. We propose an interactive Evaluation approach based on LLMs named iEvaLM that harnesses LLM-based user simulators.
arXiv Detail & Related papers (2023-05-22T15:12:43Z)
Approximating Online Human Evaluation of Social Chatbots with Prompting [11.657633779338724]
Existing evaluation metrics aim to automate offline user evaluation and approximate human judgment of pre-curated dialogs. We propose an approach to approximate online human evaluation leveraging large language models (LLMs) from the GPT family. We introduce a new Dialog system Evaluation framework based on Prompting (DEP), which enables a fully automatic evaluation pipeline.
arXiv Detail & Related papers (2023-04-11T14:45:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.