Related papers: Detecting Response Generation Not Requiring Factual Judgment

URL: http://arxiv.org/abs/2406.09702v1
Date: Fri, 14 Jun 2024 04:03:24 GMT
Title: Detecting Response Generation Not Requiring Factual Judgment
Authors: Ryohei Kamei, Daiki Shiono, Reina Akama, Jun Suzuki
Abstract summary: This study aims to achieve both attractiveness and factuality in dialogue responses by setting a task of predicting sentences that do not require factual correctness judgment. We created a dataset for this task via crowdsourcing, the dialogue dataset annotated with fact-check-needed label (DDFC), and ran classification experiments with several models on it. The best-performing model reached a classification accuracy of about 88%.
Score: 14.921007421043198
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the remarkable development of large language models (LLMs), ensuring the factuality of their output has become a challenge. However, grounding every part of a response in given knowledge or facts is not necessarily desirable in dialogue. This study aims to achieve both attractiveness and factuality in dialogue responses by setting a task of predicting sentences that do not require factual correctness judgment, such as expressions of agreement or personal opinions and feelings. We created a dataset for this task via crowdsourcing, the dialogue dataset annotated with fact-check-needed label (DDFC), and ran classification experiments with several models on it. The best-performing model reached a classification accuracy of about 88%.

Related papers

A Knowledge Graph and a Tripartite Evaluation Framework Make Retrieval-Augmented Generation Scalable and Transparent [0.0]
This study presents a Retrieval Augmented Generation (RAG) system that harnesses a knowledge graph and vector search retrieval to deliver context-rich responses. A central innovation of this work is the introduction of RAG Evaluation (RAG-Eval), a novel chain-of-thought tripartite evaluation framework. RAG-Eval reliably detects factual gaps and query mismatches, thereby fostering trust in high-demand, data-centric environments.
arXiv Detail & Related papers (2025-09-23T16:29:22Z)
FineDialFact: A benchmark for Fine-grained Dialogue Fact Verification [45.2458418225596]
Large Language Models (LLMs) are known to produce hallucinations - factually incorrect or fabricated information. Current approaches to hallucination detection in dialogue systems primarily focus on verifying the factual consistency of generated responses. We introduce a benchmark, FineDialFact, for fine-grained dialogue fact verification.
arXiv Detail & Related papers (2025-08-07T18:51:03Z)
Pointwise Mutual Information as a Performance Gauge for Retrieval-Augmented Generation [78.28197013467157]
We show that the pointwise mutual information between a context and a question is an effective gauge for language model performance. We propose two methods that use the pointwise mutual information between a document and a question as a gauge for selecting and constructing prompts that lead to better performance.
arXiv Detail & Related papers (2024-11-12T13:14:09Z)
Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score. Results on three information extraction tasks show that human annotators prefer SQC-Score to the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
Exploring the Factual Consistency in Dialogue Comprehension of Large Language Models [51.75805497456226]
This work focuses on the factual consistency issue with the help of the dialogue summarization task. Our evaluation shows that, on average, 26.8% of the summaries generated by LLMs contain factual inconsistency. To stimulate and enhance the dialogue comprehension ability of LLMs, we propose a fine-tuning paradigm with auto-constructed multi-task data.
arXiv Detail & Related papers (2023-11-13T09:32:12Z)
Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [50.62245481416744]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world. We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique. By further elaborating the robustness metric, a model is judged to be robust if its performance is consistently accurate on the overall cliques.
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
Revisiting text decomposition methods for NLI-based factuality scoring of summaries [9.044665059626958]
We show that fine-grained decomposition is not always a winning strategy for factuality scoring. We also show that small changes to previously proposed entailment-based scoring methods can result in better performance.
arXiv Detail & Related papers (2022-11-30T09:54:37Z)
Questioning the Validity of Summarization Datasets and Improving Their Factual Consistency [14.974996886744083]
We release SummFC, a filtered summarization dataset with improved factual consistency. We argue that our dataset should become a valid benchmark for developing and evaluating summarization systems.
arXiv Detail & Related papers (2022-10-31T15:04:20Z)
Weakly Supervised Data Augmentation Through Prompting for Dialogue Understanding [103.94325597273316]
We present a novel approach that iterates on augmentation quality by applying weakly-supervised filters. We evaluate our methods on the emotion and act classification tasks in DailyDialog and the intent classification task in Facebook Multilingual Task-Oriented Dialogue. For DailyDialog specifically, using 10% of the ground truth data we outperform the current state-of-the-art model which uses 100% of the data.
arXiv Detail & Related papers (2022-10-25T17:01:30Z)
Robustness of end-to-end Automatic Speech Recognition Models -- A Case Study using Mozilla DeepSpeech [2.715884199292287]
We argue that many performance numbers reported probably underestimate the expected error rate. We conduct experiments controlling for selection bias, gender as well as overlap (between training and test data) in content, voices, and recording conditions.
arXiv Detail & Related papers (2021-05-08T16:46:44Z)
Generating Fact Checking Explanations [52.879658637466605]
A crucial piece of the puzzle that is still missing is to understand how to automate the most elaborate part of the process. This paper provides the first study of how these explanations can be generated automatically based on available claim context. Our results indicate that optimising both objectives at the same time, rather than training them separately, improves the performance of a fact checking system.
arXiv Detail & Related papers (2020-04-13T05:23:25Z)
Improving Multi-Turn Response Selection Models with Complementary Last-Utterance Selection by Instance Weighting [84.9716460244444]
We consider utilizing the underlying correlation in the data resource itself to derive different kinds of supervision signals. We conduct extensive experiments in two public datasets and obtain significant improvement in both datasets.
arXiv Detail & Related papers (2020-02-18T06:29:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.