Contextualized Topic Coherence Metrics
- URL: http://arxiv.org/abs/2305.14587v1
- Date: Tue, 23 May 2023 23:53:29 GMT
- Title: Contextualized Topic Coherence Metrics
- Authors: Hamed Rahimi, Jacob Louis Hoover, David Mimno, Hubert Naacke, Camelia Constantin, Bernd Amann
- Abstract summary: We propose methods inspired by standard human topic evaluations, in a family of metrics called Contextualized Topic Coherence (CTC).
We evaluate CTC relative to five other metrics on six topic models and find that it outperforms automated topic coherence methods.
- Score: 6.630482733703617
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   The recent explosion in work on neural topic modeling has been criticized for
optimizing automated topic evaluation metrics at the expense of actual
meaningful topic identification. But human annotation remains expensive and
time-consuming. We propose LLM-based methods inspired by standard human topic
evaluations, in a family of metrics called Contextualized Topic Coherence
(CTC). We evaluate both a fully automated version as well as a semi-automated
CTC that allows human-centered evaluation of coherence while maintaining the
efficiency of automated methods. We evaluate CTC relative to five other metrics
on six topic models and find that it outperforms automated topic coherence
methods, works well on short documents, and is not susceptible to meaningless
but high-scoring topics.
 
      
Related papers
        - Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey [64.08485471150486]
 This survey examines evaluation methods for large language model (LLM)-based agents in multi-turn conversational settings.
We systematically reviewed nearly 250 scholarly sources, capturing the state of the art from various venues of publication.
 arXiv (2025-03-28T14:08:40Z)
- CSEval: Towards Automated, Multi-Dimensional, and Reference-Free Counterspeech Evaluation using Auto-Calibrated LLMs [18.827745815939213]
 We introduce CSEval, a novel dataset and framework for evaluating counterspeech quality across four dimensions.
We propose Auto-Calibrated COT for Counterspeech Evaluation (Auto-CSEval), a prompt-based method with auto-calibrated chain-of-thoughts.
Our experiments show that Auto-CSEval outperforms traditional metrics like ROUGE, METEOR, and BertScore in correlating with human judgement.
 arXiv (2025-01-29T11:38:29Z)
- Towards Automatic Evaluation for Image Transcreation [52.71090829502756]
 We propose a suite of automatic evaluation metrics inspired by machine translation (MT) metrics.
We identify three critical dimensions of image transcreation: cultural relevance, semantic equivalence and visual similarity.
Our results show that proprietary VLMs best identify cultural relevance and semantic equivalence, while vision-encoder representations are adept at measuring visual similarity.
 arXiv (2024-12-18T10:55:58Z)
- Automatic Die Studies for Ancient Numismatics [3.384989790372139]
 Die studies are fundamental to quantifying ancient monetary production.
Few works have attempted to automate this task, and none have been properly released and evaluated from a computer vision perspective.
We propose a fully automatic approach that introduces several innovations compared to previous methods.
 arXiv (2024-07-30T14:54:54Z)
- Evaluating the IWSLT2023 Speech Translation Tasks: Human Annotations, Automatic Metrics, and Segmentation [50.60733773088296]
 We conduct a comprehensive human evaluation of the results of several shared tasks from the last International Workshop on Spoken Language Translation (IWSLT 2023).
We propose an effective evaluation strategy based on automatic resegmentation and direct assessment with segment context.
Our analysis revealed that: 1) the proposed evaluation strategy is robust and its scores are well-correlated with other types of human judgements; 2) automatic metrics are usually, but not always, well-correlated with direct assessment scores; and 3) COMET is a slightly stronger automatic metric than chrF.
 arXiv (2024-06-06T09:18:42Z)
- Beyond Static Evaluation: A Dynamic Approach to Assessing AI Assistants' API Invocation Capabilities [48.922660354417204]
 We propose Automated Dynamic Evaluation (AutoDE) to assess an assistant's API call capability without human involvement.
In our framework, we endeavor to closely mirror genuine human conversation patterns in human-machine interactions.
 arXiv (2024-03-17T07:34:12Z)
- Improving the TENOR of Labeling: Re-evaluating Topic Models for Content Analysis [5.757610495733924]
 We conduct the first evaluation of neural, supervised and classical topic models in an interactive task based setting.
We show that current automated metrics do not provide a complete picture of topic modeling capabilities.
 arXiv (2024-01-29T17:54:04Z)
- Correction of Errors in Preference Ratings from Automated Metrics for Text Generation [4.661309379738428]
 We propose a statistical model of Text Generation evaluation that accounts for the error-proneness of automated metrics.
We show that our model enables an efficient combination of human and automated ratings to remedy the error-proneness of the automated metrics.
 arXiv (2023-06-06T17:09:29Z)
- Automated Metrics for Medical Multi-Document Summarization Disagree with Human Evaluations [22.563596069176047]
 We analyze how automated summarization evaluation metrics correlate with lexical features of generated summaries.
We find that not only do automated metrics fail to capture aspects of quality as assessed by humans, in many cases the system rankings produced by these metrics are anti-correlated with rankings according to human annotators.
 arXiv (2023-05-23T05:00:59Z)
- The Glass Ceiling of Automatic Evaluation in Natural Language Generation [60.59732704936083]
 We take a step back and analyze recent progress by comparing the body of existing automatic metrics and human metrics.
Our extensive statistical analysis reveals surprising findings: automatic metrics -- old and new -- are much more similar to each other than to humans.
 arXiv (2022-08-31T01:13:46Z)
- Is Automated Topic Model Evaluation Broken?: The Incoherence of Coherence [62.826466543958624]
 We look at the standardization gap and the validation gap in topic model evaluation.
Recent models relying on neural components surpass classical topic models according to these metrics.
We use automatic coherence along with the two most widely accepted human judgment tasks, namely, topic rating and word intrusion.
 arXiv (2021-07-05T17:58:52Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
 We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
 ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
 arXiv (2021-02-20T03:29:20Z)
- Unsupervised Summarization for Chat Logs with Topic-Oriented Ranking and Context-Aware Auto-Encoders [59.038157066874255]
 We propose a novel framework called RankAE to perform chat summarization without employing manually labeled data.
RankAE consists of a topic-oriented ranking strategy that selects topic utterances according to centrality and diversity simultaneously.
A denoising auto-encoder is designed to generate succinct but context-informative summaries based on the selected utterances.
 arXiv (2020-12-14T07:31:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     