ProxAnn: Use-Oriented Evaluations of Topic Models and Document Clustering
- URL: http://arxiv.org/abs/2507.00828v1
- Date: Tue, 01 Jul 2025 15:00:55 GMT
- Title: ProxAnn: Use-Oriented Evaluations of Topic Models and Document Clustering
- Authors: Alexander Hoyle, Lorena Calvo-Bartolomé, Jordan Boyd-Graber, Philip Resnik
- Abstract summary: We design a scalable human evaluation protocol that reflects practitioners' real-world usage of models. We use this protocol to collect extensive crowdworker annotations of outputs from a diverse set of topic models. We then use these annotations to validate automated proxies, finding that the best LLM proxies are statistically indistinguishable from a human annotator.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Topic model and document-clustering evaluations either use automated metrics that align poorly with human preferences or require expert labels that are intractable to scale. We design a scalable human evaluation protocol and a corresponding automated approximation that reflect practitioners' real-world usage of models. Annotators -- or an LLM-based proxy -- review text items assigned to a topic or cluster, infer a category for the group, then apply that category to other documents. Using this protocol, we collect extensive crowdworker annotations of outputs from a diverse set of topic models on two datasets. We then use these annotations to validate automated proxies, finding that the best LLM proxies are statistically indistinguishable from a human annotator and can therefore serve as a reasonable substitute in automated evaluations. Package, web interface, and data are at https://github.com/ahoho/proxann
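To make the protocol above concrete, here is a minimal Python sketch of its two annotation steps with an LLM standing in for the human annotator. The llm() callable and the prompt wording are illustrative assumptions, not the ProxAnn package's actual interface or templates.

```python
# Illustrative sketch of the two-step protocol: infer a category from a
# topic's representative documents, then apply it to held-out documents.
# llm() is a hypothetical stand-in for any chat-completion call.
from typing import Callable, List

def proxy_annotate(
    llm: Callable[[str], str],
    top_docs: List[str],   # documents most strongly assigned to one topic/cluster
    eval_docs: List[str],  # held-out documents to judge against the inferred category
) -> List[str]:
    # Step 1: review representative items and infer a category for the group.
    category = llm(
        "Read the following documents and name the single category "
        "that best describes them:\n\n" + "\n---\n".join(top_docs)
    )
    # Step 2: apply that category to other documents, mirroring how a
    # practitioner would actually use the model's output.
    judgments = []
    for doc in eval_docs:
        judgments.append(
            llm(f"Category: {category}\n\nDoes this document fit the category? "
                f"Answer 'fits' or 'does not fit'.\n\n{doc}")
        )
    return judgments
```

Agreement between these proxy judgments and crowdworker judgments is what the paper uses to validate LLMs as substitutes for human annotators.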
Related papers
- Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora [9.871701356351542]
Language Models (LMs) continue to advance, improving response quality and coherence. A plethora of evaluation benchmarks have been constructed to assess model quality, response appropriateness, and reasoning capabilities. We propose a methodology for automating the construction of fact-based synthetic data model evaluations grounded in document populations.
arXiv Detail & Related papers (2025-05-13T18:50:03Z)
- Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z)
- How to Evaluate Entity Resolution Systems: An Entity-Centric Framework with Application to Inventor Name Disambiguation [1.7812428873698403]
We propose an entity-centric data labeling methodology that integrates with a unified framework for monitoring summary statistics.
These benchmark data sets can then be used for model training and a variety of evaluation tasks.
arXiv Detail & Related papers (2024-04-08T15:53:29Z)
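As a rough illustration of the entity-centric idea in the entry above, the sketch below scores a predicted clustering from the viewpoint of sampled ground-truth entities. The data structures and simple averaging are assumptions made for illustration, not the paper's exact estimator.

```python
# Hedged sketch of entity-centric evaluation: sample true entities, then
# compute precision/recall of the predicted clustering for each one.
from collections import defaultdict

def entity_centric_prf(true_entity_of: dict, pred_cluster_of: dict, sampled_entities):
    records_of_entity = defaultdict(set)
    records_of_cluster = defaultdict(set)
    for rec, ent in true_entity_of.items():
        records_of_entity[ent].add(rec)
    for rec, clu in pred_cluster_of.items():
        records_of_cluster[clu].add(rec)

    precisions, recalls = [], []
    for ent in sampled_entities:
        truth = records_of_entity[ent]
        # Evaluate against the predicted cluster of one of the entity's records.
        rec = next(iter(truth))
        pred = records_of_cluster[pred_cluster_of[rec]]
        overlap = len(truth & pred)
        precisions.append(overlap / len(pred))
        recalls.append(overlap / len(truth))
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)
```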
- BYOC: Personalized Few-Shot Classification with Co-Authored Class Descriptions [2.076173115539025]
We propose a novel approach to few-shot text classification using an LLM.
Rather than few-shot examples, the LLM is prompted with descriptions of the salient features of each class.
Examples, questions, and answers are summarized to form the classification prompt.
arXiv Detail & Related papers (2023-10-09T19:37:38Z)
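A minimal sketch of the description-based prompting idea from the BYOC entry above, assuming a hypothetical llm() completion function; the prompt format is invented for illustration.

```python
# Sketch: classify with class descriptions instead of few-shot examples.
# llm() is a hypothetical stand-in for any chat-completion call.
from typing import Callable, Dict

def classify_by_description(llm: Callable[[str], str], text: str,
                            class_descriptions: Dict[str, str]) -> str:
    # The prompt lists each class's salient features rather than examples.
    described = "\n".join(
        f"- {name}: {desc}" for name, desc in class_descriptions.items()
    )
    return llm(
        "Classes and their salient features:\n"
        f"{described}\n\n"
        f"Text: {text}\n"
        "Answer with the single best-matching class name."
    )
```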
- Towards Open-Domain Topic Classification [69.21234350688098]
We introduce an open-domain topic classification system that accepts user-defined taxonomies in real time.
Users can classify a text snippet with respect to any candidate labels they want and get an instant response from our web interface.
arXiv Detail & Related papers (2023-06-29T20:25:28Z)
- Natural Language-Based Synthetic Data Generation for Cluster Analysis [4.13592995550836]
Cluster analysis relies on effective benchmarks for evaluating and comparing different algorithms. We propose synthetic data generation based on direct specification of high-level scenarios. Our open-source Python package repliclust implements this workflow.
arXiv Detail & Related papers (2023-03-24T23:45:27Z)
- What is wrong with you?: Leveraging User Sentiment for Automatic Dialog Evaluation [73.03318027164605]
We propose to use information that can be automatically extracted from the next user utterance as a proxy to measure the quality of the previous system response.
Our model generalizes across both spoken and written open-domain dialog corpora collected from real and paid users.
arXiv Detail & Related papers (2022-03-25T22:09:52Z)
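The core idea in the entry above can be sketched as scoring a system response by an automatic signal extracted from the following user utterance. The toy cue list below is a crude stand-in for the paper's learned sentiment model.

```python
# Sketch: estimate the quality of a system turn from the *next* user turn.
# The substring cues are illustrative; a real system would use a trained
# sentiment classifier instead.
NEGATIVE_CUES = {"what?", "no", "wrong", "that's not", "you misunderstood"}

def response_quality_proxy(system_turn: str, next_user_turn: str) -> float:
    utterance = next_user_turn.lower()
    hits = sum(cue in utterance for cue in NEGATIVE_CUES)
    # Fewer negative cues in the follow-up -> higher estimated quality.
    return 1.0 / (1.0 + hits)
```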
- Ranking Models in Unlabeled New Environments [74.33770013525647]
We introduce the problem of ranking models in unlabeled new environments.
We use a proxy dataset that 1) is fully labeled and 2) well reflects the true model rankings in a given target environment.
Specifically, datasets that are more similar to the unlabeled target domain are found to better preserve the relative performance rankings.
arXiv Detail & Related papers (2021-08-23T17:57:15Z)
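A small sketch of the proxy-ranking setup above: models are ranked on a fully labeled proxy dataset, and that ranking is checked against the (normally unavailable) target ranking via rank correlation. All model names and scores here are hypothetical.

```python
# Sketch: a labeled proxy dataset stands in for the unlabeled target when
# ranking models; Spearman correlation measures how well the proxy ranking
# preserves the true target ranking.
from scipy.stats import spearmanr

proxy_scores  = {"model_a": 0.81, "model_b": 0.74, "model_c": 0.69}
target_scores = {"model_a": 0.77, "model_b": 0.72, "model_c": 0.60}  # hypothetical oracle

models = sorted(proxy_scores)
rho, _ = spearmanr([proxy_scores[m] for m in models],
                   [target_scores[m] for m in models])
print(f"Rank agreement between proxy and target: rho={rho:.2f}")
```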
- Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as, or better than, traditional approaches to problems arising in short texts.
arXiv Detail & Related papers (2021-06-15T20:55:55Z)