TnT-LLM: Text Mining at Scale with Large Language Models
- URL: http://arxiv.org/abs/2403.12173v1
- Date: Mon, 18 Mar 2024 18:45:28 GMT
- Title: TnT-LLM: Text Mining at Scale with Large Language Models
- Authors: Mengting Wan, Tara Safavi, Sujay Kumar Jauhar, Yujin Kim, Scott Counts, Jennifer Neville, Siddharth Suri, Chirag Shah, Ryen W White, Longqi Yang, Reid Andersen, Georg Buscher, Dhruv Joshi, Nagu Rangan
- Abstract summary: TnT-LLM uses Large Language Models (LLMs) to automate the process of end-to-end label generation and assignment with minimal human effort.
We show that TnT-LLM generates more accurate and relevant label taxonomies than state-of-the-art baselines.
We also share our practical experiences and insights on the challenges and opportunities of using LLMs for large-scale text mining in real-world applications.
- Score: 24.731544646232962
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Transforming unstructured text into structured and meaningful forms, organized by useful category labels, is a fundamental step in text mining for downstream analysis and application. However, most existing methods for producing label taxonomies and building text-based label classifiers still rely heavily on domain expertise and manual curation, making the process expensive and time-consuming. This is particularly challenging when the label space is under-specified and large-scale data annotations are unavailable. In this paper, we address these challenges with Large Language Models (LLMs), whose prompt-based interface facilitates the induction and use of large-scale pseudo labels. We propose TnT-LLM, a two-phase framework that employs LLMs to automate the process of end-to-end label generation and assignment with minimal human effort for any given use-case. In the first phase, we introduce a zero-shot, multi-stage reasoning approach which enables LLMs to produce and refine a label taxonomy iteratively. In the second phase, LLMs are used as data labelers that yield training samples so that lightweight supervised classifiers can be reliably built, deployed, and served at scale. We apply TnT-LLM to the analysis of user intent and conversational domain for Bing Copilot (formerly Bing Chat), an open-domain chat-based search engine. Extensive experiments using both human and automatic evaluation metrics demonstrate that TnT-LLM generates more accurate and relevant label taxonomies when compared against state-of-the-art baselines, and achieves a favorable balance between accuracy and efficiency for classification at scale. We also share our practical experiences and insights on the challenges and opportunities of using LLMs for large-scale text mining in real-world applications.
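The abstract describes a two-phase pipeline: (1) zero-shot, multi-stage taxonomy induction, in which an LLM proposes a label taxonomy and iteratively refines it over samples of the corpus, and (2) LLM pseudo-labeling followed by training a lightweight classifier that is cheap to deploy and serve at scale. Below is a minimal sketch of that flow under stated assumptions, not the authors' implementation: the `llm` callable (any prompt-to-completion wrapper), the prompt wording, and the tf-idf plus logistic-regression student model are all illustrative choices.

```python
# Minimal sketch of a TnT-LLM-style two-phase pipeline. Illustrative only:
# `llm` is assumed to be any prompt -> completion callable (e.g. a wrapper around
# a chat-completion API); prompts and the tf-idf + logistic-regression student
# are stand-ins, not the paper's actual implementation.
from typing import Callable, List

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def generate_taxonomy(llm: Callable[[str], str], texts: List[str],
                      use_case: str, batch_size: int = 50, rounds: int = 3) -> str:
    """Phase 1: zero-shot, multi-stage taxonomy induction.
    The LLM proposes a label taxonomy from one batch of texts, then refines it
    over successive batches, mirroring the iterative refinement in the abstract."""
    taxonomy = ""
    for i in range(rounds):
        batch = texts[i * batch_size:(i + 1) * batch_size]
        if not batch:
            break
        prompt = (
            f"Use case: {use_case}\n"
            f"Current label taxonomy (may be empty):\n{taxonomy}\n\n"
            "Texts:\n" + "\n".join(f"- {t}" for t in batch) +
            "\n\nPropose or refine a concise label taxonomy, one label per line."
        )
        taxonomy = llm(prompt)
    return taxonomy


def pseudo_label(llm: Callable[[str], str], taxonomy: str, texts: List[str]) -> List[str]:
    """Phase 2a: use the LLM as a data labeler to produce training pairs."""
    labels = []
    for t in texts:
        prompt = (
            f"Label taxonomy:\n{taxonomy}\n\n"
            f"Text: {t}\n"
            "Answer with exactly one label from the taxonomy."
        )
        labels.append(llm(prompt).strip())
    return labels


def train_lightweight_classifier(texts: List[str], labels: List[str]):
    """Phase 2b: distil the pseudo labels into a cheap classifier that can be
    deployed and served at scale (tf-idf + logistic regression as a stand-in)."""
    student = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    student.fit(texts, labels)
    return student
```

At serving time only the distilled classifier runs, which is where the favorable balance between accuracy and efficiency described in the abstract comes from.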
Related papers
- On Unsupervised Prompt Learning for Classification with Black-box Language Models [71.60563181678323]
Large language models (LLMs) have achieved impressive success in text-formatted learning problems.
LLMs can label datasets with even better quality than skilled human annotators.
In this paper, we propose unsupervised prompt learning for classification with black-box LLMs.
arXiv Detail & Related papers (2024-10-04T03:39:28Z)
- Zero-to-Strong Generalization: Eliciting Strong Capabilities of Large Language Models Iteratively without Gold Labels [75.77877889764073]
Large Language Models (LLMs) have demonstrated remarkable performance through supervised fine-tuning or in-context learning using gold labels.
This study explores whether solely utilizing unlabeled data can elicit strong model capabilities.
We propose a new paradigm termed zero-to-strong generalization.
arXiv Detail & Related papers (2024-09-19T02:59:44Z)
- Scalable and Domain-General Abstractive Proposition Segmentation [20.532804009152255]
We focus on the task of abstractive proposition segmentation (APS): transforming text into simple, self-contained, well-formed sentences.
We first introduce evaluation metrics for the task to measure several dimensions of quality.
We then propose a scalable, yet accurate, proposition segmentation model.
arXiv Detail & Related papers (2024-06-28T10:24:31Z)
- Knowledge Distillation in Automated Annotation: Supervised Text Classification with LLM-Generated Training Labels [0.0]
We assess the potential for researchers to augment or replace human-generated training data with surrogate training labels from large language models (LLMs).
We employ a novel corpus of English-language text classification data sets from recent CSS articles in high-impact journals.
For each task, we compare supervised classifiers fine-tuned using GPT-4 labels against classifiers fine-tuned with human annotations and against labels from GPT-4 and Mistral-7B with few-shot in-context learning.
Our findings indicate that supervised classification models fine-tuned on LLM-generated labels perform comparably to models fine-tuned with labels from human annotators.
arXiv Detail & Related papers (2024-06-25T15:20:25Z)
- Entity Alignment with Noisy Annotations from Large Language Models [15.189701951003611]
We propose a unified framework, LLM4EA, to effectively leverage Large Language Models for EA.
Specifically, we design a novel active learning policy to significantly reduce the annotation space.
We iteratively optimize the policy based on the feedback from a base EA model.
arXiv Detail & Related papers (2024-05-27T03:52:55Z)
- LLMaAA: Making Large Language Models as Active Annotators [32.57011151031332]
We propose LLMaAA, which takes large language models as annotators and puts them into an active learning loop to determine what to annotate efficiently.
We conduct experiments and analysis on two classic NLP tasks, named entity recognition and relation extraction.
With LLMaAA, task-specific models trained from LLM-generated labels can outperform the teacher within only hundreds of annotated examples; a generic sketch of this LLM-in-the-loop annotation pattern appears after this list.
arXiv Detail & Related papers (2023-10-30T14:54:15Z)
- AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z)
- Exploring Structured Semantic Prior for Multi Label Recognition with Incomplete Labels [60.675714333081466]
Multi-label recognition (MLR) with incomplete labels is very challenging.
Recent works strive to explore the image-to-label correspondence in the vision-language model, i.e., CLIP, to compensate for insufficient annotations.
We advocate remedying the deficiency of label supervision for the MLR with incomplete labels by deriving a structured semantic prior.
arXiv Detail & Related papers (2023-03-23T12:39:20Z)
- Ground Truth Inference for Weakly Supervised Entity Matching [76.6732856489872]
We propose a simple but powerful labeling model for weak supervision tasks.
We then tailor the labeling model specifically to the task of entity matching.
We show that our labeling model results in a 9% higher F1 score on average than the best existing method.
arXiv Detail & Related papers (2022-11-13T17:57:07Z)
- An Empirical Study on Large-Scale Multi-Label Text Classification Including Few and Zero-Shot Labels [49.036212158261215]
Large-scale Multi-label Text Classification (LMTC) has a wide range of Natural Language Processing (NLP) applications.
Current state-of-the-art LMTC models employ Label-Wise Attention Networks (LWANs).
We show that hierarchical methods based on Probabilistic Label Trees (PLTs) outperform LWANs.
We propose a new state-of-the-art method which combines BERT with LWANs.
arXiv Detail & Related papers (2020-10-04T18:55:47Z)
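A recurring pattern across several of the related papers above (notably LLMaAA and the knowledge-distillation study with GPT-4 labels) is an LLM-in-the-loop annotation cycle: label a small batch with the LLM, train a small student model, and let the student's uncertainty pick the next batch to annotate. The sketch below illustrates that generic pattern only; it is not the annotation policy of any specific paper, and `llm_annotate`, the uncertainty-sampling heuristic, and the scikit-learn student are assumptions made for illustration.

```python
# Generic sketch of an LLM-in-the-loop active annotation cycle (illustrative only).
# Assumes `llm_annotate` maps a text to a label string and that `pool` is non-empty;
# the uncertainty-sampling heuristic and the tf-idf student are stand-ins.
from typing import Callable, List

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def active_annotation_loop(llm_annotate: Callable[[str], str], pool: List[str],
                           rounds: int = 5, batch: int = 50, seed: int = 0):
    """Ask the LLM to label the texts the student model is least sure about,
    then retrain the student on everything labeled so far."""
    rng = np.random.default_rng(seed)
    student = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    labeled = {}  # pool index -> LLM-provided label

    # Seed round: a random batch, since there is no student yet to score uncertainty.
    picked = rng.choice(len(pool), size=min(batch, len(pool)), replace=False).tolist()
    for _ in range(rounds):
        for i in picked:
            labeled[i] = llm_annotate(pool[i])
        texts = [pool[i] for i in labeled]
        labels = [labeled[i] for i in labeled]
        remaining = [i for i in range(len(pool)) if i not in labeled]
        if len(set(labels)) >= 2:
            student.fit(texts, labels)
        if not remaining:
            break
        if len(set(labels)) < 2:
            # Need at least two classes before the student can be fit; sample randomly.
            picked = rng.choice(remaining, size=min(batch, len(remaining)),
                                replace=False).tolist()
            continue
        # Uncertainty sampling: lowest maximum predicted probability first.
        probs = student.predict_proba([pool[i] for i in remaining])
        order = np.argsort(probs.max(axis=1))
        picked = [remaining[j] for j in order[:batch]]
    return student, labeled
```

In the papers above the student is typically a small fine-tuned transformer rather than a linear model, but the loop structure is the same.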
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.