A Semi-Supervised Deep Clustering Pipeline for Mining Intentions From
Texts
- URL: http://arxiv.org/abs/2202.00802v1
- Date: Tue, 1 Feb 2022 23:01:05 GMT
- Title: A Semi-Supervised Deep Clustering Pipeline for Mining Intentions From
Texts
- Authors: Xinyu Chen and Ian Beaver
- Abstract summary: Verint Intent Manager (VIM) is an analysis platform that combines unsupervised and semi-supervised approaches to help analysts quickly surface and organize relevant user intentions from conversational texts.
For the initial exploration of data we make use of a novel unsupervised and semi-supervised pipeline that integrates the fine-tuning of high performing language models.
BERT produces better task-aware representations using a labeled subset as small as 0.5% of the task data.
- Score: 6.599344783327053
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mining the latent intentions from large volumes of natural language inputs is
a key step to help data analysts design and refine Intelligent Virtual
Assistants (IVAs) for customer service. To aid data analysts in this task we
present Verint Intent Manager (VIM), an analysis platform that combines
unsupervised and semi-supervised approaches to help analysts quickly surface
and organize relevant user intentions from conversational texts. For the
initial exploration of data we make use of a novel unsupervised and
semi-supervised pipeline that integrates the fine-tuning of high performing
language models, a distributed k-NN graph building method and community
detection techniques for mining the intentions and topics from texts. The
fine-tuning step is necessary because pre-trained language models cannot encode
texts to efficiently surface particular clustering structures when the target
texts are from an unseen domain or the clustering task is not topic detection.
For flexibility we deploy two clustering approaches: one where the number of
clusters must be specified and one where the number of clusters is detected
automatically, with comparable clustering quality but at the expense of
additional computation time. We describe the application and deployment and
demonstrate its performance using BERT on three text mining tasks. Our
experiments show that BERT begins to produce better task-aware representations
using a labeled subset as small as 0.5% of the task data. The clustering
quality exceeds the state-of-the-art results when BERT is fine-tuned with
labeled subsets of only 2.5% of the task data. As deployed in the VIM
application, this flexible clustering pipeline produces high quality results,
improving the performance of data analysts and reducing the time it takes to
surface intentions from customer service data, thereby reducing the time it
takes to build and deploy IVAs in new domains.
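The pipeline the abstract describes (encode texts, build a k-NN graph over the embeddings, then run community detection to surface clusters) can be sketched in miniature. This is an illustrative, dependency-free toy, not the paper's distributed k-NN library or its actual community detection method: the hand-made 3-d "embeddings", the `knn_graph` builder, and the simple label-propagation step are all assumptions standing in for fine-tuned BERT vectors and production components.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_graph(vectors, k):
    """Connect each item to its k most similar neighbours (symmetrized)."""
    edges = defaultdict(set)
    for i, v in enumerate(vectors):
        sims = sorted(
            ((cosine(v, w), j) for j, w in enumerate(vectors) if j != i),
            reverse=True,
        )
        for _, j in sims[:k]:
            edges[i].add(j)
            edges[j].add(i)
    return edges

def label_propagation(edges, n, iters=10):
    """Toy community detection: each node repeatedly adopts the most
    common label among its neighbours until labels stabilize."""
    labels = list(range(n))
    for _ in range(iters):
        changed = False
        for node in range(n):
            counts = defaultdict(int)
            for nb in edges[node]:
                counts[labels[nb]] += 1
            if counts:
                best = max(counts, key=lambda l: (counts[l], -l))
                if best != labels[node]:
                    labels[node] = best
                    changed = True
        if not changed:
            break
    return labels

# Toy "embeddings": two clearly separated groups of 3-d vectors.
vectors = [
    [1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [1.0, 0.0, 0.1],  # intent A
    [0.0, 1.0, 0.9], [0.1, 0.9, 1.0], [0.0, 1.0, 1.0],  # intent B
]
edges = knn_graph(vectors, k=2)
labels = label_propagation(edges, len(vectors))
print(labels)  # the two groups receive two distinct cluster labels
```

Note that, as in the paper's second clustering mode, the number of clusters is not specified up front: it emerges from the graph's community structure.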
Related papers
- Instance-Aware Graph Prompt Learning [71.26108600288308]
We introduce Instance-Aware Graph Prompt Learning (IA-GPL) in this paper.
The process involves generating intermediate prompts for each instance using a lightweight architecture.
Experiments conducted on multiple datasets and settings showcase the superior performance of IA-GPL compared to state-of-the-art baselines.
arXiv Detail & Related papers (2024-11-26T18:38:38Z)
- Zero-shot prompt-based classification: topic labeling in times of foundation models in German Tweets [1.734165485480267]
We propose a new tool for automatically annotating text using written guidelines without providing training samples.
Our results show that the prompt-based approach is comparable with the fine-tuned BERT but without any annotated training data.
Our findings emphasize the ongoing paradigm shift in the NLP landscape, i.e., the unification of downstream tasks and elimination of the need for pre-labeled training data.
arXiv Detail & Related papers (2024-06-26T10:44:02Z)
- STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models [56.27786433792638]
STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances.
We design fine-grained step-by-step instructions to obtain the initial data instances.
Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks.
arXiv Detail & Related papers (2023-05-24T12:15:19Z)
- A Flexible Clustering Pipeline for Mining Text Intentions [6.599344783327053]
We create a flexible and scalable clustering pipeline within the Verint Intent Manager.
It integrates the fine-tuning of language models, a high performing k-NN library and community detection techniques.
As deployed in the VIM application, this clustering pipeline produces high quality results.
arXiv Detail & Related papers (2022-02-01T22:54:18Z)
- SLUE: New Benchmark Tasks for Spoken Language Understanding Evaluation on Natural Speech [44.68649535280397]
We propose a suite of benchmark tasks for Spoken Language Understanding Evaluation (SLUE).
SLUE consists of limited-size labeled training sets and corresponding evaluation sets.
We present the first phase of the SLUE benchmark suite, consisting of named entity recognition, sentiment analysis, and ASR on the corresponding datasets.
We provide new transcriptions and annotations on subsets of the VoxCeleb and VoxPopuli datasets, evaluation metrics and results for baseline models, and an open-source toolkit to reproduce the baselines and evaluate new models.
arXiv Detail & Related papers (2021-11-19T18:59:23Z)
- Generative Conversational Networks [67.13144697969501]
We propose a framework called Generative Conversational Networks, in which conversational agents learn to generate their own labelled training data.
We show an average improvement of 35% in intent detection and 21% in slot tagging over a baseline model trained from the seed data.
arXiv Detail & Related papers (2021-06-15T23:19:37Z)
- X2Parser: Cross-Lingual and Cross-Domain Framework for Task-Oriented Compositional Semantic Parsing [51.81533991497547]
Task-oriented compositional semantic parsing (TCSP) handles complex nested user queries.
We present X2Parser, a transferable Cross-lingual and Cross-domain Parser for TCSP.
We propose to predict flattened intents and slots representations separately and cast both prediction tasks into sequence labeling problems.
arXiv Detail & Related papers (2021-06-07T16:40:05Z)
- DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [88.62288327934499]
We propose a novel augmentation method with language models trained on the linearized labeled sentences.
Our method is applicable to both supervised and semi-supervised settings.
arXiv Detail & Related papers (2020-11-03T07:49:15Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models, and the results verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.