A Flexible Clustering Pipeline for Mining Text Intentions
- URL: http://arxiv.org/abs/2202.01211v1
- Date: Tue, 1 Feb 2022 22:54:18 GMT
- Title: A Flexible Clustering Pipeline for Mining Text Intentions
- Authors: Xinyu Chen and Ian Beaver
- Abstract summary: We create a flexible and scalable clustering pipeline within the Verint Intent Manager.
It integrates the fine-tuning of language models, a high-performing k-NN library, and community detection techniques.
As deployed in the VIM application, this clustering pipeline produces high quality results.
- Score: 6.599344783327053
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mining the latent intentions from large volumes of natural language inputs is
a key step to help data analysts design and refine Intelligent Virtual
Assistants (IVAs) for customer service and sales support. We created a flexible
and scalable clustering pipeline within the Verint Intent Manager (VIM) that
integrates the fine-tuning of language models, a high-performing k-NN library,
and community detection techniques to help analysts quickly surface and
organize relevant user intentions from conversational texts. The fine-tuning
step is necessary because pre-trained language models cannot encode texts to
efficiently surface particular clustering structures when the target texts are
from an unseen domain or the clustering task is not topic detection. We
describe the pipeline and demonstrate its performance using BERT on three
real-world text mining tasks. As deployed in the VIM application, this
clustering pipeline produces high quality results, improving the performance of
data analysts and reducing the time it takes to surface intentions from
customer service data, thereby reducing the time it takes to build and deploy
IVAs in new domains.
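The pipeline the abstract describes (encode texts, build a k-NN graph over the embeddings, then cluster via community detection) can be sketched in a few lines. The sketch below is purely illustrative and makes stated substitutions: toy bag-of-words vectors stand in for the fine-tuned BERT encoder, a brute-force cosine k-NN stands in for the high-performing k-NN library, and naive label propagation stands in for the paper's community detection technique. All function names here are hypothetical, not VIM's API.

```python
import numpy as np

def embed(texts, vocab):
    """L2-normalized bag-of-words vectors (placeholder for BERT embeddings)."""
    vecs = np.zeros((len(texts), len(vocab)))
    for i, text in enumerate(texts):
        for word in text.lower().split():
            if word in vocab:
                vecs[i, vocab[word]] += 1.0
    return vecs / np.maximum(np.linalg.norm(vecs, axis=1, keepdims=True), 1e-9)

def knn_graph(vecs, k):
    """Symmetric k-nearest-neighbor graph under cosine similarity."""
    sims = vecs @ vecs.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-edges
    adj = {i: set() for i in range(len(vecs))}
    for i, row in enumerate(np.argsort(-sims, axis=1)[:, :k]):
        for j in row:
            adj[i].add(int(j))
            adj[int(j)].add(i)
    return adj

def label_propagation(adj, iters=10):
    """Naive community detection: each node adopts its neighbors' majority label."""
    labels = {i: i for i in adj}
    for _ in range(iters):
        for i in adj:
            counts = {}
            for j in adj[i]:
                counts[labels[j]] = counts.get(labels[j], 0) + 1
            if counts:
                labels[i] = max(counts, key=counts.get)
    return labels

texts = [
    "reset password please", "forgot password reset", "password reset link",
    "billing invoice question", "invoice billing charge", "charge on invoice",
]
vocab = {w: i for i, w in enumerate(sorted({w for t in texts for w in t.split()}))}
labels = label_propagation(knn_graph(embed(texts, vocab), k=2))
# Texts 0-2 (password intent) and 3-5 (billing intent) land in two communities.
```

In the deployed pipeline, the embedding step is what the fine-tuning addresses: a pre-trained encoder is adapted so that distances in the embedding space reflect intent rather than topic, after which the graph and clustering stages need no task-specific changes.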
Related papers
- Zero-shot prompt-based classification: topic labeling in times of foundation models in German Tweets [1.734165485480267]
We propose a new tool for automatically annotating text using written guidelines without providing training samples.
Our results show that the prompt-based approach is comparable with the fine-tuned BERT but without any annotated training data.
Our findings emphasize the ongoing paradigm shift in the NLP landscape, i.e., the unification of downstream tasks and elimination of the need for pre-labeled training data.
arXiv Detail & Related papers (2024-06-26T10:44:02Z) - Semantic Parsing in Limited Resource Conditions [19.689433249830465]
The thesis explores challenges in semantic parsing, specifically focusing on scenarios with limited data and computational resources.
It offers solutions using techniques like automatic data curation, knowledge transfer, active learning, and continual learning.
arXiv Detail & Related papers (2023-09-14T05:03:09Z) - TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z) - GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training Data Exploration [97.68234051078997]
We discuss how Pyserini can be integrated with the Hugging Face ecosystem of open-source AI libraries and artifacts.
We include a Jupyter Notebook-based walkthrough of the core interoperability features, available on GitHub.
We present GAIA Search - a search engine built following previously laid out principles, giving access to four popular large-scale text collections.
arXiv Detail & Related papers (2023-06-02T12:09:59Z) - Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning [51.90524745663737]
A key innovation is our use of explanations as features, which can be used to boost GNN performance on downstream tasks.
Our method achieves state-of-the-art results on well-established TAG datasets.
Our method significantly speeds up training, achieving a 2.88 times improvement over the closest baseline on ogbn-arxiv.
arXiv Detail & Related papers (2023-05-31T03:18:03Z) - Structured Vision-Language Pretraining for Computational Cooking [54.0571416522547]
Vision-Language Pretraining and Foundation models have been the go-to recipe for achieving SoTA performance on general benchmarks.
We propose to leverage these techniques for structured-text based computational cuisine tasks.
arXiv Detail & Related papers (2022-12-08T13:37:17Z) - Annotated Dataset Creation through General Purpose Language Models for non-English Medical NLP [0.5482532589225552]
In our work we suggest to leverage pretrained language models for training data acquisition.
We create a custom dataset which we use to train a medical NER model for German texts, GPTNERMED.
arXiv Detail & Related papers (2022-08-30T18:42:55Z) - A Semi-Supervised Deep Clustering Pipeline for Mining Intentions From Texts [6.599344783327053]
Verint Intent Manager (VIM) is an analysis platform that combines unsupervised and semi-supervised approaches to help analysts quickly surface and organize relevant user intentions from conversational texts.
For the initial exploration of data, we make use of a novel unsupervised and semi-supervised pipeline that integrates the fine-tuning of high-performing language models.
BERT produces better task-aware representations using a labeled subset as small as 0.5% of the task data.
arXiv Detail & Related papers (2022-02-01T23:01:05Z) - SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Existing approaches, however, do not supply the procedures and pipelines needed to actually deploy machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z) - SLUE: New Benchmark Tasks for Spoken Language Understanding Evaluation on Natural Speech [44.68649535280397]
We propose a suite of benchmark tasks for Spoken Language Understanding Evaluation (SLUE)
SLUE consists of limited-size labeled training sets and corresponding evaluation sets.
We present the first phase of the SLUE benchmark suite, consisting of named entity recognition, sentiment analysis, and ASR on the corresponding datasets.
We provide new transcriptions and annotations on subsets of the VoxCeleb and VoxPopuli datasets, evaluation metrics and results for baseline models, and an open-source toolkit to reproduce the baselines and evaluate new models.
arXiv Detail & Related papers (2021-11-19T18:59:23Z) - Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource languages to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.