Related papers: Going beyond research datasets: Novel intent discovery in the industry setting

URL: http://arxiv.org/abs/2305.05474v1
Date: Tue, 9 May 2023 14:21:29 GMT
Title: Going beyond research datasets: Novel intent discovery in the industry setting
Authors: Aleksandra Chrabrowa, Tsimur Hadeliya, Dariusz Kajtoch, Robert Mroczkowski, Piotr Rybak
Abstract summary: This paper proposes methods to improve the intent discovery pipeline deployed in a large e-commerce platform. We show the benefit of pre-training language models on in-domain data: both self-supervised and with weak supervision. We also devise the best method to utilize the conversational structure (i.e., question and answer) of real-life datasets during fine-tuning for clustering tasks, which we call Conv.
Score: 60.90117614762879
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Novel intent discovery automates the process of grouping similar messages (questions) to identify previously unknown intents. However, current research focuses on publicly available datasets which have only the question field and significantly differ from real-life datasets. This paper proposes methods to improve the intent discovery pipeline deployed in a large e-commerce platform. We show the benefit of pre-training language models on in-domain data: both self-supervised and with weak supervision. We also devise the best method to utilize the conversational structure (i.e., question and answer) of real-life datasets during fine-tuning for clustering tasks, which we call Conv. All our methods combined to fully utilize real-life datasets give up to a 33pp performance boost over the state-of-the-art Constrained Deep Adaptive Clustering (CDAC) model applied to questions only. By comparison, the CDAC model applied to question data only gives at most a 13pp performance boost over the naive baseline.

Related papers

Adapting Vision-Language Models Without Labels: A Comprehensive Survey [74.17944178027015]
Vision-Language Models (VLMs) have demonstrated remarkable generalization capabilities across a wide range of tasks. Recent research has increasingly focused on unsupervised adaptation methods that do not rely on labeled data. We propose a taxonomy based on the availability and nature of unlabeled visual data, categorizing existing approaches into four key paradigms.
arXiv Detail & Related papers (2025-08-07T16:27:37Z)
SPILL: Domain-Adaptive Intent Clustering based on Selection and Pooling with Large Language Models [5.257115841810258]
Selection and Pooling with Large Language Models (SPILL) is an intuitive and domain-adaptive method for intent clustering without fine-tuning. Our goal is to make existing embedders more generalizable to new domain datasets without further fine-tuning. Our method achieves comparable results to other state-of-the-art studies, even those that use much larger models and require fine-tuning.
arXiv Detail & Related papers (2025-03-19T15:48:57Z)
DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection [72.25697820290502]
This work introduces a straightforward and efficient strategy to identify potential novel classes through zero-shot classification. We refer to this approach as the self-training strategy, which enhances recall and accuracy for novel classes without requiring extra annotations, datasets, and re-training. Empirical evaluations on three datasets, including LVIS, V3Det, and COCO, demonstrate significant improvements over the baseline performance.
arXiv Detail & Related papers (2023-10-02T17:52:24Z)
Contrastive Continual Multi-view Clustering with Filtered Structural Fusion [57.193645780552565]
Multi-view clustering thrives in applications where views are collected in advance. It overlooks scenarios where data views are collected sequentially, i.e., real-time data. Some methods are proposed to handle it but are trapped in a stability-plasticity dilemma. We propose Contrastive Continual Multi-view Clustering with Filtered Structural Fusion.
arXiv Detail & Related papers (2023-09-26T14:18:29Z)
Revisit Few-shot Intent Classification with PLMs: Direct Fine-tuning vs. Continual Pre-training [20.98770732015944]
Few-shot intent detection involves training a deep learning model to classify utterances based on their underlying intents using only a small amount of labeled data. We show that continual pre-training may not be essential, since the overfitting problem of PLMs on this task may not be as serious as expected. To maximize the utilization of the limited available data, we propose a context augmentation method and leverage sequential self-distillation to boost performance.
arXiv Detail & Related papers (2023-06-08T15:26:52Z)
infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization. infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information. In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
Dominant Set-based Active Learning for Text Classification and its Application to Online Social Media [0.0]
We present a novel pool-based active learning method for training on a large unlabeled corpus with minimum annotation cost. Our proposed method does not have any parameters to be tuned, making it dataset-independent. Our method achieves a higher performance in comparison to the state-of-the-art active learning strategies.
arXiv Detail & Related papers (2022-01-28T19:19:03Z)
Towards General and Efficient Active Learning [20.888364610175987]
Active learning aims to select the most informative samples to exploit limited annotation budgets. We propose a novel general and efficient active learning (GEAL) method in this paper. Our method can conduct data selection processes on different datasets with a single-pass inference of the same model.
arXiv Detail & Related papers (2021-12-15T08:35:28Z)
Enhancing the Generalization for Intent Classification and Out-of-Domain Detection in SLU [70.44344060176952]
Intent classification is a major task in spoken language understanding (SLU). Recent works have shown that using extra data and labels can improve the OOD detection performance. This paper proposes to train a model with only IND data while supporting both IND intent classification and OOD detection.
arXiv Detail & Related papers (2021-06-28T08:27:38Z)
The Surprising Performance of Simple Baselines for Misinformation Detection [4.060731229044571]
We examine the performance of a broad set of modern transformer-based language models. We present our framework as a baseline for creating and evaluating new methods for misinformation detection.
arXiv Detail & Related papers (2021-04-14T16:25:22Z)
Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings. We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data. We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
Self-Supervision based Task-Specific Image Collection Summarization [3.115375810642661]
We propose a novel approach to task-specific image corpus summarization using semantic information and self-supervision. Our method uses a classification-based Wasserstein generative adversarial network (WGAN) as a feature generating network. The model then generates a summary at inference time by using K-means clustering in the semantic embedding space.
arXiv Detail & Related papers (2020-12-19T10:58:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.