Related papers: SPILL: Domain-Adaptive Intent Clustering based on Selection and Pooling with Large Language Models

URL: http://arxiv.org/abs/2503.15351v1
Date: Wed, 19 Mar 2025 15:48:57 GMT
Title: SPILL: Domain-Adaptive Intent Clustering based on Selection and Pooling with Large Language Models
Authors: I-Fan Lin, Faegheh Hasibi, Suzan Verberne
Abstract summary: Selection and Pooling with Large Language Models (SPILL) is an intuitive and domain-adaptive method for intent clustering without fine-tuning. Our goal is to make existing embedders more generalizable to new domain datasets without further fine-tuning. Our method achieves comparable results to other state-of-the-art studies, even those that use much larger models and require fine-tuning.
Score: 5.257115841810258
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper, we propose Selection and Pooling with Large Language Models (SPILL), an intuitive and domain-adaptive method for intent clustering without fine-tuning. Existing embeddings-based clustering methods rely on a few labeled examples or unsupervised fine-tuning to optimize results for each new dataset, which makes them less generalizable to multiple datasets. Our goal is to make these existing embedders more generalizable to new domain datasets without further fine-tuning. Inspired by our theoretical derivation and simulation results on the effectiveness of sampling and pooling techniques, we view the clustering task as a small-scale selection problem. A good solution to this problem is associated with better clustering performance. Accordingly, we propose a two-stage approach: First, for each utterance (referred to as the seed), we derive its embedding using an existing embedder. Then, we apply a distance metric to select a pool of candidates close to the seed. Because the embedder is not optimized for new datasets, in the second stage, we use an LLM to further select utterances from these candidates that share the same intent as the seed. Finally, we pool these selected candidates with the seed to derive a refined embedding for the seed. We found that our method generally outperforms directly using an embedder, and it achieves comparable results to other state-of-the-art studies, even those that use much larger models and require fine-tuning, showing its strength and efficiency. Our results indicate that our method enables existing embedders to be further improved without additional fine-tuning, making them more adaptable to new domain datasets. Additionally, viewing the clustering task as a small-scale selection problem gives the potential of using LLMs to customize clustering tasks according to the user's goals.
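The two-stage procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name refine_embeddings, the same_intent callable (standing in for the LLM selection step), the use of cosine similarity as the distance metric, mean pooling, and the pool_size parameter are all assumptions made for the example.

```python
from typing import Callable, Sequence

import numpy as np


def refine_embeddings(
    utterances: Sequence[str],
    embeddings: np.ndarray,
    same_intent: Callable[[str, str], bool],
    pool_size: int = 2,
) -> np.ndarray:
    """Refine each seed embedding by pooling it with selected neighbours.

    same_intent is a hypothetical stand-in for the LLM selector; cosine
    similarity and mean pooling are assumptions, not confirmed details.
    """
    # Normalize rows so that a dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    refined = np.empty_like(unit)
    for i, seed in enumerate(utterances):
        # Stage 1: take the pool_size nearest neighbours of the seed.
        order = np.argsort(-sims[i])
        candidates = [int(j) for j in order if j != i][:pool_size]
        # Stage 2: keep only candidates the selector judges to share the intent.
        kept = [j for j in candidates if same_intent(seed, utterances[j])]
        # Pooling: average the seed with the selected candidates.
        refined[i] = unit[[i] + kept].mean(axis=0)
    return refined


# Toy data: two intents, with a keyword rule in place of a real LLM call.
toy_utterances = ["check balance", "show my balance", "book a flight", "reserve a flight"]
toy_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])


def toy_selector(seed: str, candidate: str) -> bool:
    # Hypothetical rule: same intent if the last word matches.
    return seed.split()[-1] == candidate.split()[-1]


toy_refined = refine_embeddings(toy_utterances, toy_embeddings, toy_selector)
```

The refined vectors would then be passed to an ordinary clustering algorithm in place of the raw embeddings. In practice the selector would wrap a prompt to an LLM; the keyword rule here only keeps the sketch runnable offline.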

Related papers

Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes [50.544186914115045]
Large language models (LLMs) are increasingly embedded in everyday applications. Ensuring their alignment with the diverse preferences of individual users has become a critical challenge. We present a novel framework for few-shot steerable alignment.
arXiv Detail & Related papers (2024-12-18T16:14:59Z)
Adapt-$\infty$: Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection [89.42023974249122]
Adapt-$\infty$ is a new multi-way and adaptive data selection approach for Lifelong Instruction Tuning. We construct pseudo-skill clusters by grouping gradient-based sample vectors. We select the best-performing data selector for each skill cluster from a pool of selector experts.
arXiv Detail & Related papers (2024-10-14T15:48:09Z)
CLAMS: A System for Zero-Shot Model Selection for Clustering [3.7127285734321194]
We propose an AutoML system that enables model selection on clustering problems by leveraging optimal transport-based dataset similarity. We compare our results against multiple clustering baselines and find that it outperforms all of them, hence demonstrating the utility of similarity-based automated model selection for solving clustering applications.
arXiv Detail & Related papers (2024-07-15T23:50:07Z)
BWS: Best Window Selection Based on Sample Scores for Data Pruning across Broad Ranges [12.248397169100784]
Data subset selection aims to find a smaller yet informative subset of a large dataset that can approximate the full-dataset training. We introduce a universal and efficient data subset selection method, Best Window Selection (BWS), by proposing a method to choose the best window subset from samples ordered based on their difficulty scores.
arXiv Detail & Related papers (2024-06-05T08:33:09Z)
Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs [18.242110417706]
This work focuses on leveraging and selecting from vast, unlabeled, open data to pre-fine-tune a pre-trained language model. We show the optimality of this approach for fine-tuning tasks under certain conditions. Our proposed method is significantly faster than existing techniques, scaling to millions of samples within a single GPU hour.
arXiv Detail & Related papers (2024-05-05T00:08:00Z)
DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality. We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data. Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z)
End-to-end Learnable Clustering for Intent Learning in Recommendation [54.157784572994316]
We propose a novel intent learning method termed ELCRec. It unifies behavior representation learning into an End-to-end Learnable Clustering framework. We deploy this method on the industrial recommendation system with 130 million page views and achieve promising results.
arXiv Detail & Related papers (2024-01-11T15:22:55Z)
Generalized Category Discovery with Large Language Models in the Loop [10.440661581492723]
We propose Loop, an end-to-end active-learning framework that introduces Large Language Models into the training loop. We show that Loop outperforms SOTA models by a large margin and generates accurate category names for the discovered clusters.
arXiv Detail & Related papers (2023-12-18T02:55:14Z)
Going beyond research datasets: Novel intent discovery in the industry setting [60.90117614762879]
This paper proposes methods to improve the intent discovery pipeline deployed in a large e-commerce platform. We show the benefit of pre-training language models on in-domain data: both self-supervised and with weak supervision. We also devise the best method to utilize the conversational structure (i.e., question and answer) of real-life datasets during fine-tuning for clustering tasks, which we call Conv.
arXiv Detail & Related papers (2023-05-09T14:21:29Z)
Hard Regularization to Prevent Deep Online Clustering Collapse without Data Augmentation [65.268245109828]
Online deep clustering refers to the joint use of a feature extraction network and a clustering model to assign cluster labels to each new data point or batch as it is processed. While faster and more versatile than offline methods, online clustering can easily reach the collapsed solution where the encoder maps all inputs to the same point and all are put into a single cluster. We propose a method that does not require data augmentation, and that, differently from existing methods, regularizes the hard assignments.
arXiv Detail & Related papers (2023-03-29T08:23:26Z)
S$^3$VAADA: Submodular Subset Selection for Virtual Adversarial Active Domain Adaptation [49.01925978752677]
In real-world scenarios, it might be feasible to get labels for a small proportion of target data. We propose S$^3$VAADA, which i) introduces a novel submodular criterion to select a maximally informative subset to label and ii) enhances a cluster-based DA procedure. Our approach consistently outperforms the competing state-of-the-art approaches on datasets with varying degrees of domain shifts.
arXiv Detail & Related papers (2021-09-18T10:53:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
