Representation Learning for Short Text Clustering
- URL: http://arxiv.org/abs/2109.09894v1
- Date: Tue, 21 Sep 2021 00:30:24 GMT
- Title: Representation Learning for Short Text Clustering
- Authors: Hui Yin, Xiangyu Song, Shuiqiao Yang, Guangyan Huang and Jianxin Li
- Abstract summary: We propose two methods to exploit the unsupervised autoencoder (AE) framework for optimal clustering performance.
In our first method Structural Text Network Graph Autoencoder (STN-GAE), we exploit the structural text information among the corpus by constructing a text network, and then adopt graph convolutional network as encoder.
In our second method Soft Cluster Assignment Autoencoder (SCA-AE), we adopt an extra soft cluster assignment constraint on the latent space of autoencoder to encourage the learned text representations to be more clustering-friendly.
- Score: 9.896550179440544
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Effective representation learning is critical for short text clustering due
to the sparse, high-dimensional and noisy nature of short text corpora.
Existing pre-trained models (e.g., Word2vec and BERT) have greatly improved the
expressiveness for short text representations with more condensed,
low-dimensional and continuous features compared to the traditional
Bag-of-Words (BoW) model. However, these models are trained for general
purposes and thus are suboptimal for the short text clustering task. In this
paper, we propose two methods to exploit the unsupervised autoencoder (AE)
framework to further tune the short text representations based on these
pre-trained text models for optimal clustering performance. In our first method
Structural Text Network Graph Autoencoder (STN-GAE), we exploit the structural
text information among the corpus by constructing a text network, and then
adopt graph convolutional network as encoder to fuse the structural features
with the pre-trained text features for text representation learning. In our
second method Soft Cluster Assignment Autoencoder (SCA-AE), we adopt an extra
soft cluster assignment constraint on the latent space of autoencoder to
encourage the learned text representations to be more clustering-friendly. We
tested the two methods on seven popular short text datasets, and the experimental
results show that when only the pre-trained model is used for short text
clustering, BERT performs better than BoW and Word2vec. However, once the
pre-trained representations are further tuned, the proposed methods, such as
SCA-AE, can greatly improve clustering performance, and the accuracy improvement
over using BERT alone reaches as much as 14%.
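For concreteness, below is a minimal sketch (not the authors' released code) of the STN-GAE direction. It assumes the text network is a kNN graph built over pre-trained BERT sentence embeddings, uses a two-layer GCN encoder over those features, and trains a graph autoencoder that reconstructs the adjacency matrix; the paper's exact graph construction and hyperparameters may differ.

```python
# Minimal sketch of the STN-GAE direction (not the authors' code).
# Assumption: the text network is a kNN graph over pre-trained BERT
# sentence embeddings; the paper's graph construction may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

def knn_adjacency(features, k=10):
    """Build a symmetric kNN graph and normalize it GCN-style (D^-1/2 A D^-1/2)."""
    sim = F.normalize(features, dim=1) @ F.normalize(features, dim=1).t()
    idx = sim.topk(k + 1, dim=1).indices          # top-(k+1) keeps the self-loop
    adj = torch.zeros_like(sim).scatter_(1, idx, 1.0)
    adj = ((adj + adj.t()) > 0).float()           # symmetrize
    d_inv_sqrt = adj.sum(dim=1).clamp(min=1.0).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * adj * d_inv_sqrt.unsqueeze(0)

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj_norm):
        # Aggregate neighbor features over the text graph, then transform.
        return self.lin(adj_norm @ x)

class STNGAE(nn.Module):
    """Graph autoencoder: the GCN encoder fuses graph structure with the
    pre-trained text features; the decoder reconstructs edges from inner
    products of the latent codes."""
    def __init__(self, in_dim=768, hid_dim=256, latent_dim=20):
        super().__init__()
        self.gc1 = GCNLayer(in_dim, hid_dim)
        self.gc2 = GCNLayer(hid_dim, latent_dim)

    def forward(self, x, adj_norm):
        z = self.gc2(F.relu(self.gc1(x, adj_norm)), adj_norm)
        adj_rec = torch.sigmoid(z @ z.t())
        return z, adj_rec

# Training sketch: binary cross-entropy between adj_rec and the observed
# adjacency, then k-means on the latent codes z to obtain the clusters.
```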
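The soft-cluster-assignment idea behind SCA-AE can likewise be sketched as follows, assuming a DEC-style Student's t-distribution assignment and a KL-divergence term added to the autoencoder reconstruction loss; the layer sizes, the trade-off weight gamma, and all names here are illustrative, not taken from the paper.

```python
# Minimal sketch of the soft-cluster-assignment constraint (not the authors'
# code). Assumes a DEC-style Student's t soft assignment; layer sizes and the
# weight gamma are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SCAAutoencoder(nn.Module):
    def __init__(self, in_dim=768, latent_dim=20, n_clusters=8, alpha=1.0):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim))
        # Cluster centroids live in the latent space and are learned jointly.
        self.centroids = nn.Parameter(torch.randn(n_clusters, latent_dim))
        self.alpha = alpha

    def soft_assign(self, z):
        # Student's t kernel between latent codes and centroids, normalized
        # over clusters: the soft cluster assignment q.
        dist_sq = torch.cdist(z, self.centroids).pow(2)
        q = (1.0 + dist_sq / self.alpha).pow(-(self.alpha + 1.0) / 2.0)
        return q / q.sum(dim=1, keepdim=True)

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), self.soft_assign(z)

def target_distribution(q):
    # Sharpened assignments: confident points pull the latent space toward
    # a clustering-friendly structure.
    p = q.pow(2) / q.sum(dim=0)
    return p / p.sum(dim=1, keepdim=True)

def sca_loss(x, x_rec, q, gamma=0.1):
    # Reconstruction keeps latent codes faithful to the pre-trained features;
    # the KL term is the extra soft-cluster-assignment constraint.
    recon = F.mse_loss(x_rec, x)
    kl = F.kl_div(q.log(), target_distribution(q).detach(), reduction="batchmean")
    return recon + gamma * kl
```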
Related papers
- Text Clustering with LLM Embeddings [0.0]
The effectiveness of text clustering largely depends on the selection of textual embeddings and clustering algorithms.
Recent advancements in large language models (LLMs) have the potential to enhance this task.
Findings indicate that LLM embeddings are superior at capturing subtleties in structured language.
arXiv Detail & Related papers (2024-03-22T11:08:48Z)
- Copy Is All You Need [66.00852205068327]
We formulate text generation as progressively copying text segments from an existing text collection.
Our approach achieves better generation quality according to both automatic and human evaluations.
Our approach attains additional performance gains by simply scaling up to larger text collections.
arXiv Detail & Related papers (2023-07-13T05:03:26Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z)
- JOIST: A Joint Speech and Text Streaming Model For ASR [63.15848310748753]
We present JOIST, an algorithm to train a streaming, cascaded, encoder end-to-end (E2E) model with both speech-text paired inputs, and text-only unpaired inputs.
We find that the best text representation for JOIST improves WER across a variety of search and rare-word test sets by 4-14% relative, compared to a model not trained with text.
arXiv Detail & Related papers (2022-10-13T20:59:22Z)
- Text Revision by On-the-Fly Representation Optimization [76.11035270753757]
Current state-of-the-art methods formulate these tasks as sequence-to-sequence learning problems.
We present an iterative in-place editing approach for text revision, which requires no parallel data.
It achieves competitive and even better performance than state-of-the-art supervised methods on text simplification.
arXiv Detail & Related papers (2022-04-15T07:38:08Z)
- Syntax-Enhanced Pre-trained Model [49.1659635460369]
We study the problem of leveraging the syntactic structure of text to enhance pre-trained models such as BERT and RoBERTa.
Existing methods utilize the syntax of text either in the pre-training stage or in the fine-tuning stage, so they suffer from a discrepancy between the two stages.
We present a model that utilizes the syntax of text in both pre-training and fine-tuning stages.
arXiv Detail & Related papers (2020-12-28T06:48:04Z)
- Two-Level Transformer and Auxiliary Coherence Modeling for Improved Text Segmentation [9.416757363901295]
We introduce a novel supervised model for text segmentation with simple but explicit coherence modeling.
Our model -- a neural architecture consisting of two hierarchically connected Transformer networks -- is a multi-task learning model that couples the sentence-level segmentation objective with the coherence objective that differentiates correct sequences of sentences from corrupt ones.
arXiv Detail & Related papers (2020-01-03T17:06:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.