BATS: A Spectral Biclustering Approach to Single Document Topic Modeling and Segmentation
- URL: http://arxiv.org/abs/2008.02218v3
- Date: Tue, 25 May 2021 11:38:03 GMT
- Title: BATS: A Spectral Biclustering Approach to Single Document Topic Modeling and Segmentation
- Authors: Qiong Wu, Adam Hare, Sirui Wang, Yuwei Tu, Zhenming Liu, Christopher G. Brinton, Yanhua Li
- Abstract summary: Existing topic modeling and text segmentation methodologies generally require large datasets for training, limiting their capabilities when only small collections of text are available.
In developing a methodology to handle single documents, we face two major challenges.
First is sparse information: with access to only one document, we cannot train traditional topic models or deep learning algorithms.
Second is significant noise: a considerable portion of words in any single document will produce only noise and not help discern topics or segments.
- Score: 17.003488045214972
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing topic modeling and text segmentation methodologies generally require
large datasets for training, limiting their capabilities when only small
collections of text are available. In this work, we reexamine the inter-related
problems of "topic identification" and "text segmentation" for sparse document
learning, when there is a single new text of interest. In developing a
methodology to handle single documents, we face two major challenges. First is
sparse information: with access to only one document, we cannot train
traditional topic models or deep learning algorithms. Second is significant
noise: a considerable portion of words in any single document will produce only
noise and not help discern topics or segments. To tackle these issues, we
design an unsupervised, computationally efficient methodology called BATS:
Biclustering Approach to Topic modeling and Segmentation. BATS leverages three
key ideas to simultaneously identify topics and segment text: (i) a new
mechanism that uses word order information to reduce sample complexity, (ii) a
statistically sound graph-based biclustering technique that identifies latent
structures of words and sentences, and (iii) a collection of effective
heuristics that remove noise words and award important words to further improve
performance. Experiments on four datasets show that our approach outperforms
several state-of-the-art baselines when considering topic coherence, topic
diversity, segmentation, and runtime comparison metrics.
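The graph-based biclustering idea in (ii) can be illustrated with a classical spectral co-clustering step applied to a sentence-word count matrix: normalize the matrix, take its SVD, and partition rows (sentences) and columns (words) by the sign of the second singular vectors. This is a minimal sketch of the general technique (Dhillon-style spectral co-clustering), not BATS itself; the toy sentences and the function name are illustrative.

```python
import numpy as np

def spectral_bicluster(A):
    """Bipartition the rows (sentences) and columns (words) of a
    nonnegative count matrix A via spectral co-clustering:
    normalize as D1^{-1/2} A D2^{-1/2}, take the SVD, and split by
    the sign of the second singular vectors."""
    r = A.sum(axis=1)                     # sentence (row) totals
    c = A.sum(axis=0)                     # word (column) totals
    An = A / np.sqrt(np.outer(r, c))      # D1^{-1/2} A D2^{-1/2}
    U, s, Vt = np.linalg.svd(An)
    # The leading singular pair is trivial (all-positive); the
    # second pair carries the co-cluster structure.
    row_labels = (U[:, 1] > 0).astype(int)
    col_labels = (Vt[1, :] > 0).astype(int)
    return row_labels, col_labels

# Toy "document": three sentences about cats, three about markets,
# loosely coupled through the shared word "the".
sentences = [
    "the cat purr fur", "cat fur", "purr cat",
    "the stock market price", "market price", "stock market",
]
vocab = sorted({w for s in sentences for w in s.split()})
A = np.array([[s.split().count(w) for w in vocab] for s in sentences],
             dtype=float)
row_labels, col_labels = spectral_bicluster(A)
# With this toy input, sentences 0-2 and 3-5 fall into opposite
# biclusters, and the topic words split the same way.
```

In BATS the analogous step operates on a graph built from word and sentence statistics, with the word-order mechanism and noise-word heuristics applied around it; the sketch above shows only the spectral split itself.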
Related papers
- Semantic Component Analysis: Discovering Patterns in Short Texts Beyond Topics [7.784473631566062]
We introduce Semantic Component Analysis (SCA), a novel topic modeling technique.
We find multiple, nuanced semantic components beyond a single topic in short texts.
Evaluated on multiple Twitter datasets, SCA matches the state-of-the-art method BERTopic in coherence and diversity.
arXiv Detail & Related papers (2024-10-28T14:09:52Z)
- CAST: Corpus-Aware Self-similarity Enhanced Topic modelling [16.562349140796115]
We introduce CAST: Corpus-Aware Self-similarity Enhanced Topic modelling, a novel topic modelling method.
We find self-similarity to be an effective metric to prevent functional words from acting as candidate topic words.
Our approach significantly enhances the coherence and diversity of generated topics, as well as the topic model's ability to handle noisy data.
arXiv Detail & Related papers (2024-10-19T15:27:11Z)
- From Text Segmentation to Smart Chaptering: A Novel Benchmark for Structuring Video Transcriptions [63.11097464396147]
We introduce a novel benchmark YTSeg focusing on spoken content that is inherently more unstructured and both topically and structurally diverse.
We also introduce an efficient hierarchical segmentation model MiniSeg, that outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2024-02-27T15:59:37Z)
- Diffusion Models for Open-Vocabulary Segmentation [79.02153797465324]
OVDiff is a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation.
It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training.
arXiv Detail & Related papers (2023-06-15T17:51:28Z)
- Topics in the Haystack: Extracting and Evaluating Topics beyond Coherence [0.0]
We propose a method that incorporates a deeper understanding of both sentence and document themes.
This allows our model to detect latent topics that may include uncommon words or neologisms.
We present correlation coefficients with human identification of intruder words and achieve near-human level results at the word-intrusion task.
arXiv Detail & Related papers (2023-03-30T12:24:25Z)
- Topic Segmentation Model Focusing on Local Context [1.9871897882042773]
We propose siamese sentence embedding layers, which process two input sentences independently to obtain an appropriate amount of information.
We also adopt multi-task learning techniques, including Same Topic Prediction (STP), Topic Classification (TC), and Next Sentence Prediction (NSP).
arXiv Detail & Related papers (2023-01-05T06:57:42Z)
- Toward Unifying Text Segmentation and Long Document Summarization [31.084738269628748]
We study the role that section segmentation plays in extractive summarization of written and spoken documents.
Our approach learns robust sentence representations by performing summarization and segmentation simultaneously.
Our findings suggest that the model not only achieves state-of-the-art performance on publicly available benchmarks but also demonstrates better cross-genre transferability.
arXiv Detail & Related papers (2022-10-28T22:07:10Z)
- Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding [95.78002228538841]
We propose a new open-world semantic segmentation pipeline that makes the first attempt to learn to segment semantic objects of various open-world categories without any dense-annotation effort.
On three benchmark datasets, our method directly segments objects of arbitrary categories, outperforming zero-shot segmentation methods that require data labeling.
arXiv Detail & Related papers (2022-07-18T09:20:04Z)
- Distant finetuning with discourse relations for stance classification [55.131676584455306]
We propose a new method to extract data with silver labels from raw text to finetune a model for stance classification.
We also propose a 3-stage training framework where the noisy level in the data used for finetuning decreases over different stages.
Our approach ranks 1st among 26 competing teams in the stance classification track of the NLPCC 2021 shared task Argumentative Text Understanding for AI Debater.
arXiv Detail & Related papers (2022-04-27T04:24:35Z)
- Multi-View Sequence-to-Sequence Models with Conversational Structure for Abstractive Dialogue Summarization [72.54873655114844]
Text summarization is one of the most challenging and interesting problems in NLP.
This work proposes a multi-view sequence-to-sequence model by first extracting conversational structures of unstructured daily chats from different views to represent conversations.
Experiments on a large-scale dialogue summarization corpus demonstrated that our methods significantly outperformed previous state-of-the-art models via both automatic evaluations and human judgment.
arXiv Detail & Related papers (2020-10-04T20:12:44Z)
- Topic-Aware Multi-turn Dialogue Modeling [91.52820664879432]
This paper presents a novel solution for multi-turn dialogue modeling, which segments and extracts topic-aware utterances in an unsupervised way.
Our topic-aware modeling is implemented by a newly proposed unsupervised topic-aware segmentation algorithm and Topic-Aware Dual-attention Matching (TADAM) Network.
arXiv Detail & Related papers (2020-09-26T08:43:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.