Evolving Text Data Stream Mining
- URL: http://arxiv.org/abs/2409.00010v1
- Date: Thu, 15 Aug 2024 15:38:52 GMT
- Title: Evolving Text Data Stream Mining
- Authors: Jay Kumar
- Abstract summary: A massive amount of text data is generated by online social platforms every day.
Learning useful information from such streaming data under the constraint of limited time and memory has gained increasing attention.
New learning models are proposed for clustering and multi-label learning on text streams.
- Score: 2.28438857884398
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: A text stream is an ordered sequence of text documents generated over time. A massive amount of such text data is generated by online social platforms every day. Designing an algorithm that extracts useful information from such text streams is challenging because of the unique properties of a stream: infinite length, data sparsity, and evolution. Learning useful information from such streaming data under the constraint of limited time and memory has therefore gained increasing attention. Although many text stream mining algorithms have been proposed during the past decade, some issues remain. First, high-dimensional text data heavily degrades learning performance unless the model either works on a subspace or reduces the global feature space. Second, it is difficult to extract semantic text representations of documents and to capture evolving topics over time. Moreover, labels are scarce, whereas existing approaches assume the full availability of labeled data. To deal with these issues, this thesis proposes new learning models for clustering and multi-label learning on text streams.
Related papers
- Text2Data: Low-Resource Data Generation with Textual Control [104.38011760992637]
Natural language serves as a common and straightforward control signal for humans to interact seamlessly with machines.
We propose Text2Data, a novel approach that utilizes unlabeled data to understand the underlying data distribution through an unsupervised diffusion model.
It undergoes controllable finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting.
arXiv Detail & Related papers (2024-02-08T03:41:39Z)
- Hierarchical Knowledge Distillation on Text Graph for Data-limited Attribute Inference [5.618638372635474]
We develop a text-graph-based few-shot learning model for attribute inferences on social media text data.
Our model first constructs and refines a text graph using manifold learning and message passing.
To further use cross-domain texts and unlabeled texts to improve few-shot performance, a hierarchical knowledge distillation is devised over text graph.
arXiv Detail & Related papers (2024-01-10T05:50:34Z)
- Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning [51.90524745663737]
A key innovation is our use of explanations as features, which can be used to boost GNN performance on downstream tasks.
Our method achieves state-of-the-art results on well-established TAG datasets.
Our method significantly speeds up training, achieving a 2.88 times improvement over the closest baseline on ogbn-arxiv.
arXiv Detail & Related papers (2023-05-31T03:18:03Z)
- Graph-based Semantical Extractive Text Analysis [0.0]
In this work, we improve the results of the TextRank algorithm by incorporating the semantic similarity between parts of the text.
Aside from keyword extraction and text summarization, we develop a topic clustering algorithm based on our framework.
arXiv Detail & Related papers (2022-12-19T18:30:26Z)
- Event Transition Planning for Open-ended Text Generation [55.729259805477376]
Open-ended text generation tasks require models to generate a coherent continuation given limited preceding context.
We propose a novel two-stage method which explicitly arranges the ensuing events in open-ended text generation.
Our approach can be understood as a specially-trained coarse-to-fine algorithm.
arXiv Detail & Related papers (2022-04-20T13:37:51Z)
- SCROLLS: Standardized CompaRison Over Long Language Sequences [62.574959194373264]
We introduce SCROLLS, a suite of tasks that require reasoning over long texts.
SCROLLS contains summarization, question answering, and natural language inference tasks.
We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
arXiv Detail & Related papers (2022-01-10T18:47:15Z)
- Text-Based Person Search with Limited Data [66.26504077270356]
Text-based person search (TBPS) aims at retrieving a target person from an image gallery with a descriptive text query.
We present a framework with two novel components to handle the problems brought by limited data.
arXiv Detail & Related papers (2021-10-20T22:20:47Z)
- Rethinking Text Segmentation: A Novel Dataset and A Text-Specific Refinement Approach [34.63444886780274]
Text segmentation is a prerequisite in real-world text-related tasks.
We introduce Text Refinement Network (TexRNet), a novel text segmentation approach.
TexRNet consistently improves text segmentation performance by nearly 2% compared to other state-of-the-art segmentation methods.
arXiv Detail & Related papers (2020-11-27T22:50:09Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models and verifies the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.