TopicModel4J: A Java Package for Topic Models
- URL: http://arxiv.org/abs/2010.14707v1
- Date: Wed, 28 Oct 2020 02:33:41 GMT
- Title: TopicModel4J: A Java Package for Topic Models
- Authors: Yang Qian, Yuanchun Jiang, Yidong Chai, Yezheng Liu, Jiansha Sun
- Abstract summary: We design and implement a Java package, TopicModel4J, which contains 13 kinds of representative algorithms for fitting topic models.
The package provides an easy-to-use interface for data analysts to run the algorithms and makes it easy to input and output data.
- Score: 2.519906683279153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Topic models provide a flexible and principled framework for exploring hidden
structure in high-dimensional co-occurrence data and are commonly used in natural
language processing (NLP) of text. In this paper, we design and implement a
Java package, TopicModel4J, which contains 13 kinds of representative
algorithms for fitting topic models. Within the Java programming
environment, TopicModel4J provides an easy-to-use interface for data analysts to run the
algorithms and makes it easy to input and output data. In addition, this
package provides a few unstructured text preprocessing techniques, such as
splitting textual data into words, lowercasing the words, performing
lemmatization, and removing useless characters, URLs, and stop words.
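The preprocessing steps listed in the abstract can be illustrated with a minimal plain-Java sketch. It is not TopicModel4J's own API: the class name `PreprocessSketch`, the method `preprocess`, the URL pattern, and the tiny stop-word list are all assumptions made for illustration. Lemmatization is left out because it needs an external NLP library, and the non-letter filter keeps only ASCII letters, which is a simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of TopicModel4J.
final class PreprocessSketch {

    // Tiny illustrative stop-word list (an assumption; real lists are much longer).
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "are", "for", "in", "is", "of", "the", "to");

    // Removes URLs, lowercases, strips non-letter characters,
    // splits into words, and drops stop words. No lemmatization is performed.
    static List<String> preprocess(String text) {
        // 1. Remove URLs first so their fragments do not turn into tokens.
        String cleaned = text.replaceAll("(?i)\\b(?:https?://|www\\.)\\S+", " ");
        // 2. Lowercase with a fixed locale for reproducible results.
        cleaned = cleaned.toLowerCase(Locale.ROOT);
        // 3. Replace every run of non-letter characters with a single space.
        cleaned = cleaned.replaceAll("[^a-z]+", " ").trim();

        List<String> tokens = new ArrayList<>();
        if (cleaned.isEmpty()) {
            return tokens;
        }
        // 4. Split into words and 5. filter out stop words.
        for (String token : cleaned.split(" ")) {
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String doc = "Topic models are useful: see http://example.com for the details!";
        // prints [topic, models, useful, see, details]
        System.out.println(preprocess(doc));
    }
}
```

The resulting token lists are the kind of bag-of-words input that topic model fitting algorithms typically consume; a real pipeline would add a lemmatizer between the splitting and stop-word steps.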
Related papers
- Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation.
We introduce novel methodologies and datasets to overcome these challenges.
We propose MhBART, an encoder-decoder model designed to emulate human writing style.
We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z)
- AceParse: A Comprehensive Dataset with Diverse Structured Texts for Academic Literature Parsing [82.33075210051129]
We introduce AceParse, the first comprehensive dataset designed to support the parsing of structured texts.
Based on AceParse, we fine-tuned a multimodal model, named Ace, which accurately parses various structured texts.
This model outperforms the previous state-of-the-art by 4.1% in terms of F1 score and by 5% in Jaccard Similarity.
arXiv Detail & Related papers (2024-09-16T06:06:34Z)
- OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition [79.852642726105]
We propose a unified paradigm for parsing visually-situated text across diverse scenarios.
Specifically, we devise a universal model, called Omni, which can simultaneously handle three typical visually-situated text parsing tasks.
In Omni, all tasks share the unified encoder-decoder architecture, the unified objective point-conditioned text generation, and the unified input representation.
arXiv Detail & Related papers (2024-03-28T03:51:14Z)
- TopicGPT: A Prompt-based Topic Modeling Framework [77.72072691307811]
We introduce TopicGPT, a prompt-based framework that uses large language models to uncover latent topics in a text collection.
It produces topics that align better with human categorizations compared to competing methods.
Its topics are also interpretable, dispensing with ambiguous bags of words in favor of topics with natural language labels and associated free-form descriptions.
arXiv Detail & Related papers (2023-11-02T17:57:10Z)
- Let the Pretrained Language Models "Imagine" for Short Texts Topic Modeling [29.87929724277381]
In short texts, co-occurrence information is minimal, which results in feature sparsity in document representation.
Existing topic models (probabilistic or neural) mostly fail to mine patterns from them to generate coherent topics.
We extend short text into longer sequences using existing pre-trained language models (PLMs).
arXiv Detail & Related papers (2023-10-24T00:23:30Z)
- A Comprehensive Review of State-of-The-Art Methods for Java Code Generation from Natural Language Text [0.0]
This paper provides a comprehensive review of the evolution and progress of deep learning models in the Java code generation task.
We focus on the most important methods and present their merits and limitations, as well as the objective functions used by the community.
arXiv Detail & Related papers (2023-06-10T07:27:51Z)
- Stylized Data-to-Text Generation: A Case Study in the E-Commerce Domain [53.22419717434372]
We propose a new task, namely stylized data-to-text generation, whose aim is to generate coherent text according to a specific style.
This task is non-trivial, due to three challenges: the logic of the generated text, unstructured style reference, and biased training samples.
We propose a novel stylized data-to-text generation model, named StyleD2T, comprising three components: logic planning-enhanced data embedding, mask-based style embedding, and unbiased stylized text generation.
arXiv Detail & Related papers (2023-05-05T03:02:41Z)
- JOIST: A Joint Speech and Text Streaming Model For ASR [63.15848310748753]
We present JOIST, an algorithm to train a streaming, cascaded, encoder end-to-end (E2E) model with both speech-text paired inputs, and text-only unpaired inputs.
We find that the best text representation for JOIST improves WER across a variety of search and rare-word test sets by 4-14% relative, compared to a model not trained with text.
arXiv Detail & Related papers (2022-10-13T20:59:22Z)
- Modelling the semantics of text in complex document layouts using graph transformer networks [0.0]
We propose a model that approximates the human reading pattern of a document and outputs a unique semantic representation for every text span.
We base our architecture on a graph representation of the structured text, and we demonstrate that not only can we retrieve semantically similar information across documents but also that the embedding space we generate captures useful semantic information.
arXiv Detail & Related papers (2022-02-18T11:49:06Z)
- Robust Open-Vocabulary Translation from Visual Text Representations [15.646399508495133]
Machine translation models have discrete and commonly 'open-vocabulary' subword segmentation techniques.
This approach relies on consistent and correct underlying vocabularies.
Motivated by human language processing, we propose the use of visual text representations.
arXiv Detail & Related papers (2021-04-16T16:37:13Z)
- Learning to Synthesize Data for Semantic Parsing [57.190817162674875]
We propose a generative model which models the composition of programs and maps a program to an utterance.
Due to the simplicity of PCFG and pre-trained BART, our generative model can be efficiently learned from existing data at hand.
We evaluate our method in both in-domain and out-of-domain settings of text-to-Query parsing on the standard benchmarks of GeoQuery and Spider.
arXiv Detail & Related papers (2021-04-12T21:24:02Z)
- BATS: A Spectral Biclustering Approach to Single Document Topic Modeling and Segmentation [17.003488045214972]
Existing topic modeling and text segmentation methodologies generally require large datasets for training, limiting their capabilities when only small collections of text are available.
In developing a methodology to handle single documents, we face two major challenges.
First is sparse information: with access to only one document, we cannot train traditional topic models or deep learning algorithms.
Second is significant noise: a considerable portion of words in any single document will produce only noise and not help discern topics or segments.
arXiv Detail & Related papers (2020-08-05T16:34:33Z)