Investigation of Topic Modelling Methods for Understanding the Reports
of the Mining Projects in Queensland
- URL: http://arxiv.org/abs/2111.03576v1
- Date: Fri, 5 Nov 2021 15:52:03 GMT
- Title: Investigation of Topic Modelling Methods for Understanding the Reports
of the Mining Projects in Queensland
- Authors: Yasuko Okamoto, Thirunavukarasu Balasubramaniam, Richi Nayak
- Abstract summary: In the mining industry, many reports are generated in the project management process.
Document clustering is a powerful approach to retrieving information from such reports when they are unorganized and unstructured.
Three methods, Latent Dirichlet Allocation (LDA), Nonnegative Matrix Factorization (NMF), and Nonnegative Tensor Factorization (NTF), are compared.
- Score: 2.610470075814367
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the mining industry, many reports are generated in the project management
process. These past documents are a great resource of knowledge for future
success. However, it would be a tedious and challenging task to retrieve the
necessary information if the documents are unorganized and unstructured.
Document clustering is a powerful approach to cope with the problem, and many
methods have been introduced in past studies. Nonetheless, there is no silver
bullet that performs best for every type of document. Thus, exploratory
studies are required before applying clustering methods to new datasets. In this
study, we investigate multiple topic modelling (TM) methods. The
objectives are to find an appropriate approach for the mining project reports,
using the dataset of the Geological Survey of Queensland, Department of
Resources, Queensland Government, and to understand their contents well enough
to decide how to organise them. Three TM methods, Latent Dirichlet Allocation
(LDA), Nonnegative Matrix Factorization (NMF), and Nonnegative Tensor
Factorization (NTF), are compared statistically and qualitatively. After the
evaluation, we conclude that LDA performs best on this dataset;
however, the possibility remains that the other methods could be adopted with
some improvements.
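For readers unfamiliar with the compared methods, the following is a minimal sketch of NMF-based topic modelling, not the authors' implementation. It factorises a small document-term count matrix with the standard multiplicative update rules, using only the Python standard library. The toy corpus, the function names (`nmf`, `frobenius_error`), and the choice of two topics are illustrative assumptions; none of them come from the paper or from the Geological Survey of Queensland dataset.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(row) for row in zip(*m)]


def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Approximate a nonnegative matrix V (docs x terms) as W (docs x k) times H (k x terms)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random initialisation keeps every later update well defined.
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return w, h


def frobenius_error(v, w, h):
    """Frobenius norm of the reconstruction residual V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5


# Hypothetical toy corpus standing in for report text (not the paper's data).
docs = [
    "drilling core sample assay gold",
    "gold assay drilling sample",
    "tenure permit environmental rehabilitation",
    "environmental permit rehabilitation tenure report",
]
vocab = sorted({term for doc in docs for term in doc.split()})
v = [[doc.split().count(term) for term in vocab] for doc in docs]

w, h = nmf(v, k=2)

# Rows of H weight terms per topic; rows of W weight topics per document.
for topic, weights in enumerate(h):
    top_terms = sorted(zip(weights, vocab), reverse=True)[:3]
    print("topic", topic, [term for _, term in top_terms])
print("dominant topic per document:",
      [max(range(2), key=lambda t: row[t]) for row in w])
print("reconstruction error:", round(frobenius_error(v, w, h), 4))
```

Each row of `H` can be read as a topic (a weighting over terms) and each row of `W` as a document's mixture of topics. LDA plays the same role with a probabilistic model, and NTF extends the factorisation from a matrix to a higher-order tensor; neither is shown here.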
Related papers
- Multi-News+: Cost-efficient Dataset Cleansing via LLM-based Data Annotation [9.497148303350697]
We present a case study that extends the application of large language models (LLMs) for data annotation to enhance the quality of existing datasets.
Specifically, we leverage approaches such as chain-of-thought (CoT) and majority voting to imitate human annotation and classify unrelated documents from the Multi-News dataset.
arXiv Detail & Related papers (2024-04-15T11:36:10Z)
- A Survey on Data Selection for Language Models [151.6210632830082]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence, and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z)
- A Reliable Knowledge Processing Framework for Combustion Science using Foundation Models [0.0]
The study introduces an approach to process diverse combustion research data, spanning experimental studies, simulations, and literature.
The developed approach minimizes computational and economic expenses while optimizing data privacy and accuracy.
The framework consistently delivers accurate domain-specific responses with minimal human oversight.
arXiv Detail & Related papers (2023-12-31T17:15:25Z)
- Capture the Flag: Uncovering Data Insights with Large Language Models [90.47038584812925]
This study explores the potential of using Large Language Models (LLMs) to automate the discovery of insights in data.
We propose a new evaluation methodology based on a "capture the flag" principle, measuring the ability of such models to recognize meaningful and pertinent information (flags) in a dataset.
arXiv Detail & Related papers (2023-12-21T14:20:06Z)
- Application of Transformers based methods in Electronic Medical Records: A Systematic Literature Review [77.34726150561087]
This work presents a systematic literature review of state-of-the-art advances using transformer-based methods on electronic medical records (EMRs) in different NLP tasks.
arXiv Detail & Related papers (2023-04-05T22:19:42Z)
- Recent Few-Shot Object Detection Algorithms: A Survey with Performance Comparison [54.357707168883024]
Few-Shot Object Detection (FSOD) mimics the human ability of learning to learn.
FSOD intelligently transfers the learned generic object knowledge from the common heavy-tailed classes to the novel long-tailed object classes.
We give an overview of FSOD, including the problem definition, common datasets, and evaluation protocols.
arXiv Detail & Related papers (2022-03-27T04:11:28Z)
- A Multi-Document Coverage Reward for RELAXed Multi-Document Summarization [11.02198476454955]
We propose fine-tuning an MDS baseline with a reward that balances a reference-based metric with coverage of the input documents.
Experimental results over the Multi-News and WCEP MDS datasets show significant improvements of up to +0.95 pp average ROUGE score and +3.17 pp METEOR score over the baseline.
arXiv Detail & Related papers (2022-03-06T07:33:01Z)
- Data-to-Value: An Evaluation-First Methodology for Natural Language Projects [3.9378507882929554]
"Data to Value" (D2V) is a new methodology for big data text analytics projects.
It is guided by a detailed catalog of questions in order to avoid a disconnect between the big data text analytics project team and the topic.
arXiv Detail & Related papers (2022-01-19T17:04:52Z)
- Combining Feature and Instance Attribution to Detect Artifacts [62.63504976810927]
We propose methods to facilitate identification of training data artifacts.
We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data.
We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice.
arXiv Detail & Related papers (2021-07-01T09:26:13Z)
- A Taxonomy of Similarity Metrics for Markov Decision Processes [62.997667081978825]
In recent years, transfer learning has succeeded in making Reinforcement Learning (RL) algorithms more efficient.
In this paper, we propose a categorization of these metrics and analyze the definitions of similarity proposed so far.
arXiv Detail & Related papers (2021-03-08T12:36:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.