Investigation of Topic Modelling Methods for Understanding the Reports
of the Mining Projects in Queensland
- URL: http://arxiv.org/abs/2111.03576v1
- Date: Fri, 5 Nov 2021 15:52:03 GMT
- Title: Investigation of Topic Modelling Methods for Understanding the Reports
of the Mining Projects in Queensland
- Authors: Yasuko Okamoto, Thirunavukarasu Balasubramaniam, Richi Nayak
- Abstract summary: In the mining industry, many reports are generated in the project management process.
Document clustering is a powerful approach to cope with the problem.
Three methods, Latent Dirichlet Allocation (LDA), Nonnegative Matrix Factorization (NMF), and Nonnegative Factorization (NTF) are compared.
- Score: 2.610470075814367
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the mining industry, many reports are generated in the project management
process. These past documents are a great resource of knowledge for future
success. However, it would be a tedious and challenging task to retrieve the
necessary information if the documents are unorganized and unstructured.
Document clustering is a powerful approach to cope with the problem, and many
methods have been introduced in past studies. Nonetheless, there is no silver
bullet that can perform the best for any types of documents. Thus, exploratory
studies are required to apply the clustering methods for new datasets. In this
study, we will investigate multiple topic modelling (TM) methods. The
objectives are finding the appropriate approach for the mining project reports
using the dataset of the Geological Survey of Queensland, Department of
Resources, Queensland Government, and understanding the contents to get the
idea of how to organise them. Three TM methods, Latent Dirichlet Allocation
(LDA), Nonnegative Matrix Factorization (NMF), and Nonnegative Tensor
Factorization (NTF) are compared statistically and qualitatively. After the
evaluation, we conclude that the LDA performs the best for the dataset;
however, the possibility remains that the other methods could be adopted with
some improvements.
Related papers
- A Survey of Small Language Models [104.80308007044634]
Small Language Models (SLMs) have become increasingly important due to their efficiency and performance to perform various language tasks with minimal computational resources.
We present a comprehensive survey on SLMs, focusing on their architectures, training techniques, and model compression techniques.
arXiv Detail & Related papers (2024-10-25T23:52:28Z) - Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z) - Out-of-Distribution Detection with Attention Head Masking for Multimodal Document Classification [3.141006099594433]
We propose a novel methodology termed as attention head masking (AHM) for multi-modal OOD tasks in document classification systems.
Our empirical results demonstrate that the proposed AHM method outperforms all state-of-the-art approaches.
To address the scarcity of high-quality publicly available document datasets, we introduce FinanceDocs, a new document AI dataset.
arXiv Detail & Related papers (2024-08-20T23:30:00Z) - Multi-News+: Cost-efficient Dataset Cleansing via LLM-based Data Annotation [9.497148303350697]
We present a case study that extends the application of LLM-based data annotation to enhance the quality of existing datasets through a cleansing strategy.
Specifically, we leverage approaches such as chain-of-thought and majority voting to imitate human annotation and classify unrelated documents from the Multi-News dataset.
arXiv Detail & Related papers (2024-04-15T11:36:10Z) - A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z) - Capture the Flag: Uncovering Data Insights with Large Language Models [90.47038584812925]
This study explores the potential of using Large Language Models (LLMs) to automate the discovery of insights in data.
We propose a new evaluation methodology based on a "capture the flag" principle, measuring the ability of such models to recognize meaningful and pertinent information (flags) in a dataset.
arXiv Detail & Related papers (2023-12-21T14:20:06Z) - Application of Transformers based methods in Electronic Medical Records:
A Systematic Literature Review [77.34726150561087]
This work presents a systematic literature review of state-of-the-art advances using transformer-based methods on electronic medical records (EMRs) in different NLP tasks.
arXiv Detail & Related papers (2023-04-05T22:19:42Z) - Recent Few-Shot Object Detection Algorithms: A Survey with Performance
Comparison [54.357707168883024]
Few-Shot Object Detection (FSOD) mimics the humans' ability of learning to learn.
FSOD intelligently transfers the learned generic object knowledge from the common heavy-tailed, to the novel long-tailed object classes.
We give an overview of FSOD, including the problem definition, common datasets, and evaluation protocols.
arXiv Detail & Related papers (2022-03-27T04:11:28Z) - Data-to-Value: An Evaluation-First Methodology for Natural Language
Projects [3.9378507882929554]
"Data to Value" (D2V) is a new methodology for big data text analytics projects.
It is guided by a detailed catalog of questions in order to avoid a disconnect between big data text analytics project team and the topic.
arXiv Detail & Related papers (2022-01-19T17:04:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.