Related papers: Investigation of Topic Modelling Methods for Understanding the Reports of the Mining Projects in Queensland

URL: http://arxiv.org/abs/2111.03576v1
Date: Fri, 5 Nov 2021 15:52:03 GMT
Title: Investigation of Topic Modelling Methods for Understanding the Reports of the Mining Projects in Queensland
Authors: Yasuko Okamoto, Thirunavukarasu Balasubramaniam, Richi Nayak
Abstract summary: In the mining industry, many reports are generated in the project management process. Document clustering is a powerful approach to cope with the problem. Three methods, Latent Dirichlet Allocation (LDA), Nonnegative Matrix Factorization (NMF), and Nonnegative Tensor Factorization (NTF), are compared.
Score: 2.610470075814367
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In the mining industry, many reports are generated in the project management process. These past documents are a great resource of knowledge for future success. However, it would be a tedious and challenging task to retrieve the necessary information if the documents are unorganized and unstructured. Document clustering is a powerful approach to cope with the problem, and many methods have been introduced in past studies. Nonetheless, there is no silver bullet that can perform the best for all types of documents. Thus, exploratory studies are required to apply the clustering methods to new datasets. In this study, we will investigate multiple topic modelling (TM) methods. The objectives are finding the appropriate approach for the mining project reports using the dataset of the Geological Survey of Queensland, Department of Resources, Queensland Government, and understanding the contents to get the idea of how to organise them. Three TM methods, Latent Dirichlet Allocation (LDA), Nonnegative Matrix Factorization (NMF), and Nonnegative Tensor Factorization (NTF), are compared statistically and qualitatively. After the evaluation, we conclude that LDA performs best for the dataset; however, the possibility remains that the other methods could be adopted with some improvements.

Related papers

Missing Data in Signal Processing and Machine Learning: Models, Methods and Modern Approaches [49.431846265898486]
This tutorial aims to provide signal processing (SP) and machine learning (ML) practitioners with vital tools to answer the question: How to deal with missing data?
arXiv Detail & Related papers (2025-06-02T13:58:36Z)
SnipGen: A Mining Repository Framework for Evaluating LLMs for Code [51.07471575337676]
Language Models (LLMs) are trained on extensive datasets that include code repositories. Evaluating their effectiveness poses significant challenges due to the potential overlap between the datasets used for training and those employed for evaluation. We introduce SnipGen, a comprehensive repository mining framework designed to leverage prompt engineering across various downstream tasks for code generation.
arXiv Detail & Related papers (2025-02-10T21:28:15Z)
A Survey of Small Language Models [104.80308007044634]
Small Language Models (SLMs) have become increasingly important due to their efficiency and their ability to perform various language tasks with minimal computational resources. We present a comprehensive survey on SLMs, focusing on their architectures, training techniques, and model compression techniques.
arXiv Detail & Related papers (2024-10-25T23:52:28Z)
Enhancing literature review with LLM and NLP methods. Algorithmic trading case [0.0]
This study utilizes machine learning algorithms to analyze and organize knowledge in the field of algorithmic trading. By filtering a dataset of 136 million research papers, we identified 14,342 relevant articles published between 1956 and Q1 2020.
arXiv Detail & Related papers (2024-10-23T13:37:27Z)
Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored. We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches. We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z)
Out-of-Distribution Detection with Attention Head Masking for Multimodal Document Classification [3.141006099594433]
We propose a novel methodology termed attention head masking (AHM) for multi-modal OOD tasks in document classification systems. Our empirical results demonstrate that the proposed AHM method outperforms all state-of-the-art approaches. To address the scarcity of high-quality publicly available document datasets, we introduce FinanceDocs, a new document AI dataset.
arXiv Detail & Related papers (2024-08-20T23:30:00Z)
Multi-News+: Cost-efficient Dataset Cleansing via LLM-based Data Annotation [9.497148303350697]
We present a case study that extends the application of LLM-based data annotation to enhance the quality of existing datasets through a cleansing strategy. Specifically, we leverage approaches such as chain-of-thought and majority voting to imitate human annotation and classify unrelated documents from the Multi-News dataset.
arXiv Detail & Related papers (2024-04-15T11:36:10Z)
A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset. Deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive. Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z)
Capture the Flag: Uncovering Data Insights with Large Language Models [90.47038584812925]
This study explores the potential of using Large Language Models (LLMs) to automate the discovery of insights in data. We propose a new evaluation methodology based on a "capture the flag" principle, measuring the ability of such models to recognize meaningful and pertinent information (flags) in a dataset.
arXiv Detail & Related papers (2023-12-21T14:20:06Z)
Application of Transformers based methods in Electronic Medical Records: A Systematic Literature Review [77.34726150561087]
This work presents a systematic literature review of state-of-the-art advances using transformer-based methods on electronic medical records (EMRs) in different NLP tasks.
arXiv Detail & Related papers (2023-04-05T22:19:42Z)
Recent Few-Shot Object Detection Algorithms: A Survey with Performance Comparison [54.357707168883024]
Few-Shot Object Detection (FSOD) mimics the human ability of learning to learn. FSOD intelligently transfers the learned generic object knowledge from the common heavy-tailed to the novel long-tailed object classes. We give an overview of FSOD, including the problem definition, common datasets, and evaluation protocols.
arXiv Detail & Related papers (2022-03-27T04:11:28Z)
Data-to-Value: An Evaluation-First Methodology for Natural Language Projects [3.9378507882929554]
"Data to Value" (D2V) is a new methodology for big data text analytics projects. It is guided by a detailed catalog of questions in order to avoid a disconnect between the big data text analytics project team and the topic.
arXiv Detail & Related papers (2022-01-19T17:04:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.