Transforming Unstructured Text into Data with Context Rule Assisted
Machine Learning (CRAML)
- URL: http://arxiv.org/abs/2301.08549v1
- Date: Fri, 20 Jan 2023 13:12:35 GMT
- Title: Transforming Unstructured Text into Data with Context Rule Assisted
Machine Learning (CRAML)
- Authors: Stephen Meisenbacher, Peter Norlander
- Abstract summary: The Context Rule Assisted Machine Learning (CRAML) method allows accurate and reproducible labeling of massive volumes of unstructured text.
CRAML enables domain experts to access uncommon constructs buried within a document corpus.
We present three use cases for CRAML: we analyze recent management literature that draws from text data, describe and release new machine learning models from an analysis of proprietary job advertisement text, and present findings of social and economic interest from a public corpus of franchise documents.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We describe a method and new no-code software tools enabling domain experts
to build custom structured, labeled datasets from the unstructured text of
documents and build niche machine learning text classification models traceable
to expert-written rules. The Context Rule Assisted Machine Learning (CRAML)
method allows accurate and reproducible labeling of massive volumes of
unstructured text. CRAML enables domain experts to access uncommon constructs
buried within a document corpus, and avoids limitations of current
computational approaches that often lack context, transparency, and
interpretability. In this research methods paper, we present three use cases for
CRAML: we analyze recent management literature that draws from text data,
describe and release new machine learning models from an analysis of
proprietary job advertisement text, and present findings of social and economic
interest from a public corpus of franchise documents. CRAML produces
document-level coded tabular datasets that can be used for quantitative
academic research, and allows qualitative researchers to scale niche
classification schemes over massive text data. CRAML is a low-resource,
flexible, and scalable methodology for building training data for supervised
ML. We make available as open-source resources: the software, job advertisement
text classifiers, a novel corpus of franchise documents, and a fully replicable
start-to-finish trained example in the context of no poach clauses.
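To make the abstract's rule-based labeling concrete, below is a minimal sketch of context-rule labeling in Python. The rule format, window size, and the no_poach / non_compete labels are illustrative assumptions only; the released CRAML software defines its own rule schema and no-code workflow.

```python
import re
import pandas as pd

# Hypothetical context rules: a rule fires when its keyword appears and a
# context term occurs within a fixed character window around the match.
# (Illustrative only; not the actual CRAML rule schema.)
RULES = [
    {"label": "no_poach", "keyword": r"no[- ]poach",
     "context": r"\b(employee|hire|solicit)", "window": 60},
    {"label": "non_compete", "keyword": r"non[- ]compete",
     "context": r"\b(agreement|clause)", "window": 60},
]

def label_document(doc_id: str, text: str) -> dict:
    """Apply every context rule to one document and return a coded row."""
    row = {"doc_id": doc_id}
    lowered = text.lower()
    for rule in RULES:
        hit = 0
        for m in re.finditer(rule["keyword"], lowered):
            start = max(0, m.start() - rule["window"])
            window = lowered[start:m.end() + rule["window"]]
            if re.search(rule["context"], window):
                hit = 1
                break
        row[rule["label"]] = hit
    return row

docs = {
    "franchise_001": "Franchisee adopts a no-poach policy: no soliciting "
                     "or hiring any employee of another franchisee.",
    "franchise_002": "This agreement contains no restrictive covenants.",
}
table = pd.DataFrame([label_document(d, t) for d, t in docs.items()])
print(table)  # document-level coded tabular dataset
```

Stacked over a large corpus, rows like these are the kind of document-level coded tabular data the abstract describes, usable directly for quantitative research or as training data for a supervised classifier.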
Related papers
- OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [70.72097493954067]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks, and agent systems.
We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z)
- fPLSA: Learning Semantic Structures in Document Collections Using Foundation Models [19.099810900404357]
We introduce fPLSA, a foundation-model-based Probabilistic Latent Semantic Analysis (PLSA) method.
fPLSA iteratively clusters and tags document segments based on document-level contexts.
Our experiments on story writing, math, and multi-step reasoning datasets demonstrate that fPLSA tags help reconstruct the original texts better than existing tagging methods.
arXiv Detail & Related papers (2024-10-07T20:25:52Z)
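As a loose illustration of the iterative cluster-and-tag loop the fPLSA summary describes (not the authors' implementation), the sketch below clusters segments with TF-IDF and k-means, with a stand-in call_llm function where a foundation model would write each cluster's tag; every name and parameter here is hypothetical.

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def call_llm(prompt: str) -> str:
    # Stand-in for a foundation-model call; the "tag" is just the most
    # frequent long word in the prompt. Swap in a real LLM API in practice.
    words = [w for w in prompt.lower().split() if w.isalpha() and len(w) > 3]
    return Counter(words).most_common(1)[0][0] if words else "misc"

def cluster_and_tag(segments, k=2, rounds=3):
    tags = [""] * len(segments)
    for _ in range(rounds):
        # Re-embed each segment together with its current tag, so tagging
        # feeds back into the next round of clustering.
        texts = [f"{tag} {seg}" for tag, seg in zip(tags, segments)]
        X = TfidfVectorizer().fit_transform(texts)
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        for c in range(k):
            members = [s for s, l in zip(segments, labels) if l == c]
            tag = call_llm("Describe the shared structure of:\n" + "\n".join(members))
            tags = [tag if l == c else t for t, l in zip(tags, labels)]
    return list(zip(segments, tags))

segments = ["the hero sets out on a journey", "the hero returns home changed",
            "solve for x in the equation", "check the solution by substitution"]
print(cluster_and_tag(segments))
```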
- Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z)
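For the encoding-based branch of the classification framework mentioned in that entry, a plausible minimal setup is an encoder with a sequence-classification head over (old, new) sentence pairs; the label set and the pair-encoding choice below are assumptions, not the paper's actual design.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical edit-intent labels; the paper's taxonomy may differ.
LABELS = ["grammar", "clarity", "fact", "style"]

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=len(LABELS))

# Encode the revision as a sentence pair: text before and after the edit.
old, new = "The result are significant.", "The results are significant."
batch = tok(old, new, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**batch).logits
# Prediction is arbitrary until the head is fine-tuned on labeled edits.
print(LABELS[int(logits.argmax())])
```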
- From Text to Insight: Large Language Models for Materials Science Data Extraction [4.08853418443192]
The vast majority of materials science knowledge exists in unstructured natural language.
Structured data is crucial for innovative and systematic materials design.
The advent of large language models (LLMs) represents a significant shift.
arXiv Detail & Related papers (2024-07-23T22:23:47Z)
- Systematic Task Exploration with LLMs: A Study in Citation Text Generation [63.50597360948099]
Large language models (LLMs) bring unprecedented flexibility in defining and executing complex, creative natural language generation (NLG) tasks.
We propose a three-component research framework that consists of systematic input manipulation, reference data, and output measurement.
We use this framework to explore citation text generation -- a popular scholarly NLP task that lacks consensus on the task definition and evaluation metric.
arXiv Detail & Related papers (2024-07-04T16:41:08Z)
- From Text Segmentation to Smart Chaptering: A Novel Benchmark for Structuring Video Transcriptions [63.11097464396147]
We introduce a novel benchmark YTSeg focusing on spoken content that is inherently more unstructured and both topically and structurally diverse.
We also introduce an efficient hierarchical segmentation model MiniSeg, that outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2024-02-27T15:59:37Z)
- Object Recognition from Scientific Document based on Compartment Refinement Framework [2.699900017799093]
It has become increasingly important to extract valuable information from vast resources efficiently.
Current data extraction methods for scientific documents typically use rule-based (RB) or machine learning (ML) approaches.
We propose a new document layout analysis framework called CTBR (Compartment & Text Blocks Refinement).
arXiv Detail & Related papers (2023-12-14T15:36:49Z)
- DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge.
Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z)
- Interactive Distillation of Large Single-Topic Corpora of Scientific Papers [1.2954493726326113]
A more robust but time-consuming approach is to build the dataset constructively, with a subject-matter expert handpicking documents.
Here we showcase a new tool, based on machine learning, for constructively generating targeted datasets of scientific literature.
arXiv Detail & Related papers (2023-09-19T17:18:36Z)
- Schema-aware Reference as Prompt Improves Data-Efficient Knowledge Graph Construction [57.854498238624366]
We propose a retrieval-augmented approach, which retrieves schema-aware Reference As Prompt (RAP) for data-efficient knowledge graph construction.
RAP can dynamically leverage schema and knowledge inherited from human-annotated and weak-supervised data as a prompt for each sample.
arXiv Detail & Related papers (2022-10-19T16:40:28Z)
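A toy rendering of the RAP idea as summarized above, retrieving a schema-aware annotated reference and splicing it into the prompt for each sample, might look like the following; the store, schema string, and TF-IDF retrieval are all illustrative stand-ins, not the authors' implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Tiny annotated store: text paired with gold (head, relation, tail) triples.
STORE = [
    ("Marie Curie was born in Warsaw.", [("Marie Curie", "born_in", "Warsaw")]),
    ("Tesla was founded by Elon Musk.", [("Tesla", "founded_by", "Elon Musk")]),
]
SCHEMA = "allowed relations: born_in, founded_by, located_in"

vec = TfidfVectorizer().fit([text for text, _ in STORE])
ref_matrix = vec.transform([text for text, _ in STORE])

def build_rap_prompt(sample: str) -> str:
    # Retrieve the most similar annotated example and prepend it,
    # together with the schema, as the prompt for this sample.
    sims = cosine_similarity(vec.transform([sample]), ref_matrix)[0]
    text, triples = STORE[int(sims.argmax())]
    return f"{SCHEMA}\nReference: {text} -> {triples}\nExtract triples from: {sample}"

print(build_rap_prompt("Warsaw is located in Poland."))
```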
- Case Studies on using Natural Language Processing Techniques in Customer Relationship Management Software [0.0]
We trained word embeddings on the corresponding text corpus and showed that these embeddings can be used not only directly for data mining but also in RNN architectures.
The results show that text data in a CRM can be mined for very valuable information.
arXiv Detail & Related papers (2021-06-09T16:07:07Z)
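A minimal version of the pipeline that entry describes, training word embeddings on a CRM text corpus and reusing the vectors both for direct similarity mining and as RNN input weights, could look like the sketch below; the corpus and hyperparameters are placeholders.

```python
from gensim.models import Word2Vec

# Toy stand-in for a CRM text corpus: one tokenized note per record.
corpus = [
    "customer reports login failure after password reset".split(),
    "customer requests refund for duplicate payment".split(),
    "customer asks to upgrade subscription plan".split(),
]

# Train embeddings on the corpus (gensim >= 4 API).
model = Word2Vec(sentences=corpus, vector_size=50, window=5,
                 min_count=1, epochs=100)

# Direct data mining: nearest neighbors in embedding space.
print(model.wv.most_similar("refund", topn=3))

# The same matrix can initialize an embedding layer in an RNN classifier.
embedding_weights = model.wv.vectors  # shape: (vocab_size, 50)
```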