From Strings to Data Science: a Practical Framework for Automated String
Handling
- URL: http://arxiv.org/abs/2111.01868v2
- Date: Thu, 4 Nov 2021 08:14:12 GMT
- Title: From Strings to Data Science: a Practical Framework for Automated String
Handling
- Authors: John W. van Lith and Joaquin Vanschoren
- Abstract summary: Many machine learning libraries require that string features be converted to a numerical representation for the models to work as intended.
In this paper, we propose a framework to do so based on best practices, domain knowledge, and novel techniques.
It automatically identifies different types of string features, processes them accordingly, and encodes them into numerical representations.
- Score: 0.4079265319364249
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many machine learning libraries require that string features be converted to
a numerical representation for the models to work as intended. Categorical
string features can represent a wide variety of data (e.g., zip codes, names,
marital status), and are notoriously difficult to preprocess automatically. In
this paper, we propose a framework to do so based on best practices, domain
knowledge, and novel techniques. It automatically identifies different types of
string features, processes them accordingly, and encodes them into numerical
representations. We also provide an open source Python implementation to
automatically preprocess categorical string data in tabular datasets and
demonstrate promising results on a wide range of datasets.
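For intuition, the sketch below shows the general shape of such a pipeline: infer a coarse type for each string column with simple heuristics, then choose an encoder accordingly. The regex rules, thresholds, and helper names are illustrative assumptions, not the paper's actual implementation or API.
```python
# Minimal sketch (not the paper's actual API): infer a coarse string type for
# a column via simple heuristics, then choose an encoding strategy for it.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

def infer_string_type(col: pd.Series) -> str:
    sample = col.dropna().astype(str)
    if sample.str.match(r"^\d{4,6}$").mean() > 0.9:            # hypothetical zip-code rule
        return "zipcode"
    if sample.str.match(r"^\d{4}-\d{2}-\d{2}$").mean() > 0.9:  # hypothetical date rule
        return "date"
    if sample.nunique() <= 30:
        return "low_cardinality_categorical"
    return "free_text"

def encode_string_column(col: pd.Series):
    kind = infer_string_type(col)
    values = col.fillna("missing").astype(str)
    if kind == "low_cardinality_categorical":
        return OneHotEncoder(handle_unknown="ignore").fit_transform(values.to_frame())
    if kind == "free_text":
        return TfidfVectorizer(max_features=100).fit_transform(values)
    # zipcodes, dates, etc. would get dedicated treatment; integer codes as a stand-in
    return values.astype("category").cat.codes.to_numpy().reshape(-1, 1)
```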
Related papers
- Vectorizing string entries for data processing on tables: when are
larger language models better? [1.0840985826142429]
We study the benefits of language models in 14 analytical tasks on tables.
We show that larger language models tend to perform better, but it is useful to fine-tune them for embedding purposes.
arXiv Detail & Related papers (2023-12-15T09:23:56Z)
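As a rough illustration of vectorizing string entries with a language model: the model name and library below are assumptions for the sketch, not necessarily those studied in the paper.
```python
# Illustration only: embed each string entry with a small pretrained sentence
# encoder and use the vectors as numeric table features.
import pandas as pd
from sentence_transformers import SentenceTransformer

df = pd.DataFrame({"company": ["Acme Corp", "Globex", "Initech"]})
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(df["company"].tolist())   # shape: (n_rows, embedding_dim)
print(embeddings.shape)
```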
- Deepfake audio as a data augmentation technique for training automatic speech to text transcription models [55.2480439325792]
We propose a framework for data augmentation based on deepfake audio.
A dataset of English speech produced by Indian speakers was selected, ensuring the presence of a single accent.
arXiv Detail & Related papers (2023-09-22T11:33:03Z)
- Large Language Models for Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering [52.09178018466104]
We introduce Context-Aware Automated Feature Engineering (CAAFE) to generate semantically meaningful features for datasets.
Despite being methodologically simple, CAAFE improves performance on 11 out of 14 datasets.
We highlight the significance of context-aware solutions that can extend the scope of AutoML systems to semantic AutoML.
arXiv Detail & Related papers (2023-05-05T09:58:40Z)
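A hypothetical sketch in the spirit of the context-aware feature-engineering loop described above; the prompt wording and the `call_llm` placeholder are assumptions, not CAAFE's actual interface.
```python
# Hypothetical sketch (not CAAFE's real API): ask an LLM for one line of pandas
# feature code, run it, and keep the new feature only if validation accuracy improves.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def propose_feature(df: pd.DataFrame, description: str) -> str:
    prompt = (
        f"Dataset description: {description}\n"
        f"Columns: {list(df.columns)}\n"
        "Return one line of pandas code that adds a useful column to `df`."
    )
    return call_llm(prompt)

def accept_if_better(df: pd.DataFrame, y, code: str) -> pd.DataFrame:
    clf = RandomForestClassifier(random_state=0)
    baseline = cross_val_score(clf, df.select_dtypes("number"), y).mean()
    candidate = df.copy()
    exec(code, {"df": candidate, "pd": pd})   # generated code needs sandboxing in practice
    score = cross_val_score(clf, candidate.select_dtypes("number"), y).mean()
    return candidate if score > baseline else df
```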
- TabLLM: Few-shot Classification of Tabular Data with Large Language Models [66.03023402174138]
We study the application of large language models to zero-shot and few-shot classification of tabular data.
We evaluate several serialization methods including templates, table-to-text models, and large language models.
This approach is also competitive with strong traditional baselines like gradient-boosted trees.
arXiv Detail & Related papers (2022-10-19T17:08:13Z)
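A toy example of the template-based row serialization strategy mentioned above; the exact template wording is an assumption, not the paper's template.
```python
# Toy example of template-based row serialization for LLM-based classification.
import pandas as pd

def serialize_row(row: pd.Series) -> str:
    # "The <column> is <value>." for every cell in the row.
    return " ".join(f"The {col} is {row[col]}." for col in row.index)

df = pd.DataFrame({"age": [34], "occupation": ["teacher"], "income": ["<=50K"]})
print(serialize_row(df.iloc[0]))
# -> "The age is 34. The occupation is teacher. The income is <=50K."
```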
- Numeric Encoding Options with Automunge [0.0]
This paper offers arguments for the potential benefits of extended encodings of numeric streams in deep learning.
The proposals are based on the numeric transformation options available in the Automunge open-source Python library.
arXiv Detail & Related papers (2022-02-19T02:21:03Z)
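A generic illustration of "extended encodings" of a numeric stream (not the Automunge API): several parallel representations are derived from a single column so a model sees multiple views of the same values.
```python
# Generic illustration (not the Automunge API): derive a z-score, bin
# indicators, and a missingness flag from one numeric column.
import numpy as np
import pandas as pd

def extended_numeric_encodings(col: pd.Series, n_bins: int = 4) -> pd.DataFrame:
    out = pd.DataFrame(index=col.index)
    out["zscore"] = (col - col.mean()) / col.std()
    bins = pd.cut(col, bins=n_bins, labels=False)
    for b in range(n_bins):
        out[f"bin_{b}"] = (bins == b).astype(int)
    out["is_missing"] = col.isna().astype(int)
    return out

print(extended_numeric_encodings(pd.Series([1.0, 2.5, np.nan, 10.0])))
```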
- Multilingual training for Software Engineering [0.0]
We present evidence suggesting that human-written code performing the same function in different languages is rather similar.
We study this for 3 different tasks: code summarization, code retrieval, and function naming.
This data-augmenting approach is broadly compatible with different tasks, languages, and machine-learning models.
arXiv Detail & Related papers (2021-12-03T17:47:00Z)
- Benchmarking Multimodal AutoML for Tabular Data with Text Fields [83.43249184357053]
We assemble 18 multimodal data tables that each contain some text fields.
Our benchmark enables researchers to evaluate their own methods for supervised learning with numeric, categorical, and text features.
arXiv Detail & Related papers (2021-11-04T09:29:16Z)
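One common baseline for the mixed numeric/categorical/text tables in such a benchmark, shown only to illustrate the problem setting; the column names are hypothetical placeholders and this is not the benchmark's own code.
```python
# A common baseline for tables mixing numeric, categorical, and text columns.
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["price"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["brand"]),
    ("txt", TfidfVectorizer(), "description"),   # text column given as a single name
])
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])
# model.fit(train_df[["price", "brand", "description"]], train_df["label"])
```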
- Mill.jl and JsonGrinder.jl: automated differentiable feature extraction for learning from raw JSON data [0.0]
Learning from raw data input is one of the key components of successful applications of machine learning methods.
arXiv Detail & Related papers (2021-05-19T13:02:10Z)
- Data Engineering for HPC with Python [0.0]
Data engineering deals with a variety of data formats, storage, data extraction, transformation, and data movements.
One goal of data engineering is to transform data from original data to vector/matrix/tensor formats accepted by deep learning and machine learning applications.
We present a distributed Python API based on table abstraction for representing and processing data.
arXiv Detail & Related papers (2020-10-13T11:53:11Z)
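A minimal single-node illustration of the table-to-matrix/tensor step described above; the paper's distributed table API is not reproduced here, and the column names are made up for the example.
```python
# Minimal single-node illustration: turn a table into arrays an ML framework accepts.
import numpy as np
import pandas as pd

table = pd.DataFrame({"x1": [0.1, 0.4, 0.9], "x2": [3, 1, 2], "label": [0, 1, 1]})
features = table[["x1", "x2"]].to_numpy(dtype=np.float32)   # feature matrix
labels = table["label"].to_numpy(dtype=np.int64)
print(features.shape, labels.shape)
```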
- Scaling Systematic Literature Reviews with Machine Learning Pipelines [57.82662094602138]
Systematic reviews entail the extraction of data from scientific documents.
We construct a pipeline that automates each of these aspects, and experiment with many human-time vs. system quality trade-offs.
We find that the whole pipeline achieves surprising accuracy and generalisability with only two weeks of human-expert annotation.
arXiv Detail & Related papers (2020-10-09T16:19:42Z)
- OPFython: A Python-Inspired Optimum-Path Forest Classifier [68.8204255655161]
This paper proposes a Python-based Optimum-Path Forest framework, denoted as OPFython.
As OPFython is a Python-based library, it provides a friendlier environment and a faster prototyping workspace than the C language.
arXiv Detail & Related papers (2020-01-28T15:46:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.