From Strings to Data Science: a Practical Framework for Automated String
Handling
- URL: http://arxiv.org/abs/2111.01868v2
- Date: Thu, 4 Nov 2021 08:14:12 GMT
- Title: From Strings to Data Science: a Practical Framework for Automated String
Handling
- Authors: John W. van Lith and Joaquin Vanschoren
- Abstract summary: Many machine learning libraries require that string features be converted to a numerical representation for the models to work as intended.
In this paper, we propose a framework to do so based on best practices, domain knowledge, and novel techniques.
It automatically identifies different types of string features, processes them accordingly, and encodes them into numerical representations.
- Score: 0.4079265319364249
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many machine learning libraries require that string features be converted to
a numerical representation for the models to work as intended. Categorical
string features can represent a wide variety of data (e.g., zip codes, names,
marital status), and are notoriously difficult to preprocess automatically. In
this paper, we propose a framework to do so based on best practices, domain
knowledge, and novel techniques. It automatically identifies different types of
string features, processes them accordingly, and encodes them into numerical
representations. We also provide an open source Python implementation to
automatically preprocess categorical string data in tabular datasets and
demonstrate promising results on a wide range of datasets.
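For intuition, the sketch below shows the general shape of such a pipeline: infer a coarse type for each string column with simple heuristics, then choose an encoder accordingly. The regex rules, thresholds, and helper names are illustrative assumptions, not the paper's actual implementation or API.
```python
# Minimal sketch (not the paper's actual API): infer a coarse string type for
# a column via simple heuristics, then choose an encoding strategy for it.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

def infer_string_type(col: pd.Series) -> str:
    sample = col.dropna().astype(str)
    if sample.str.match(r"^\d{4,6}$").mean() > 0.9:            # hypothetical zip-code rule
        return "zipcode"
    if sample.str.match(r"^\d{4}-\d{2}-\d{2}$").mean() > 0.9:  # hypothetical date rule
        return "date"
    if sample.nunique() <= 30:
        return "low_cardinality_categorical"
    return "free_text"

def encode_string_column(col: pd.Series):
    kind = infer_string_type(col)
    values = col.fillna("missing").astype(str)
    if kind == "low_cardinality_categorical":
        return OneHotEncoder(handle_unknown="ignore").fit_transform(values.to_frame())
    if kind == "free_text":
        return TfidfVectorizer(max_features=100).fit_transform(values)
    # zipcodes, dates, etc. would get dedicated treatment; integer codes as a stand-in
    return values.astype("category").cat.codes.to_numpy().reshape(-1, 1)
```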
Related papers
- Vectorizing string entries for data processing on tables: when are
larger language models better? [1.0840985826142429]
We study the benefits of language models in 14 analytical tasks on tables.
We show that larger language models tend to perform better, but it is useful to fine-tune them for embedding purposes.
arXiv Detail & Related papers (2023-12-15T09:23:56Z)
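As a rough illustration of vectorizing string entries with a language model: the model name and library below are assumptions for the sketch, not necessarily those studied in the paper.
```python
# Illustration only: embed each string entry with a small pretrained sentence
# encoder and use the vectors as numeric table features.
import pandas as pd
from sentence_transformers import SentenceTransformer

df = pd.DataFrame({"company": ["Acme Corp", "Globex", "Initech"]})
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(df["company"].tolist())   # shape: (n_rows, embedding_dim)
print(embeddings.shape)
```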
- Deepfake audio as a data augmentation technique for training automatic speech to text transcription models [55.2480439325792]
We propose a framework for data augmentation based on deepfake audio.
A dataset of English speech produced by Indian speakers was selected, ensuring the presence of a single accent.
arXiv Detail & Related papers (2023-09-22T11:33:03Z)
- Large Language Models for Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering [52.09178018466104]
We introduce Context-Aware Automated Feature Engineering (CAAFE) to generate semantically meaningful features for datasets.
Despite being methodologically simple, CAAFE improves performance on 11 out of 14 datasets.
We highlight the significance of context-aware solutions that can extend the scope of AutoML systems to semantic AutoML.
arXiv Detail & Related papers (2023-05-05T09:58:40Z)
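A hypothetical sketch in the spirit of the context-aware feature-engineering loop described above; the prompt wording and the `call_llm` placeholder are assumptions, not CAAFE's actual interface.
```python
# Hypothetical sketch (not CAAFE's real API): ask an LLM for one line of pandas
# feature code, run it, and keep the new feature only if validation accuracy improves.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def propose_feature(df: pd.DataFrame, description: str) -> str:
    prompt = (
        f"Dataset description: {description}\n"
        f"Columns: {list(df.columns)}\n"
        "Return one line of pandas code that adds a useful column to `df`."
    )
    return call_llm(prompt)

def accept_if_better(df: pd.DataFrame, y, code: str) -> pd.DataFrame:
    clf = RandomForestClassifier(random_state=0)
    baseline = cross_val_score(clf, df.select_dtypes("number"), y).mean()
    candidate = df.copy()
    exec(code, {"df": candidate, "pd": pd})   # generated code needs sandboxing in practice
    score = cross_val_score(clf, candidate.select_dtypes("number"), y).mean()
    return candidate if score > baseline else df
```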
- TabLLM: Few-shot Classification of Tabular Data with Large Language Models [66.03023402174138]
We study the application of large language models to zero-shot and few-shot classification of tabular data.
We evaluate several serialization methods including templates, table-to-text models, and large language models.
This approach is also competitive with strong traditional baselines like gradient-boosted trees.
arXiv Detail & Related papers (2022-10-19T17:08:13Z)
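A toy example of the template-based row serialization strategy mentioned above; the exact template wording is an assumption, not the paper's template.
```python
# Toy example of template-based row serialization for LLM-based classification.
import pandas as pd

def serialize_row(row: pd.Series) -> str:
    # "The <column> is <value>." for every cell in the row.
    return " ".join(f"The {col} is {row[col]}." for col in row.index)

df = pd.DataFrame({"age": [34], "occupation": ["teacher"], "income": ["<=50K"]})
print(serialize_row(df.iloc[0]))
# -> "The age is 34. The occupation is teacher. The income is <=50K."
```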
- Numeric Encoding Options with Automunge [0.0]
This paper offers arguments for the potential benefits of extended encodings of numeric streams in deep learning.
The proposals are based on the numeric transformation options available in the Automunge open-source Python library.
arXiv Detail & Related papers (2022-02-19T02:21:03Z)
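A generic illustration of "extended encodings" of a numeric stream (not the Automunge API): several parallel representations are derived from a single column so a model sees multiple views of the same values.
```python
# Generic illustration (not the Automunge API): derive a z-score, bin
# indicators, and a missingness flag from one numeric column.
import numpy as np
import pandas as pd

def extended_numeric_encodings(col: pd.Series, n_bins: int = 4) -> pd.DataFrame:
    out = pd.DataFrame(index=col.index)
    out["zscore"] = (col - col.mean()) / col.std()
    bins = pd.cut(col, bins=n_bins, labels=False)
    for b in range(n_bins):
        out[f"bin_{b}"] = (bins == b).astype(int)
    out["is_missing"] = col.isna().astype(int)
    return out

print(extended_numeric_encodings(pd.Series([1.0, 2.5, np.nan, 10.0])))
```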
- Multilingual training for Software Engineering [0.0]
We present evidence suggesting that human-written code performing the same function in different languages is rather similar.
We study this for 3 different tasks: code summarization, code retrieval, and function naming.
This data-augmenting approach is broadly compatible with different tasks, languages, and machine-learning models.
arXiv Detail & Related papers (2021-12-03T17:47:00Z)
- Benchmarking Multimodal AutoML for Tabular Data with Text Fields [83.43249184357053]
We assemble 18 multimodal data tables that each contain some text fields.
Our benchmark enables researchers to evaluate their own methods for supervised learning with numeric, categorical, and text features.
arXiv Detail & Related papers (2021-11-04T09:29:16Z)
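One common baseline for the mixed numeric/categorical/text tables in such a benchmark, shown only to illustrate the problem setting; the column names are hypothetical placeholders and this is not the benchmark's own code.
```python
# A common baseline for tables mixing numeric, categorical, and text columns.
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["price"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["brand"]),
    ("txt", TfidfVectorizer(), "description"),   # text column given as a single name
])
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])
# model.fit(train_df[["price", "brand", "description"]], train_df["label"])
```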
- Mill.jl and JsonGrinder.jl: automated differentiable feature extraction for learning from raw JSON data [0.0]
Learning from raw data input is one of the key components of successful applications of machine learning methods.
arXiv Detail & Related papers (2021-05-19T13:02:10Z)
- Data Engineering for HPC with Python [0.0]
Data engineering deals with a variety of data formats, storage, data extraction, transformation, and data movements.
One goal of data engineering is to transform data from original data to vector/matrix/tensor formats accepted by deep learning and machine learning applications.
We present a distributed Python API based on table abstraction for representing and processing data.
arXiv Detail & Related papers (2020-10-13T11:53:11Z)
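A minimal single-node illustration of the table-to-matrix/tensor step described above; the paper's distributed table API is not reproduced here, and the column names are made up for the example.
```python
# Minimal single-node illustration: turn a table into arrays an ML framework accepts.
import numpy as np
import pandas as pd

table = pd.DataFrame({"x1": [0.1, 0.4, 0.9], "x2": [3, 1, 2], "label": [0, 1, 1]})
features = table[["x1", "x2"]].to_numpy(dtype=np.float32)   # feature matrix
labels = table["label"].to_numpy(dtype=np.int64)
print(features.shape, labels.shape)
```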
- Scaling Systematic Literature Reviews with Machine Learning Pipelines [57.82662094602138]
Systematic reviews entail the extraction of data from scientific documents.
We construct a pipeline that automates each of these aspects, and experiment with many human-time vs. system quality trade-offs.
We find that the whole pipeline achieves surprising accuracy and generalisability with only two weeks of human-expert annotation.
arXiv Detail & Related papers (2020-10-09T16:19:42Z)
- OPFython: A Python-Inspired Optimum-Path Forest Classifier [68.8204255655161]
This paper proposes a Python-based Optimum-Path Forest framework, denoted as OPFython.
As OPFython is a Python-based library, it provides a friendlier environment and a faster prototyping workspace than the C language.
arXiv Detail & Related papers (2020-01-28T15:46:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.