Mill.jl and JsonGrinder.jl: automated differentiable feature extraction
for learning from raw JSON data
- URL: http://arxiv.org/abs/2105.09107v1
- Date: Wed, 19 May 2021 13:02:10 GMT
- Title: Mill.jl and JsonGrinder.jl: automated differentiable feature extraction
for learning from raw JSON data
- Authors: Simon Mandlik, Matej Racinsky, Viliam Lisy, Tomas Pevny
- Abstract summary: Learning from raw data input is one of the key components of successful applications of machine learning methods.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Learning from raw data input, thus limiting the need for manual feature
engineering, is one of the key components of many successful applications of
machine learning methods. While machine learning problems are often formulated
on data that naturally translate into a vector representation suitable for
classifiers, there are data sources, for example in cybersecurity, that are
naturally represented in diverse files with a unifying hierarchical structure,
such as XML, JSON, and Protocol Buffers. Converting this data to vector
(tensor) representation is generally done by manual feature engineering, which
is laborious, lossy, and prone to human bias about the importance of particular
features.
Mill and JsonGrinder are a tandem of libraries that fully automates this
conversion. Starting from an arbitrary set of JSON samples, they create a
differentiable machine learning model capable of inferring from further JSON
samples in their raw form.
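To make the pipeline concrete, below is a minimal Julia sketch of the workflow the two libraries document: infer a schema from raw JSON samples, derive an extractor from the schema, and reflect a differentiable Mill model over the extracted data. The file name and the layer width are illustrative assumptions; the calls (`schema`, `suggestextractor`, `catobs`, `reflectinmodel`) follow the libraries' public API, though exact signatures vary across versions.

```julia
using JSON, Flux, Mill, JsonGrinder

# Parse one JSON document per line (file name is an illustrative assumption).
samples = JSON.parse.(readlines("samples.jsonl"))

# Infer a schema covering the structure of all samples.
sch = schema(samples)

# Derive a feature extractor from the schema.
extractor = suggestextractor(sch)

# Convert raw JSON documents into hierarchical Mill data nodes
# and concatenate them into a single batch.
ds = reduce(catobs, extractor.(samples))

# Reflect a differentiable model over the structure of the data;
# the width 20 of the Dense layers is an arbitrary choice here.
model = reflectinmodel(ds, d -> Dense(d, 20, relu))

# Forward pass: a 20-dimensional embedding per sample (wrapped in an
# ArrayNode in some Mill versions), ready for a downstream classifier.
emb = model(ds)
```

The resulting embedding can be fed to any Flux classifier and the whole chain trained by gradient descent directly on raw JSON inputs, which is the automation the paper describes.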
Related papers
- MSdocTr-Lite: A Lite Transformer for Full Page Multi-script Handwriting
Recognition [3.0682439731292592]
We propose a lite transformer architecture for full-page multi-script handwriting recognition.
The proposed model comes with three advantages.
It can learn the reading order at page-level thanks to a curriculum learning strategy.
It can be easily adapted to other scripts by applying a simple transfer-learning process.
arXiv Detail & Related papers (2023-03-24T11:40:50Z)
- RT-1: Robotics Transformer for Real-World Control at Scale [98.09428483862165]
We present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties.
We verify our conclusions in a study of different model classes and their ability to generalize as a function of the data size, model size, and data diversity based on a large-scale data collection on real robots performing real-world tasks.
arXiv Detail & Related papers (2022-12-13T18:55:15Z)
- Privacy-Preserving Machine Learning for Collaborative Data Sharing via Auto-encoder Latent Space Embeddings [57.45332961252628]
Privacy-preserving machine learning in data-sharing processes is an ever-critical task.
This paper presents an innovative framework that uses Representation Learning via autoencoders to generate privacy-preserving embedded data.
arXiv Detail & Related papers (2022-11-10T17:36:58Z)
- Explaining Classifiers Trained on Raw Hierarchical Multiple-Instance Data [0.0]
A number of data sources naturally take the form of structured data interchange formats (e.g. security logs in JSON/XML format).
Existing methods, such as Hierarchical Multiple-Instance Learning (HMIL), allow learning from such data in their raw form.
By treating these models as subset selection problems, we demonstrate how interpretable explanations with favourable properties can be generated using computationally efficient algorithms.
We compare against an explanation technique adapted from graph neural networks, showing an order-of-magnitude speed-up and higher-quality explanations.
arXiv Detail & Related papers (2022-08-04T14:48:37Z)
- OmniXAI: A Library for Explainable AI [98.07381528393245]
We introduce OmniXAI, an open-source Python library for eXplainable AI (XAI).
It offers omni-way explainable AI capabilities and various interpretable machine learning techniques.
For practitioners, the library provides an easy-to-use unified interface to generate the explanations for their applications.
arXiv Detail & Related papers (2022-06-01T11:35:37Z)
- Pushing the Limits of Simple Pipelines for Few-Shot Learning: External Data and Fine-Tuning Make a Difference [74.80730361332711]
Few-shot learning is an important and topical problem in computer vision.
We show that a simple transformer-based pipeline yields surprisingly good performance on standard benchmarks.
arXiv Detail & Related papers (2022-04-15T02:55:58Z)
- Sentence Bottleneck Autoencoders from Transformer Language Models [53.350633961266375]
We build a sentence-level autoencoder from a pretrained, frozen transformer language model.
We adapt the masked language modeling objective as a generative, denoising one, while only training a sentence bottleneck and a single-layer modified transformer decoder.
We demonstrate that the sentence representations discovered by our model achieve better quality than previous methods that extract representations from pretrained transformers on text similarity tasks, style transfer, and single-sentence classification tasks in the GLUE benchmark, while using fewer parameters than large pretrained models.
arXiv Detail & Related papers (2021-08-31T19:39:55Z)
- tf.data: A Machine Learning Data Processing Framework [0.4588028371034406]
Training machine learning models requires feeding input data for models to ingest.
We present tf.data, a framework for building and executing efficient input pipelines for machine learning jobs.
We demonstrate that input pipeline performance is critical to the end-to-end training time of state-of-the-art machine learning models.
arXiv Detail & Related papers (2021-01-28T17:16:46Z)
- Scaling Systematic Literature Reviews with Machine Learning Pipelines [57.82662094602138]
Systematic reviews entail the extraction of data from scientific documents.
We construct a pipeline that automates each of these aspects, and experiment with many human-time vs. system quality trade-offs.
We find that the whole pipeline achieves surprising accuracy and generalisability with only two weeks of human-expert annotation.
arXiv Detail & Related papers (2020-10-09T16:19:42Z)
- An analysis on the use of autoencoders for representation learning: fundamentals, learning task case studies, explainability and challenges [11.329636084818778]
In many machine learning tasks, learning a good representation of the data can be the key to building a well-performing solution.
We present a series of learning tasks: data embedding for visualization, image denoising, semantic hashing, detection of abnormal behaviors and instance generation.
A solution is proposed for each task employing autoencoders as the only learning method.
arXiv Detail & Related papers (2020-05-21T08:41:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.