Analytical Engines With Context-Rich Processing: Towards Efficient
Next-Generation Analytics
- URL: http://arxiv.org/abs/2212.07517v1
- Date: Wed, 14 Dec 2022 21:46:33 GMT
- Authors: Viktor Sanca, Anastasia Ailamaki
- Abstract summary: We envision an analytical engine co-optimized with components that enable context-rich analysis.
We aim for a holistic pipeline cost- and rule-based optimization across relational and model-based operators.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As modern data pipelines continue to collect, produce, and store a
variety of data formats, extracting and combining value from traditional and
context-rich sources such as strings, text, video, audio, and logs becomes a
manual process, since such formats are unsuitable for an RDBMS. To tap into
this dark data, domain experts analyze the sources, extract insights, and
integrate them into the data repositories. This process can involve
out-of-DBMS, ad-hoc analysis and processing, resulting in ETL overhead,
engineering effort, and suboptimal performance.
While AI systems based on ML models can automate the analysis process, their
answers are often themselves context-rich. Using multiple sources of truth,
either for training the models or in the form of knowledge bases, further
exacerbates the problem of consolidating the data of interest.
We envision an analytical engine co-optimized with components that enable
context-rich analysis. Firstly, since data arriving from different sources or
produced by model answers cannot be cleaned ahead of time, we propose online
data integration via model-assisted similarity operations. Secondly, we aim for
a holistic pipeline cost- and rule-based optimization across relational and
model-based operators. Thirdly, with increasingly heterogeneous hardware and
equally heterogeneous workloads ranging from traditional relational analytics
to generative model inference, we envision a system that adapts just-in-time
to complex analytical query requirements. To solve increasingly complex
analytical problems, ML offers attractive solutions, but these must be combined
with traditional analytical processing and benefit from decades of database
community research to make scalability and performance effortless for the
end user.
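The first component of this vision, online data integration via model-assisted similarity operations, can be sketched as an embedding-based similarity join. The `embed()` encoder below is a hypothetical stand-in (hashed character bigrams), not the paper's design; a real system would call a learned embedding model at this point:

```python
import math

def embed(text: str) -> list[float]:
    # Hypothetical stand-in encoder: hash character bigrams into a small
    # dense vector. A real system would invoke a learned embedding model.
    vec = [0.0] * 16
    low = text.lower()
    for a, b in zip(low, low[1:]):
        vec[(ord(a) * 31 + ord(b)) % 16] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(u: list[float], v: list[float]) -> float:
    # Vectors are unit-normalized, so the dot product is the cosine.
    return sum(a * b for a, b in zip(u, v))

def similarity_join(left, right, threshold=0.8):
    """Match rows across two dirty sources without ahead-of-time cleaning."""
    right_vecs = [(r, embed(r)) for r in right]
    for l in left:
        lv = embed(l)
        for r, rv in right_vecs:
            if cosine(lv, rv) >= threshold:
                yield (l, r)

matches = list(similarity_join(["apple inc."], ["Apple Inc."]))
# → [('apple inc.', 'Apple Inc.')]
```

Because the join key is a learned similarity rather than exact equality, the operator tolerates the formatting drift that would otherwise require a manual cleaning step before loading into the engine.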
Related papers
- Matchmaker: Self-Improving Large Language Model Programs for Schema Matching [60.23571456538149]
We propose a compositional language model program for schema matching, composed of candidate generation, refinement, and confidence scoring.
Matchmaker self-improves in a zero-shot manner without the need for labeled demonstrations.
Empirically, we demonstrate on real-world medical schema matching benchmarks that Matchmaker outperforms previous ML-based approaches.
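The three stages named above could be wired together roughly as below. The token-overlap heuristics are illustrative assumptions standing in for the LLM calls Matchmaker actually makes; function names here are not from the paper:

```python
def jaccard(a: str, b: str) -> float:
    # Token overlap between column names, used as a stand-in score.
    ta, tb = set(a.lower().split("_")), set(b.lower().split("_"))
    return len(ta & tb) / max(len(ta | tb), 1)

def generate_candidates(source_col, target_cols, k=3):
    # Candidate generation: shortlist the k most plausible target columns.
    return sorted(target_cols, key=lambda t: jaccard(source_col, t),
                  reverse=True)[:k]

def refine(source_col, candidates):
    # Refinement: an LLM would prune implausible candidates here; as a
    # simplifying assumption we keep candidates sharing at least one token.
    src = set(source_col.lower().split("_"))
    kept = [c for c in candidates if src & set(c.lower().split("_"))]
    return kept or candidates

def match(source_col, target_cols):
    # Confidence scoring: pick the refined candidate with the best score.
    cands = refine(source_col, generate_candidates(source_col, target_cols))
    return max(cands, key=lambda c: jaccard(source_col, c))

best = match("patient_birth_date",
             ["date_of_birth", "admission_date", "patient_id"])
# → 'date_of_birth'
```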
arXiv Detail & Related papers (2024-10-31T16:34:03Z)
- LAMBDA: A Large Model Based Data Agent [7.240586338370509]
We introduce LArge Model Based Data Agent (LAMBDA), a novel open-source, code-free multi-agent data analysis system.
LAMBDA is designed to address data analysis challenges in complex data-driven applications.
It has the potential to enhance data analysis paradigms by seamlessly integrating human and artificial intelligence.
arXiv Detail & Related papers (2024-07-24T06:26:36Z)
- Towards Next-Generation Urban Decision Support Systems through AI-Powered Construction of Scientific Ontology using Large Language Models -- A Case in Optimizing Intermodal Freight Transportation [1.6230958216521798]
This study investigates the potential of leveraging pre-trained Large Language Models (LLMs).
By adopting the ChatGPT API as the reasoning core, we outline an integrated workflow that encompasses natural language processing, methontology-based prompt tuning, and transformers.
The outcomes of our methodology are knowledge graphs in widely adopted ontology languages (e.g., OWL, RDF, SPARQL).
arXiv Detail & Related papers (2024-05-29T16:40:31Z)
- DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation [83.30006900263744]
Data analysis is a crucial analytical process to generate in-depth studies and conclusive insights.
We propose to automatically generate high-quality answer annotations leveraging the code-generation capabilities of LLMs.
Our DACO-RL algorithm is judged by human annotators to produce more helpful answers than the SFT model in 57.72% of cases.
arXiv Detail & Related papers (2024-03-04T22:47:58Z)
- Can Large Language Models Serve as Data Analysts? A Multi-Agent Assisted Approach for Qualitative Data Analysis [6.592797748561459]
Large Language Models (LLMs) have enabled collaborative human-bot interactions in Software Engineering (SE).
We introduce a new dimension of scalability and accuracy in qualitative research, potentially transforming data interpretation methodologies in SE.
arXiv Detail & Related papers (2024-02-02T13:10:46Z)
- End-to-End Meta-Bayesian Optimisation with Transformer Neural Processes [52.818579746354665]
This paper proposes the first end-to-end differentiable meta-BO framework that generalises neural processes to learn acquisition functions via transformer architectures.
We enable this end-to-end framework with reinforcement learning (RL) to tackle the lack of labelled acquisition data.
arXiv Detail & Related papers (2023-05-25T10:58:46Z)
- HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
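Iterative column-wise imputation of this kind can be sketched as follows. The fixed univariate linear model is a simplifying assumption; HyperImpute's point is that the per-column model is selected automatically rather than fixed:

```python
def impute_column(rows, target, predictor):
    """Fit y ≈ a*x + b on complete rows, then fill missing target values."""
    obs = [(r[predictor], r[target]) for r in rows
           if r[predictor] is not None and r[target] is not None]
    n = len(obs)
    if n == 0:
        return rows
    mx = sum(x for x, _ in obs) / n
    my = sum(y for _, y in obs) / n
    var = sum((x - mx) ** 2 for x, _ in obs)
    a = sum((x - mx) * (y - my) for x, y in obs) / var if var else 0.0
    b = my - a * mx
    for r in rows:
        if r[target] is None and r[predictor] is not None:
            r[target] = a * r[predictor] + b
    return rows

def iterative_impute(rows, columns, rounds=3):
    # Round-robin over columns: each column is imputed from its neighbor.
    # A HyperImpute-style system would instead choose a model per column
    # automatically; this fixed linear form is an assumption for brevity.
    for _ in range(rounds):
        for i, col in enumerate(columns):
            impute_column(rows, col, columns[i - 1])
    return rows

data = [{"x": 1.0, "y": 2.0}, {"x": 2.0, "y": 4.0}, {"x": 3.0, "y": None}]
iterative_impute(data, ["x", "y"])
# the missing y is filled by the fitted line y = 2x, giving 6.0
```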
arXiv Detail & Related papers (2022-06-15T19:10:35Z)
- Distributed intelligence on the Edge-to-Cloud Continuum: A systematic literature review [62.997667081978825]
This review aims at providing a comprehensive vision of the main state-of-the-art libraries and frameworks for machine learning and data analytics available today.
The main simulation, emulation, deployment systems, and testbeds for experimental research on the Edge-to-Cloud Continuum available today are also surveyed.
arXiv Detail & Related papers (2022-04-29T08:06:05Z)
- Production Machine Learning Pipelines: Empirical Analysis and Optimization Opportunities [5.510431861706128]
We analyze provenance graphs of 3000 production ML pipelines at Google, comprising over 450,000 models trained, spanning a period of over four months.
Our analysis reveals the characteristics, components, and topologies of typical industry-strength ML pipelines at various granularities.
We identify several rich opportunities for optimization, leveraging traditional data management ideas.
arXiv Detail & Related papers (2021-03-30T00:46:29Z)
- You Only Compress Once: Optimal Data Compression for Estimating Linear Models [1.2845031126178592]
Many engineering systems that use linear models achieve computational efficiency through distributed systems and expert configuration.
Conditionally sufficient statistics provide a unified data compression and estimation strategy.
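For ordinary least squares, the sufficient statistics amount to the Gram matrix X'X and moment vector X'y, which compress arbitrarily many rows into a fixed-size summary that distributed shards can merge by addition. A minimal sketch for simple linear regression, with illustrative function names:

```python
def compress(rows):
    """Reduce (x, y) observations to OLS sufficient statistics with
    intercept, i.e. the entries of X'X and X'y where each row is (1, x)."""
    s = {"n": 0, "sx": 0.0, "sxx": 0.0, "sy": 0.0, "sxy": 0.0}
    for x, y in rows:
        s["n"] += 1
        s["sx"] += x
        s["sxx"] += x * x
        s["sy"] += y
        s["sxy"] += x * y
    return s

def merge(a, b):
    # Statistics from distributed shards add component-wise.
    return {k: a[k] + b[k] for k in a}

def estimate(s):
    # Solve the 2x2 normal equations (X'X) beta = X'y in closed form.
    det = s["n"] * s["sxx"] - s["sx"] ** 2
    slope = (s["n"] * s["sxy"] - s["sx"] * s["sy"]) / det
    intercept = (s["sy"] - slope * s["sx"]) / s["n"]
    return intercept, slope

shard1 = compress([(1.0, 3.0), (2.0, 5.0)])
shard2 = compress([(3.0, 7.0), (4.0, 9.0)])
intercept, slope = estimate(merge(shard1, shard2))
# the data lie exactly on y = 2x + 1, so intercept = 1.0, slope = 2.0
```

The estimate is identical to fitting on the pooled raw data, which is why the compression only needs to happen once per shard regardless of how the estimation is later configured.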
arXiv Detail & Related papers (2021-02-22T19:00:18Z)
- Model-Based Deep Learning [155.063817656602]
Signal processing, communications, and control have traditionally relied on classical statistical modeling techniques.
Deep neural networks (DNNs) use generic architectures which learn to operate from data, and demonstrate excellent performance.
We are interested in hybrid techniques that combine principled mathematical models with data-driven systems to benefit from the advantages of both approaches.
arXiv Detail & Related papers (2020-12-15T16:29:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.