Related papers: Intelligent Spark Agents: A Modular LangGraph Framework for Scalable, Visualized, and Enhanced Big Data Machine Learning Workflows

URL: http://arxiv.org/abs/2412.01490v4
Date: Fri, 06 Dec 2024 13:21:40 GMT
Title: Intelligent Spark Agents: A Modular LangGraph Framework for Scalable, Visualized, and Enhanced Big Data Machine Learning Workflows
Authors: Jialin Wang, Zhihua Duan
Abstract summary: The LangGraph framework is designed to enhance machine learning workflows through scalability, visualization, and intelligent process optimization. At its core, the framework introduces Agent AI, a pivotal innovation that leverages Spark's distributed computing capabilities. The framework also incorporates large language models through the LangChain ecosystem, enhancing interaction with unstructured data.
Score: 1.4582633500696451
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper presents a Spark-based modular LangGraph framework, designed to enhance machine learning workflows through scalability, visualization, and intelligent process optimization. At its core, the framework introduces Agent AI, a pivotal innovation that leverages Spark's distributed computing capabilities and integrates with LangGraph for workflow orchestration. Agent AI facilitates the automation of data preprocessing, feature engineering, and model evaluation while dynamically interacting with data through Spark SQL and DataFrame agents. Through LangGraph's graph-structured workflows, the agents execute complex tasks, adapt to new inputs, and provide real-time feedback, ensuring seamless decision-making and execution in distributed environments. This system simplifies machine learning processes by allowing users to visually design workflows, which are then converted into Spark-compatible code for high-performance execution. The framework also incorporates large language models through the LangChain ecosystem, enhancing interaction with unstructured data and enabling advanced data analysis. Experimental evaluations demonstrate significant improvements in process efficiency and scalability, as well as accurate data-driven decision-making in diverse application scenarios. This paper emphasizes the integration of Spark with intelligent agents and graph-based workflows to redefine the development and execution of machine learning tasks in big data environments, paving the way for scalable and user-friendly AI solutions.

Related papers

El Agente Gráfico: Structured Execution Graphs for Scientific Agents [7.47895130442454]
We present El Agente Gráfico, a single-agent framework that embeds large language model (LLM)-driven decision-making within a type-safe execution environment. Central to our approach is a structured abstraction of scientific concepts and an object-graph mapper that represents computational state as typed Python objects. We evaluate the system by developing an automated benchmarking framework across a suite of university-level quantum chemistry tasks.
arXiv Detail & Related papers (2026-02-19T23:47:05Z)
Context-Aware Visual Prompting: Automating Geospatial Web Dashboards with Large Language Models and Agent Self-Validation for Decision Support [1.506501956463029]
The development of web-based dashboards for risk analysis and decision making is often challenged by the difficulty of handling big, multidimensional data. We introduce a generative AI framework that automates the creation of interactive geospatial dashboards from user-defined inputs.
arXiv Detail & Related papers (2025-10-10T10:58:15Z)
Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments [70.42705564227548]
We propose an automated environment construction pipeline for large language models (LLMs). This enables the creation of high-quality training environments that provide detailed and measurable feedback without relying on external tools. We also introduce a verifiable reward mechanism that evaluates both the precision of tool use and the completeness of task execution.
arXiv Detail & Related papers (2025-08-12T09:45:19Z)
Polymath: A Self-Optimizing Agent with Dynamic Hierarchical Workflow [6.636150750052998]
Large language models (LLMs) excel at solving complex tasks by executing agentic workflows composed of detailed instructions and structured operations. Many researchers have sought to automate the generation and optimization of these workflows through code-based representations. Existing methods often rely on labeled datasets to train and optimize, making them ineffective and inflexible for solving real-world, dynamic problems.
arXiv Detail & Related papers (2025-08-04T23:50:02Z)
Provenance Tracking in Large-Scale Machine Learning Systems [0.0]
y4ML is a tool designed to collect data in a format compliant with the W3C PROV and ProvProvML standards. y4ML is fully integrated with the yProv framework, allowing for higher level pairing in tasks run also through workflow management systems.
arXiv Detail & Related papers (2025-07-01T14:10:02Z)
LAM SIMULATOR: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback [121.78866929908871]
Large Action Models (LAMs) for AI Agents offer incredible potential but face challenges due to the need for high-quality training data. We present LAM SIMULATOR, a comprehensive framework designed for online exploration of agentic tasks with high-quality feedback. Our framework features a dynamic task query generator, an extensive collection of tools, and an interactive environment where Large Language Model (LLM) Agents can call tools and receive real-time feedback.
arXiv Detail & Related papers (2025-06-02T22:36:02Z)
Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models [64.28420991770382]
Data-Juicer 2.0 is a data processing system backed by data processing operators spanning text, image, video, and audio modalities. It supports more critical tasks including data analysis, annotation, and foundation model post-training. It has been widely adopted in diverse research fields and real-world products such as Alibaba Cloud PAI.
arXiv Detail & Related papers (2024-12-23T08:29:57Z)
Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining [67.87810796668981]
Using Information-Sensitive Cropping (ISC) and Self-Refining Dual Learning (SRDL), Iris achieves state-of-the-art performance across multiple benchmarks with only 850K GUI annotations. These improvements translate to significant gains in both web and OS agent downstream tasks.
arXiv Detail & Related papers (2024-12-13T18:40:10Z)
Research on the Application of Spark Streaming Real-Time Data Analysis System and large language model Intelligent Agents [1.4582633500696451]
This study explores the integration of Agent AI with LangGraph to enhance real-time data analysis systems in big data environments. The proposed framework overcomes limitations of static, inefficient stateful computations, and lack of human intervention. The system architecture incorporates Apache Spark Streaming, Kafka, and LangGraph to create a high-performance sentiment analysis system.
arXiv Detail & Related papers (2024-12-10T05:51:11Z)
Flex: End-to-End Text-Instructed Visual Navigation with Foundation Models [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies. Our findings are synthesized in Flex (Fly-lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors. We demonstrate the effectiveness of this approach on quadrotor fly-to-target tasks, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z)
Deep Fast Machine Learning Utils: A Python Library for Streamlined Machine Learning Prototyping [0.0]
The Deep Fast Machine Learning Utils (DFMLU) library provides tools designed to automate and enhance aspects of machine learning processes. DFMLU offers functionalities that support model development and data handling. This manuscript presents an overview of DFMLU's functionalities, providing Python examples for each tool.
arXiv Detail & Related papers (2024-09-14T21:39:17Z)
ToolACE: Winning the Points of LLM Function Calling [139.07157814653638]
ToolACE is an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data. We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard.
arXiv Detail & Related papers (2024-09-02T03:19:56Z)
Towards an Integrated Performance Framework for Fire Science and Management Workflows [0.0]
This paper presents an artificial intelligence and machine learning (AI/ML) approach to performance assessment and optimization. An associated early AI/ML framework spanning performance data collection, prediction and optimization is applied to wildfire science applications.
arXiv Detail & Related papers (2024-07-30T22:37:25Z)
AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning [93.96463520716759]
Large language model (LLM) agents have demonstrated impressive capabilities in utilizing external tools and knowledge to boost accuracy and reduce hallucinations. Here, we introduce AvaTaR, a novel and automated framework that optimizes an LLM agent to effectively leverage provided tools, improving performance on a given task.
arXiv Detail & Related papers (2024-06-17T04:20:02Z)
CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets [75.64181719386497]
We present CRAFT, a tool creation and retrieval framework for large language models (LLMs). It creates toolsets specifically curated for the tasks and equips LLMs with a component that retrieves tools from these sets to enhance their capability to solve complex tasks. Our method is designed to be flexible and offers a plug-and-play approach to adapt off-the-shelf LLMs to unseen domains and modalities, without any finetuning.
arXiv Detail & Related papers (2023-09-29T17:40:26Z)
A Unified Active Learning Framework for Annotating Graph Data with Application to Software Source Code Performance Prediction [4.572330678291241]
We develop a unified active learning framework specializing in software performance prediction. We investigate the impact of using different levels of information for active and passive learning. Our approach aims to improve the investment in AI models for different software performance predictions.
arXiv Detail & Related papers (2023-04-06T14:00:48Z)
Nemo: Guiding and Contextualizing Weak Supervision for Interactive Data Programming [77.38174112525168]
We present Nemo, an end-to-end interactive weak supervision (WS) system that improves the overall productivity of the WS learning pipeline by an average of 20% (and up to 47% in one task) compared to the prevailing WS approach.
arXiv Detail & Related papers (2022-03-02T19:57:32Z)
Fine-Tuning Data Structures for Analytical Query Processing [0.5156484100374058]
We introduce a framework for automatically choosing data structures to support efficient computation of analytical workloads. We introduce a novel low-level intermediate language that can express the algorithms behind various query processing paradigms. We show that the performance of the code generated by our framework either outperforms or is on par with the state-of-the-art analytical query engines.
arXiv Detail & Related papers (2021-12-24T16:36:35Z)
SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines. This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
Automated Evolutionary Approach for the Design of Composite Machine Learning Pipelines [48.7576911714538]
The proposed approach aims to automate the design of composite machine learning pipelines. It designs the pipelines with a customizable graph-based structure, analyzes the obtained results, and reproduces them. The software implementation of this approach is presented as an open-source framework.
arXiv Detail & Related papers (2021-06-26T23:19:06Z)
AutoGL: A Library for Automated Graph Learning [67.63587865669372]
We present Automated Graph Learning (AutoGL), the first dedicated library for automated machine learning on graphs. AutoGL is open-source, easy to use, and flexible to be extended. We also present AutoGL-light, a lightweight version of AutoGL to facilitate customizing pipelines and enriching applications.
arXiv Detail & Related papers (2021-04-11T10:49:23Z)
Learning Discrete Energy-based Models via Auxiliary-variable Local Exploration [130.89746032163106]
We propose ALOE, a new algorithm for learning conditional and unconditional EBMs for discrete structured data. We show that the energy function and sampler can be trained efficiently via a new variational form of power iteration. We present an energy model guided fuzzer for software testing that achieves comparable performance to well engineered fuzzing engines like libfuzzer.
arXiv Detail & Related papers (2020-11-10T19:31:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.