Leveraging Large Language Model for Automatic Evolving of Industrial
Data-Centric R&D Cycle
- URL: http://arxiv.org/abs/2310.11249v1
- Date: Tue, 17 Oct 2023 13:18:02 GMT
- Title: Leveraging Large Language Model for Automatic Evolving of Industrial
Data-Centric R&D Cycle
- Authors: Xu Yang, Xiao Yang, Weiqing Liu, Jinhui Li, Peng Yu, Zeqi Ye, Jiang
Bian
- Abstract summary: Data-driven solutions are emerging as powerful tools to address multifarious industrial tasks.
Although data-centric R&D has been pivotal in harnessing these solutions, it often comes with significant costs in terms of human, computational, and time resources.
This paper delves into the potential of large language models (LLMs) to expedite the evolution cycle of data-centric R&D.
- Score: 20.30730316993658
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the wake of relentless digital transformation, data-driven solutions are
emerging as powerful tools to address multifarious industrial tasks such as
forecasting, anomaly detection, planning, and even complex decision-making.
Although data-centric R&D has been pivotal in harnessing these solutions, it
often comes with significant costs in terms of human, computational, and time
resources. This paper delves into the potential of large language models (LLMs)
to expedite the evolution cycle of data-centric R&D. Assessing the foundational
elements of data-centric R&D, including heterogeneous task-related data,
multi-facet domain knowledge, and diverse computing-functional tools, we
explore how well LLMs can understand domain-specific requirements, generate
professional ideas, utilize domain-specific tools to conduct experiments,
interpret results, and incorporate knowledge from past endeavors to tackle new
challenges. We take quantitative investment research as a typical example of
an industrial data-centric R&D scenario, verify our proposed framework on our
full-stack open-source quantitative research platform Qlib, and obtain
promising results that shed light on our vision of an automatically evolving
industrial data-centric R&D cycle.
Related papers
- Data Analysis in the Era of Generative AI [56.44807642944589]
This paper explores the potential of AI-powered tools to reshape data analysis, focusing on design considerations and challenges.
We explore how the emergence of large language and multimodal models offers new opportunities to enhance various stages of data analysis workflow.
We then examine human-centered design principles that facilitate intuitive interactions, build user trust, and streamline the AI-assisted analysis workflow across multiple apps.
arXiv Detail & Related papers (2024-09-27T06:31:03Z)
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
- On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey [26.670507323784616]
Large Language Models (LLMs) offer a data-centric solution to alleviate the limitations of real-world data with synthetic data generation.
This paper provides an organization of relevant studies based on a generic workflow of synthetic data generation.
arXiv Detail & Related papers (2024-06-14T07:47:09Z)
- IPAD: Industrial Process Anomaly Detection Dataset [71.39058003212614]
Video anomaly detection (VAD) is a challenging task aiming to recognize anomalies in video frames.
We propose a new dataset, IPAD, specifically designed for VAD in industrial scenarios.
This dataset covers 16 different industrial devices and contains over 6 hours of both synthetic and real-world video footage.
arXiv Detail & Related papers (2024-04-23T13:38:01Z)
- AI Competitions and Benchmarks: Dataset Development [42.164845505628506]
This chapter provides a comprehensive overview of established methodological tools, enriched by our practical experience.
We develop the tasks involved in dataset development and offer insights into their effective management.
Then, we provide more details about the implementation process which includes data collection, transformation, and quality evaluation.
arXiv Detail & Related papers (2024-04-15T12:01:42Z)
- Integration of Domain Expert-Centric Ontology Design into the CRISP-DM for Cyber-Physical Production Systems [45.05372822216111]
Methods from Machine Learning (ML) and Data Mining (DM) have proven to be promising in extracting complex and hidden patterns from the data collected.
However, such data-driven projects, usually performed with the Cross-Industry Standard Process for Data Mining (CRISP-DM), often fail due to the disproportionate amount of time needed for understanding and preparing the data.
This contribution intends to present an integrated approach so that data scientists are able to more quickly and reliably gain insights into the CPPS challenges.
arXiv Detail & Related papers (2023-07-21T15:04:00Z)
- Exploring the Potential of AI-Generated Synthetic Datasets: A Case Study on Telematics Data with ChatGPT [0.0]
This research delves into the construction and utilization of synthetic datasets, specifically within the telematics sphere, leveraging OpenAI's powerful language model, ChatGPT.
To illustrate this data creation process, a hands-on case study is conducted, focusing on the generation of a synthetic telematics dataset.
arXiv Detail & Related papers (2023-06-23T15:15:13Z)
- Semantic Segmentation of Vegetation in Remote Sensing Imagery Using Deep Learning [77.34726150561087]
We propose an approach for creating a multi-modal and large-temporal dataset comprised of publicly available Remote Sensing data.
We use Convolutional Neural Networks (CNN) models that are capable of separating different classes of vegetation.
arXiv Detail & Related papers (2022-09-28T18:51:59Z)
- Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
- Data and its (dis)contents: A survey of dataset development and use in machine learning research [11.042648980854487]
We survey the many concerns raised about the way we collect and use data in machine learning.
We advocate that a more cautious and thorough understanding of data is necessary to address several of the practical and ethical issues of the field.
arXiv Detail & Related papers (2020-12-09T22:13:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.