A Comprehensive Study of Shapley Value in Data Analytics
- URL: http://arxiv.org/abs/2412.01460v3
- Date: Tue, 10 Dec 2024 13:18:55 GMT
- Title: A Comprehensive Study of Shapley Value in Data Analytics
- Authors: Hong Lin, Shixin Wan, Zhongle Xie, Ke Chen, Meihui Zhang, Lidan Shou, Gang Chen
- Abstract summary: This paper provides the first comprehensive study of Shapley value (SV) used throughout the data analytics (DA) workflow.
We summarize existing versatile forms of SV used in these steps by a unified definition and clarify the essential functionalities that SV can provide for data scientists.
We implement SVBench, the first open-sourced computation benchmark for developing SV applications, and conduct experiments on six DA tasks to validate our analysis and discussions.
- Score: 16.11540350411322
- Abstract: In recent years, Shapley value (SV), a solution concept from cooperative game theory, has found numerous applications in data analytics (DA). This paper provides the first comprehensive study of SV used throughout the DA workflow, which involves three main steps: data fabric, data exploration, and result reporting. We summarize existing versatile forms of SV used in these steps by a unified definition and clarify the essential functionalities that SV can provide for data scientists. We categorize the prior art in this field based on the technical challenges it tackles, which include computation efficiency, approximation error, privacy preservation, and appropriate interpretations. We discuss these challenges and analyze the corresponding solutions. We also implement SVBench, the first open-sourced benchmark for developing SV applications, and conduct experiments on six DA tasks to validate our analysis and discussions. Based on the qualitative and quantitative results, we identify the limitations of current efforts for applying SV to DA and highlight the directions of future research and engineering.
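For readers unfamiliar with the concept, the classical Shapley value from cooperative game theory can be illustrated on a tiny game. The sketch below is a generic textbook computation, not SVBench and not the paper's unified definition; the function names `shapley_values` and `toy_utility` and the toy game itself are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, utility):
    """Exact Shapley values by enumerating every coalition.

    Cost grows exponentially with the number of players, which is one
    reason computation efficiency is a challenge in practice.
    """
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of p when joining coalition s.
                total += weight * (utility(s | {p}) - utility(s))
        values[p] = total
    return values


def toy_utility(coalition):
    """Hypothetical utility: 'a' and 'b' only help together; 'c' adds 1 alone."""
    score = 0.0
    if {"a", "b"} <= coalition:
        score += 2.0
    if "c" in coalition:
        score += 1.0
    return score


phi = shapley_values(["a", "b", "c"], toy_utility)
print(phi)
```

In this toy game each player receives a value of 1, and the values sum to the utility of the full coalition, which is the efficiency property of the Shapley value.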
Related papers
- Empowering Large Language Models in Wireless Communication: A Novel Dataset and Fine-Tuning Framework [81.29965270493238]
We develop a specialized dataset aimed at enhancing the evaluation and fine-tuning of large language models (LLMs) for wireless communication applications.
The dataset includes a diverse set of multi-hop questions, including true/false and multiple-choice types, spanning varying difficulty levels from easy to hard.
We introduce a Pointwise V-Information (PVI) based fine-tuning method, providing a detailed theoretical analysis and justification for its use in quantifying the information content of training data.
arXiv Detail & Related papers (2025-01-16T16:19:53Z)
- Data Analysis in the Era of Generative AI [56.44807642944589]
This paper explores the potential of AI-powered tools to reshape data analysis, focusing on design considerations and challenges.
We explore how the emergence of large language and multimodal models offers new opportunities to enhance various stages of data analysis workflow.
We then examine human-centered design principles that facilitate intuitive interactions, build user trust, and streamline the AI-assisted analysis workflow across multiple apps.
arXiv Detail & Related papers (2024-09-27T06:31:03Z)
- InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation [79.09622602860703]
We introduce InsightBench, a benchmark dataset with three key features.
It consists of 100 datasets representing diverse business use cases such as finance and incident management.
Unlike existing benchmarks focusing on answering single queries, InsightBench evaluates agents based on their ability to perform end-to-end data analytics.
arXiv Detail & Related papers (2024-07-08T22:06:09Z)
- A Comprehensive Survey on Underwater Image Enhancement Based on Deep Learning [51.7818820745221]
Underwater image enhancement (UIE) presents a significant challenge within computer vision research.
Despite the development of numerous UIE algorithms, a thorough and systematic review is still absent.
arXiv Detail & Related papers (2024-05-30T04:46:40Z)
- DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation [83.30006900263744]
Data analysis is a crucial analytical process to generate in-depth studies and conclusive insights.
We propose to automatically generate high-quality answer annotations leveraging the code-generation capabilities of LLMs.
Our DACO-RL algorithm is judged by human annotators to produce more helpful answers than the SFT model in 57.72% of cases.
arXiv Detail & Related papers (2024-03-04T22:47:58Z)
- Text2Analysis: A Benchmark of Table Question Answering with Advanced Data Analysis and Unclear Queries [67.0083902913112]
We develop the Text2Analysis benchmark, incorporating advanced analysis tasks.
We also develop five innovative and effective annotation methods.
We evaluate five state-of-the-art models using three different metrics.
arXiv Detail & Related papers (2023-12-21T08:50:41Z)
- Leveraging Large Language Model for Automatic Evolving of Industrial Data-Centric R&D Cycle [20.30730316993658]
Data-driven solutions are emerging as powerful tools to address multifarious industrial tasks.
Although data-centric R&D has been pivotal in harnessing these solutions, it often comes with significant costs in terms of human, computational, and time resources.
This paper delves into the potential of large language models (LLMs) to expedite the evolution cycle of data-centric R&D.
arXiv Detail & Related papers (2023-10-17T13:18:02Z)
- A Comprehensive Survey on Edge Data Integrity Verification: Fundamentals and Future Trends [43.174689394432804]
We show current research status, open problems, and potentially promising insights for readers to further investigate this under-explored field.
To thoroughly assess prior research efforts, we synthesize a universal criteria framework that an effective verification approach should satisfy.
We highlight intriguing research challenges and possible directions for future work, along with a discussion on how forthcoming technology, e.g., machine learning and context-aware security, can augment security in EC.
arXiv Detail & Related papers (2022-10-20T02:58:36Z)
- A Survey on Data-driven Software Vulnerability Assessment and Prioritization [0.0]
Software Vulnerabilities (SVs) are increasing in complexity and scale, posing great security risks to many software systems.
Data-driven techniques such as Machine Learning and Deep Learning have taken SV assessment and prioritization to the next level.
arXiv Detail & Related papers (2021-07-18T04:49:22Z)
- Predictive analytics using Social Big Data and machine learning [6.142272540492935]
This chapter sheds light on core aspects that lay the foundations for social big data analytics.
Various predictive analytical algorithms are introduced, along with their usage in several important applications and top-tier tools and APIs.
arXiv Detail & Related papers (2021-04-21T19:30:45Z)
- Data and its (dis)contents: A survey of dataset development and use in machine learning research [11.042648980854487]
We survey the many concerns raised about the way we collect and use data in machine learning.
We advocate that a more cautious and thorough understanding of data is necessary to address several of the practical and ethical issues of the field.
arXiv Detail & Related papers (2020-12-09T22:13:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.