Related papers: A Comprehensive Study of Shapley Value in Data Analytics

URL: http://arxiv.org/abs/2412.01460v5
Date: Sun, 06 Apr 2025 03:04:37 GMT
Title: A Comprehensive Study of Shapley Value in Data Analytics
Authors: Hong Lin, Shixin Wan, Zhongle Xie, Ke Chen, Meihui Zhang, Lidan Shou, Gang Chen
Abstract summary: This paper provides the first comprehensive study of Shapley value (SV) used throughout the data analytics (DA) workflow. We condense four primary challenges of using SV in DA, namely computation efficiency, approximation error, privacy preservation, and interpretability. We implement SVBench, a modular and open-sourced framework for developing SV applications in different DA tasks.
Score: 16.11540350411322
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Over the recent years, Shapley value (SV), a solution concept from cooperative game theory, has found numerous applications in data analytics (DA). This paper provides the first comprehensive study of SV used throughout the DA workflow, clarifying the key variables in defining DA-applicable SV and the essential functionalities that SV can provide for data scientists. We condense four primary challenges of using SV in DA, namely computation efficiency, approximation error, privacy preservation, and interpretability, then disentangle the resolution techniques from existing arts in this field, analyze and discuss the techniques w.r.t. each challenge and potential conflicts between challenges. We also implement SVBench, a modular and extensible open-sourced framework for developing SV applications in different DA tasks, and conduct extensive evaluations to validate our analyses and discussions. Based on the qualitative and quantitative results, we identify the limitations of current efforts for applying SV to DA and highlight the directions of future research and engineering.
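
For readers unfamiliar with the concept, the following is a minimal sketch of how a Shapley value can be computed and why computation efficiency and approximation error come up as challenges. It is not taken from the paper or from SVBench: the function names, the toy utility, and the player labels are all hypothetical. The exact version enumerates every coalition, so its cost grows exponentially with the number of players; the Monte Carlo version averages marginal contributions over random permutations, which is a common approximation strategy.

```python
import itertools
import math
import random


def exact_shapley(players, utility):
    """Exact Shapley values by enumerating every coalition (exponential cost)."""
    n = len(players)
    values = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
            for subset in itertools.combinations(others, k):
                s = frozenset(subset)
                values[p] += weight * (utility(s | {p}) - utility(s))
    return values


def monte_carlo_shapley(players, utility, num_permutations=2000, seed=0):
    """Approximate Shapley values by sampling random player orderings."""
    rng = random.Random(seed)
    values = {p: 0.0 for p in players}
    order = list(players)
    for _ in range(num_permutations):
        rng.shuffle(order)
        coalition = frozenset()
        previous = utility(coalition)
        for p in order:
            coalition = coalition | {p}
            current = utility(coalition)
            values[p] += current - previous  # marginal contribution of p
            previous = current
    return {p: v / num_permutations for p, v in values.items()}


def toy_utility(coalition):
    """Hypothetical utility: additive worth plus a bonus when 'a' and 'b' cooperate."""
    worth = {"a": 1.0, "b": 2.0, "c": 3.0}
    total = sum(worth[p] for p in coalition)
    if "a" in coalition and "b" in coalition:
        total += 2.0
    return total


if __name__ == "__main__":
    players = ["a", "b", "c"]
    print(exact_shapley(players, toy_utility))
    print(monte_carlo_shapley(players, toy_utility))
```

In a data analytics setting the players might be data points, features, or data owners, and the utility might be a model's validation score; both choices are assumptions of this sketch rather than details given in the abstract.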

Related papers

Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study [55.09905978813599]
Large Language Models (LLMs) hold promise in automating data analysis tasks. Yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs.
arXiv Detail & Related papers (2025-06-24T17:04:23Z)
Anomaly Detection and Generation with Diffusion Models: A Survey [51.61574868316922]
Anomaly detection (AD) plays a pivotal role across diverse domains, including cybersecurity, finance, healthcare, and industrial manufacturing. Recent advancements in deep learning, specifically diffusion models (DMs), have sparked significant interest. This survey aims to guide researchers and practitioners in leveraging DMs for innovative AD solutions across diverse applications.
arXiv Detail & Related papers (2025-06-11T03:29:18Z)
Relationship Detection on Tabular Data Using Statistical Analysis and Large Language Models [4.201987249923826]
This work experiments with a hybrid approach for detecting relationships using a Knowledge Graph (KG) as a reference point, a task known as CPA. This approach leverages large language models (LLMs) while employing statistical analysis to reduce the search space of potential KG relations. The experimental evaluation on two benchmark datasets provided by the SemTab challenge assesses the influence of each module and the effectiveness of different state-of-the-art LLMs.
arXiv Detail & Related papers (2025-06-04T12:11:05Z)
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey [124.23247710880008]
Multimodal CoT (MCoT) reasoning has recently garnered significant research attention. Existing MCoT studies design various methodologies to address the challenges of image, video, speech, audio, 3D, and structured data. We present the first systematic survey of MCoT reasoning, elucidating the relevant foundational concepts and definitions.
arXiv Detail & Related papers (2025-03-16T18:39:13Z)
Data Analysis in the Era of Generative AI [56.44807642944589]
This paper explores the potential of AI-powered tools to reshape data analysis, focusing on design considerations and challenges. We explore how the emergence of large language and multimodal models offers new opportunities to enhance various stages of data analysis workflow. We then examine human-centered design principles that facilitate intuitive interactions, build user trust, and streamline the AI-assisted analysis workflow across multiple apps.
arXiv Detail & Related papers (2024-09-27T06:31:03Z)
InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation [79.09622602860703]
We introduce InsightBench, a benchmark dataset with three key features. It consists of 100 datasets representing diverse business use cases such as finance and incident management. Unlike existing benchmarks focusing on answering single queries, InsightBench evaluates agents based on their ability to perform end-to-end data analytics.
arXiv Detail & Related papers (2024-07-08T22:06:09Z)
DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation [83.30006900263744]
Data analysis is a crucial analytical process to generate in-depth studies and conclusive insights. We propose to automatically generate high-quality answer annotations leveraging the code-generation capabilities of LLMs. Our DACO-RL algorithm is evaluated by human annotators to produce more helpful answers than the SFT model in 57.72% of cases.
arXiv Detail & Related papers (2024-03-04T22:47:58Z)
Key Design Choices in Source-Free Unsupervised Domain Adaptation: An In-depth Empirical Analysis [16.0130560365211]
This study provides a benchmark framework for Source-Free Unsupervised Domain Adaptation (SF-UDA) in image classification. The study empirically examines a diverse set of SF-UDA techniques, assessing their consistency across datasets. It exhaustively evaluates pre-training datasets and strategies, particularly focusing on both supervised and self-supervised methods.
arXiv Detail & Related papers (2024-02-25T13:37:36Z)
Text2Analysis: A Benchmark of Table Question Answering with Advanced Data Analysis and Unclear Queries [67.0083902913112]
We develop the Text2Analysis benchmark, incorporating advanced analysis tasks. We also develop five innovative and effective annotation methods. We evaluate five state-of-the-art models using three different metrics.
arXiv Detail & Related papers (2023-12-21T08:50:41Z)
Leveraging Large Language Model for Automatic Evolving of Industrial Data-Centric R&D Cycle [20.30730316993658]
Data-driven solutions are emerging as powerful tools to address multifarious industrial tasks. Although data-centric R&D has been pivotal in harnessing these solutions, it often comes with significant costs in terms of human, computational, and time resources. This paper delves into the potential of large language models (LLMs) to expedite the evolution cycle of data-centric R&D.
arXiv Detail & Related papers (2023-10-17T13:18:02Z)
Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework [51.44863255495668]
Multimodal reasoning is a critical component in the pursuit of artificial intelligence systems that exhibit human-like intelligence. We present the Multi-Modal Reasoning (COCO-MMR) dataset, a novel dataset that encompasses an extensive collection of open-ended questions. We propose innovative techniques, including multi-hop cross-modal attention and sentence-level contrastive learning, to enhance the image and text encoders.
arXiv Detail & Related papers (2023-07-24T08:58:25Z)
A Comprehensive Survey on Source-free Domain Adaptation [69.17622123344327]
The research of Source-Free Domain Adaptation (SFDA) has drawn growing attention in recent years. We provide a comprehensive survey of recent advances in SFDA and organize them into a unified categorization scheme. We compare the results of more than 30 representative SFDA methods on three popular classification benchmarks.
arXiv Detail & Related papers (2023-02-23T06:32:09Z)
GLUECons: A Generic Benchmark for Learning Under Constraints [102.78051169725455]
In this work, we create a benchmark that is a collection of nine tasks in the domains of natural language processing and computer vision. We model external knowledge as constraints, specify the sources of the constraints for each task, and implement various models that use these constraints.
arXiv Detail & Related papers (2023-02-16T16:45:36Z)
A Comprehensive Survey on Edge Data Integrity Verification: Fundamentals and Future Trends [43.174689394432804]
We show current research status, open problems, and potentially promising insights for readers to further investigate this under-explored field. To thoroughly assess prior research efforts, we synthesize a universal criteria framework that an effective verification approach should satisfy. We highlight intriguing research challenges and possible directions for future work, along with a discussion on how forthcoming technology, e.g., machine learning and context-aware security, can augment security in EC.
arXiv Detail & Related papers (2022-10-20T02:58:36Z)
A Survey on Data-driven Software Vulnerability Assessment and Prioritization [0.0]
Software Vulnerabilities (SVs) are increasing in complexity and scale, posing great security risks to many software systems. Data-driven techniques such as Machine Learning and Deep Learning have taken SV assessment and prioritization to the next level.
arXiv Detail & Related papers (2021-07-18T04:49:22Z)
Predictive analytics using Social Big Data and machine learning [6.142272540492935]
This chapter sheds light on core aspects that lay the foundations for social big data analytics. Various predictive analytical algorithms are introduced with their usage in several important applications and top-tier tools and APIs.
arXiv Detail & Related papers (2021-04-21T19:30:45Z)
Data and its (dis)contents: A survey of dataset development and use in machine learning research [11.042648980854487]
We survey the many concerns raised about the way we collect and use data in machine learning. We advocate that a more cautious and thorough understanding of data is necessary to address several of the practical and ethical issues of the field.
arXiv Detail & Related papers (2020-12-09T22:13:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.