Personalization of Dataset Retrieval Results using a Metadata-based Data Valuation Method
- URL: http://arxiv.org/abs/2407.15546v1
- Date: Mon, 22 Jul 2024 11:13:07 GMT
- Title: Personalization of Dataset Retrieval Results using a Metadata-based Data Valuation Method
- Authors: Malick Ebiele, Malika Bendechache, Eamonn Clinton, Rob Brennan
- Abstract summary: We propose a novel data valuation method for a dataset retrieval use case in Ireland's national mapping agency.
By leveraging metadata and a user's preferences, we estimate the personal value of each dataset.
We validated the data value-based ranking against the stakeholders' ranking of the datasets.
- Score: 0.5999777817331317
- Abstract: In this paper, we propose a novel data valuation method for a Dataset Retrieval (DR) use case in Ireland's national mapping agency. To the best of our knowledge, data valuation has not yet been applied to dataset retrieval. By leveraging metadata and a user's preferences, we estimate the personal value of each dataset to facilitate dataset retrieval and filtering. We then validated the data value-based ranking against the stakeholders' ranking of the datasets. The proposed data valuation method and use case demonstrated that data valuation is promising for dataset retrieval. For instance, the best-performing dataset retrieval configuration based on our approach obtained an NDCG@5 (Normalized Discounted Cumulative Gain truncated at rank 5) of 0.8207. This study is unique in exploring a data valuation-based approach to dataset retrieval and, unlike most existing methods, our approach is validated against the stakeholders' ranking of the datasets.
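The abstract reports its result as NDCG@5. As a rough illustration of how that metric is usually computed, here is a minimal Python sketch. It assumes the common linear-gain form with a log2 rank discount; the abstract does not say which NDCG variant the authors used, and the relevance grades below are hypothetical, not taken from the paper.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical stakeholder relevance grades, listed in the order a system ranked the datasets.
system_order = [3, 2, 3, 0, 1, 2]
print(round(ndcg_at_k(system_order, 5), 4))
```

A score of 1.0 means the top five results appear in the same order as the ideal (stakeholder) ranking; lower values indicate that highly relevant datasets were placed further down the list.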
Related papers
- Private, Augmentation-Robust and Task-Agnostic Data Valuation Approach for Data Marketplace [56.78396861508909]
PriArTa is an approach for computing the distance between the distribution of the buyer's existing dataset and the seller's dataset.
PriArTa is communication-efficient, enabling the buyer to evaluate datasets without needing access to the entire dataset from each seller.
arXiv Detail & Related papers (2024-11-01T17:13:14Z)
- Data Distribution Valuation [56.71023681599737]
Existing data valuation methods define a value for a discrete dataset.
In many use cases, users are interested in not only the value of the dataset, but that of the distribution from which the dataset was sampled.
We propose a maximum mean discrepancy (MMD)-based valuation method which enables theoretically principled and actionable policies.
arXiv Detail & Related papers (2024-10-06T07:56:53Z)
- Truthful Dataset Valuation by Pointwise Mutual Information [28.63827288801458]
We propose a new data valuation method that provably guarantees the following: data providers always maximize their expected score by truthfully reporting their observed data.
Our method, following the paradigm of proper scoring rules, measures the pointwise mutual information (PMI) of the test dataset and the evaluated dataset.
arXiv Detail & Related papers (2024-05-28T15:04:17Z)
- Neural Dynamic Data Valuation [4.286118155737111]
We propose a novel data valuation method from the perspective of optimal control, named the neural dynamic data valuation (NDDV).
Our method has solid theoretical interpretations to accurately identify the data valuation via the sensitivity of the data optimal control state.
In addition, we implement a data re-weighting strategy to capture the unique features of data points, ensuring fairness through the interaction between data points and the mean-field states.
arXiv Detail & Related papers (2024-04-30T13:39:26Z)
- When is Off-Policy Evaluation (Reward Modeling) Useful in Contextual Bandits? A Data-Centric Perspective [64.73162159837956]
Evaluating the value of a hypothetical target policy with only a logged dataset is important but challenging.
We propose DataCOPE, a data-centric framework for evaluating a target policy given a dataset.
Our empirical analysis of DataCOPE in the logged contextual bandit settings using healthcare datasets confirms its ability to evaluate both machine-learning and human expert policies.
arXiv Detail & Related papers (2023-11-23T17:13:37Z)
- Sample Complexity of Preference-Based Nonparametric Off-Policy Evaluation with Deep Networks [58.469818546042696]
We study the sample efficiency of OPE with human preference and establish a statistical guarantee for it.
By appropriately selecting the size of a ReLU network, we show that one can leverage any low-dimensional manifold structure in the Markov decision process.
arXiv Detail & Related papers (2023-10-16T16:27:06Z)
- OpenDataVal: a Unified Benchmark for Data Valuation [38.15852021170501]
We introduce OpenDataVal, an easy-to-use and unified benchmark framework for data valuation.
OpenDataVal provides an integrated environment that includes eleven different state-of-the-art data valuation algorithms.
We perform benchmarking analysis using OpenDataVal, quantifying and comparing the efficacy of state-of-the-art data valuation approaches.
arXiv Detail & Related papers (2023-06-18T14:38:29Z)
- DataFinder: Scientific Dataset Recommendation from Natural Language Descriptions [100.52917027038369]
We operationalize the task of recommending datasets given a short natural language description.
To facilitate this task, we build the DataFinder dataset which consists of a larger automatically-constructed training set and a smaller expert-annotated evaluation set.
This system, trained on the DataFinder dataset, finds more relevant search results than existing third-party dataset search engines.
arXiv Detail & Related papers (2023-05-26T05:22:36Z)
- LiDAR dataset distillation within bayesian active learning framework: Understanding the effect of data augmentation [63.20765930558542]
Active learning (AL) has recently regained attention as a way to reduce annotation costs and dataset size.
This paper performs a principled evaluation of AL-based dataset distillation on one quarter (1/4th) of the large Semantic-KITTI dataset.
We observe that data augmentation achieves full dataset accuracy using only 60% of samples from the selected dataset configuration.
arXiv Detail & Related papers (2022-02-06T00:04:21Z)