Understanding and Improving Data Repurposing
- URL: http://arxiv.org/abs/2506.09073v1
- Date: Mon, 09 Jun 2025 18:33:47 GMT
- Title: Understanding and Improving Data Repurposing
- Authors: J. Parsons, R. Lukyanenko, B. Greenwood, C. Cooper
- Abstract summary: We live in an age of unprecedented opportunities to use existing data for tasks not anticipated when those data were collected, resulting in widespread data repurposing. We explain how repurposing differs from original data use and data reuse and then develop a framework for data repurposing consisting of concepts and activities for adapting existing data to new tasks. We conclude by suggesting opportunities for research to better understand data repurposing and enable more effective data repurposing practices.
- Score: 0.5892638927736115
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We live in an age of unprecedented opportunities to use existing data for tasks not anticipated when those data were collected, resulting in widespread data repurposing. This commentary defines and maps the scope of data repurposing to highlight its importance for organizations and society and the need to study data repurposing as a frontier of data management. We explain how repurposing differs from original data use and data reuse and then develop a framework for data repurposing consisting of concepts and activities for adapting existing data to new tasks. The framework and its implications are illustrated using two examples of repurposing, one in healthcare and one in citizen science. We conclude by suggesting opportunities for research to better understand data repurposing and enable more effective data repurposing practices.
Related papers
- Machine Learning Data Practices through a Data Curation Lens: An Evaluation Framework [1.5993707490601146]
We evaluate data practices in machine learning as data curation practices.
We find that researchers in machine learning, which often emphasizes model development, struggle to apply standard data curation principles.
arXiv Detail & Related papers (2024-05-04T16:21:05Z)
- Retrieval and Distill: A Temporal Data Shift-Free Paradigm for Online Recommendation System [31.594407236146186]
Current recommendation systems are significantly affected by temporal data shift.
Most existing models focus on utilizing updated data, overlooking the transferable, temporal data shift-free information that can be learned from shifting data.
We propose a retrieval-based recommendation system framework that can train a data shift-free relevance network using shifting data.
arXiv Detail & Related papers (2024-04-24T06:16:09Z)
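As a rough sketch of the retrieval idea in the "Retrieval and Distill" entry above: score a candidate by looking up its nearest historical interactions. The feature construction, neighbour count, and label-averaging scheme are assumptions for illustration, and the distillation of this retrieval signal into a parametric network is omitted.

```python
# Nearest-neighbour relevance scoring over historical interactions
# (illustrative stand-in; not the paper's actual architecture).
import numpy as np

rng = np.random.default_rng(0)

# Historical interactions: feature vectors with click labels,
# possibly drawn from a temporally shifted distribution.
hist_X = rng.normal(size=(1000, 16))
hist_y = rng.integers(0, 2, size=1000)

def retrieve_and_score(query, k=20):
    """Estimate relevance as the mean label of the k nearest historical examples."""
    dists = np.linalg.norm(hist_X - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return hist_y[nearest].mean()

print(retrieve_and_score(rng.normal(size=16)))  # relevance estimate in [0, 1]
```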
- From Data Creator to Data Reuser: Distance Matters [0.847136673632881]
Open science policies focus more heavily on data sharing than on reuse. Both are complex, labor-intensive, expensive, and require infrastructure investments by multiple stakeholders. The value of data reuse lies in the relationships between creators and reusers.
arXiv Detail & Related papers (2024-02-05T18:16:04Z)
- Enhancing Consistency and Mitigating Bias: A Data Replay Approach for Incremental Learning [93.90047628101155]
Deep learning systems are prone to catastrophic forgetting when learning from a sequence of tasks. To address this, some methods propose replaying data from previous tasks during new-task learning. However, storing and replaying such data is often impractical due to memory constraints and data privacy issues.
arXiv Detail & Related papers (2024-01-12T12:51:12Z)
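For context on the replay idea in the entry above, here is a minimal sketch of a generic exemplar replay buffer using reservoir sampling. The capacity, sampling policy, and mixing scheme are illustrative assumptions, not the paper's method (which specifically targets the memory and privacy limitations noted).

```python
# Generic exemplar replay buffer for incremental learning (illustrative sketch).
import random

class ReplayBuffer:
    def __init__(self, capacity=200):
        self.capacity = capacity
        self.buffer = []   # stored (x, y) exemplars from earlier tasks
        self.seen = 0      # total examples offered so far

    def add(self, example):
        """Reservoir sampling keeps a uniform sample under a fixed memory budget."""
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(example)
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.buffer[j] = example

    def sample(self, n):
        return random.sample(self.buffer, min(n, len(self.buffer)))

# During task t, each training batch is mixed with replayed exemplars:
# batch = new_task_batch + buffer.sample(len(new_task_batch))
```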
- Adaptive Learning for Service Monitoring Data [0.0]
This study develops an adaptive classification approach using Learn++ that can handle evolving data distributions.
We employ consecutive data chunks obtained from an industrial application to evaluate the performance of the predictors incrementally.
arXiv Detail & Related papers (2022-08-25T18:06:45Z)
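A simplified sketch of chunk-wise ensemble learning in the spirit of the Learn++ approach named above: one base learner per incoming data chunk, combined by voting. Real Learn++ uses error-weighted majority voting and instance reweighting, and the decision-tree base learner here is an arbitrary choice.

```python
# Chunk-wise ensemble in the spirit of Learn++ (uniform voting for brevity).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class ChunkEnsemble:
    def __init__(self):
        self.learners = []

    def partial_fit(self, X_chunk, y_chunk):
        # Each incoming chunk trains one new base learner.
        self.learners.append(
            DecisionTreeClassifier(max_depth=5).fit(X_chunk, y_chunk))

    def predict(self, X):
        votes = np.stack([clf.predict(X) for clf in self.learners])
        # Majority vote across all learners trained so far.
        return np.apply_along_axis(
            lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
```

Scoring each chunk with the current ensemble before training on it (prequential evaluation) mirrors the incremental evaluation protocol the summary describes.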
- A Data-Based Perspective on Transfer Learning [76.30206800557411]
We take a closer look at the role of the source dataset's composition in transfer learning.
Our framework gives rise to new capabilities such as pinpointing transfer learning brittleness.
arXiv Detail & Related papers (2022-07-12T17:58:28Z)
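The composition question in the entry above can be probed, very roughly, with counterfactual subset removal: drop one slice of the source data at a time, retrain, and record the change in target performance. The data, model, and slicing below are all stand-ins, not the paper's framework.

```python
# Counterfactual probe of source-dataset composition (illustrative stand-in).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2400, n_informative=8, random_state=0)
Xt, yt = X[:400] + rng.normal(scale=0.5, size=(400, 20)), y[:400]  # "target"
Xs, ys = X[400:], y[400:]                                          # "source"

base = LogisticRegression(max_iter=500).fit(Xs, ys).score(Xt, yt)
for i, sl in enumerate(np.array_split(rng.permutation(len(ys)), 8)):
    keep = np.setdiff1d(np.arange(len(ys)), sl)
    acc = LogisticRegression(max_iter=500).fit(Xs[keep], ys[keep]).score(Xt, yt)
    print(f"without slice {i}: target accuracy change {acc - base:+.4f}")
```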
- Time and the Value of Data [0.3010893618491329]
Managers often believe that collecting more data will continually improve the accuracy of their machine learning models.
We argue that when data lose relevance over time, it may be optimal to collect a limited amount of recent data instead of keeping around an infinite supply of older (less relevant) data.
arXiv Detail & Related papers (2022-03-17T06:53:46Z)
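A toy simulation of the claim in the entry above, with an arbitrarily chosen random-walk drift and window size: when the ground truth drifts, fitting only a recent window tracks the current relationship better than fitting the full history.

```python
# Toy illustration: under drift, recent data can beat the full history.
import numpy as np

rng = np.random.default_rng(0)
T = 2000
slope = np.cumsum(rng.normal(scale=0.02, size=T)) + 1.0  # slowly drifting truth
x = rng.normal(size=T)
y = slope * x + rng.normal(scale=0.1, size=T)

def fit_and_test(window):
    xs, ys = x[-window:], y[-window:]
    b = (xs @ ys) / (xs @ xs)   # least-squares slope on the window
    return abs(b - slope[-1])   # error against the current true slope

print("all history :", fit_and_test(T))
print("recent 200  :", fit_and_test(200))
```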
- When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data [84.87772675171412]
We study the circumstances under which explanations of individual data points can improve modeling performance.
We make use of three existing datasets with explanations: e-SNLI, TACRED, and SemEval.
arXiv Detail & Related papers (2021-02-03T18:57:08Z)
- Relation-Guided Representation Learning [53.60351496449232]
We propose a new representation learning method that explicitly models and leverages sample relations.
Our framework faithfully preserves the relations between samples.
By seeking to embed samples into a subspace, we show that our method can address the large-scale and out-of-sample problems.
arXiv Detail & Related papers (2020-07-11T10:57:45Z)
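As a generic stand-in for "preserving relations between samples" (not the formulation of the paper above), one can factor a pairwise-similarity matrix so that inner products of the learned codes reproduce it; the Gaussian kernel and embedding dimension below are arbitrary choices.

```python
# Relation-preserving embedding via low-rank factorization of a similarity
# matrix K, so that Z @ Z.T approximates K (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
K = np.exp(-0.1 * np.square(np.linalg.norm(X[:, None] - X[None, :], axis=-1)))

# Best rank-d approximation of K via its top-d eigenpairs.
d = 8
w, V = np.linalg.eigh(K)                      # eigenvalues in ascending order
Z = V[:, -d:] * np.sqrt(np.maximum(w[-d:], 0))

print("relation reconstruction error:",
      np.linalg.norm(K - Z @ Z.T) / np.linalg.norm(K))
```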
- Provably Efficient Causal Reinforcement Learning with Confounded Observational Data [135.64775986546505]
We study how to incorporate the dataset (observational data) collected offline, which is often abundantly available in practice, to improve the sample efficiency in the online setting.
We propose the deconfounded optimistic value iteration (DOVI) algorithm, which incorporates the confounded observational data in a provably efficient manner.
arXiv Detail & Related papers (2020-06-22T14:49:33Z)
- DeGAN: Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and the lack of relevant data for the future learning tasks of a trained network.
We use the available data, which may be an imbalanced subset of the original training dataset or a related-domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)