Data Combination for Problem-solving: A Case of an Open Data Exchange
Platform
- URL: http://arxiv.org/abs/2012.11746v1
- Date: Mon, 21 Dec 2020 23:29:10 GMT
- Authors: Teruaki Hayashi, Hiroki Sakaji, Hiroyasu Matsushima, Yoshiaki Fukami,
Takumi Shimizu, and Yukio Ohsawa
- Abstract summary: In big data and interdisciplinary data combinations, large-scale data with many variables are expected to be used.
The results indicate that even datasets with only a few variables are frequently used to propose solutions for problem solving.
The findings of this study shed light on mechanisms behind data combination for problem-solving involving multiple datasets and variables.
- Score: 2.9038508461575976
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In recent years, rather than enclosing data within a single organization,
exchanging and combining data from different domains has become an emerging
practice. Many studies have discussed the economic and utility value of data
and data exchange, but the characteristics of data that contribute to problem
solving through data combination have not been fully understood. In big data
and interdisciplinary data combinations, large-scale data with many variables
are expected to be used, and value is expected to be created by combining data
as much as possible. In this study, we conduct three experiments to investigate
the characteristics of data, focusing on the relationships between data
combinations and variables in each dataset, using empirical data shared by the
local government. The results indicate that even datasets with only a few
variables are frequently used to propose solutions for problem solving.
Moreover, we found that even if the datasets in the solution do not have common
variables, there are some well-established solutions to the problems. The
findings of this study shed light on mechanisms behind data combination for
problem-solving involving multiple datasets and variables.
Related papers
- UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction [93.77809355002591]
We introduce UniTraj, a comprehensive framework that unifies various datasets, models, and evaluation criteria.
We conduct extensive experiments and find that model performance significantly drops when transferred to other datasets.
We provide insights into dataset characteristics to explain these findings.
arXiv Detail & Related papers (2024-03-22T10:36:50Z)
- The Dataset Multiplicity Problem: How Unreliable Data Impacts Predictions [12.00314910031517]
We introduce dataset multiplicity, a way to study how inaccuracies, uncertainty, and social bias in training datasets impact test-time predictions.
We discuss how to use this framework to encapsulate various sources of uncertainty in datasets' factualness.
Our empirical analysis shows that real-world datasets, under reasonable assumptions, contain many test samples whose predictions are affected by dataset multiplicity.
arXiv Detail & Related papers (2023-04-20T21:31:15Z)
- Neural Network Architecture for Database Augmentation Using Shared Features [0.0]
Inherent challenges in some domains such as medicine make it difficult to create large single source datasets or multi-source datasets with identical features.
We propose a neural network architecture that can provide data augmentation using features common between these datasets.
arXiv Detail & Related papers (2023-02-02T19:17:06Z)
- A Comprehensive Survey of Dataset Distillation [73.15482472726555]
It has become challenging to handle the unlimited growth of data with limited computing power.
Deep learning technology has developed unprecedentedly in the last decade.
This paper provides a holistic understanding of dataset distillation from multiple aspects.
arXiv Detail & Related papers (2023-01-13T15:11:38Z)
- GenSyn: A Multi-stage Framework for Generating Synthetic Microdata using Macro Data Sources [21.32471030724983]
Individual-level data (microdata) that characterizes a population is essential for studying many real-world problems.
In this study, we examine synthetic data generation as a tool to extrapolate difficult-to-obtain high-resolution data.
arXiv Detail & Related papers (2022-12-08T01:22:12Z)
- A Survey of Dataset Refinement for Problems in Computer Vision Datasets [11.45536223418548]
Large-scale datasets have played a crucial role in the advancement of computer vision.
They often suffer from problems such as class imbalance, noisy labels, dataset bias, or high resource costs.
Various data-centric solutions have been proposed to solve the dataset problems.
They improve the quality of datasets by re-organizing them, which we call dataset refinement.
arXiv Detail & Related papers (2022-10-21T03:58:43Z)
- Combining datasets to increase the number of samples and improve model fitting [7.4771091238795595]
We propose a novel framework called Combine datasets based on Imputation (ComImp).
In addition, we propose a variant of ComImp, PCA-ComImp, that uses Principal Component Analysis (PCA) to reduce dimensionality before combining datasets.
Our results indicate that the proposed methods are somewhat similar to transfer learning in that the merge can significantly improve the accuracy of a prediction model on smaller datasets.
arXiv Detail & Related papers (2022-10-11T06:06:37Z)
- Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how bringing recent results on equivariant representation learning instantiated on structured spaces together with simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z)
- Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z)
- Multi-Source Causal Inference Using Control Variates [81.57072928775509]
We propose a general algorithm to estimate causal effects from multiple data sources.
We show theoretically that this reduces the variance of the ATE estimate.
We apply this framework to inference from observational data under an outcome selection bias.
arXiv Detail & Related papers (2021-03-30T21:20:51Z)
- DAIL: Dataset-Aware and Invariant Learning for Face Recognition [67.4903809903022]
To achieve good performance in face recognition, a large scale training dataset is usually required.
It is problematic and troublesome to naively combine different datasets due to two major issues.
Naively treating the same person as different classes in different datasets during training will affect back-propagation.
Manually cleaning labels may take formidable human effort, especially when there are millions of images and thousands of identities.
arXiv Detail & Related papers (2021-01-14T01:59:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.