Leveraging Machine Learning to Detect Data Curation Activities
- URL: http://arxiv.org/abs/2105.00030v1
- Date: Fri, 30 Apr 2021 18:17:18 GMT
- Title: Leveraging Machine Learning to Detect Data Curation Activities
- Authors: Sara Lafia, Andrea Thomer, David Bleckley, Dharma Akmon, Libby Hemphill
- Abstract summary: This paper describes a machine learning approach for annotating and analyzing data curation work logs at ICPSR.
Repository staff use systems to organize, prioritize, and document curation work done on datasets.
A key challenge is classifying similar activities so that they can be measured and associated with impact metrics.
- Score: 1.9949261242626626
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper describes a machine learning approach for annotating and analyzing
data curation work logs at ICPSR, a large social sciences data archive. The
systems we studied track curation work and coordinate team decision-making at
ICPSR. Repository staff use these systems to organize, prioritize, and document
curation work done on datasets, making them promising resources for studying
curation work and its impact on data reuse, especially in combination with data
usage analytics. A key challenge, however, is classifying similar activities so
that they can be measured and associated with impact metrics. This paper
contributes: 1) a schema of data curation activities; 2) a computational model
for identifying curation actions in work log descriptions; and 3) an analysis
of frequent data curation activities at ICPSR over time. We first propose a
schema of data curation actions to help us analyze the impact of curation work.
We then use this schema to annotate a set of data curation logs, which contain
records of data transformations and project management decisions completed by
repository staff. Finally, we train a text classifier to detect the frequency
of curation actions in a large set of work logs. Our approach supports the
analysis of curation work documented in work log systems as an important step
toward studying the relationship between research data curation and data reuse.
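The pipeline the abstract outlines (label work log text with actions from a schema, train a text classifier, then count predicted actions over time) can be sketched as follows. This is a minimal illustration only: the abstract does not name the classifier used, so a small multinomial Naive Bayes model with add-one smoothing is swapped in here, and the action labels (`transform`, `communicate`, `review`), the training sentences, and the log entries are all hypothetical, not taken from the paper's schema or from ICPSR data.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labels and sentences, invented for illustration only;
# they are NOT the paper's schema or real ICPSR work log entries.
TRAIN = [
    ("recoded missing values in variable Q12", "transform"),
    ("converted SPSS file to Stata format", "transform"),
    ("renamed variables to match codebook", "transform"),
    ("emailed depositor about missing codebook", "communicate"),
    ("asked PI to confirm variable labels", "communicate"),
    ("sent question to depositor about consent form", "communicate"),
    ("checked file for direct identifiers", "review"),
    ("reviewed documentation for disclosure risk", "review"),
    ("ran quality check on frequencies", "review"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, examples):
        self.label_counts = Counter(label for _, label in examples)
        self.word_counts = defaultdict(Counter)
        for text, label in examples:
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            score = math.log(n / total)  # log prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in tokenize(text):
                if w in self.vocab:  # words never seen in training are skipped
                    score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def action_frequency(model, logs):
    """Count predicted actions per year for (year, text) log entries."""
    counts = defaultdict(Counter)
    for year, text in logs:
        counts[year][model.predict(text)] += 1
    return counts


model = NaiveBayes().fit(TRAIN)

# Hypothetical unlabeled work log entries as (year, description) pairs.
LOGS = [
    (2019, "recoded variables in wave 2 file"),
    (2019, "emailed PI about codebook"),
    (2020, "checked documentation for identifiers"),
]
freq = action_frequency(model, LOGS)
for year in sorted(freq):
    print(year, dict(freq[year]))
```

A real study would need a much larger annotated set, a held-out evaluation split, and likely a stronger model; the per-year counts are the part that corresponds to the frequency analysis the abstract mentions.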
Related papers
- In-depth analysis of recall initiators of medical devices with a Machine Learning-Natural language Processing workflow [3.392104905453323]
This study identified, assessed and analysed the medical device recall initiators according to the public medical device recall database from 2018 to 2024.
The results suggest that the unsupervised Density-Based Spatial Clustering of Applications with Noise clustering algorithm can present each single recall initiator in a specific manner.
arXiv Detail & Related papers (2024-06-14T12:38:49Z)
- AVIS: Autonomous Visual Information Seeking with Large Language Model Agent [123.75169211547149]
We propose an autonomous information seeking visual question answering framework, AVIS.
Our method leverages a Large Language Model (LLM) to dynamically strategize the utilization of external tools.
AVIS achieves state-of-the-art results on knowledge-intensive visual question answering benchmarks such as Infoseek and OK-VQA.
arXiv Detail & Related papers (2023-06-13T20:50:22Z) - A Matter of Annotation: An Empirical Study on In Situ and Self-Recall Activity Annotations from Wearable Sensors [56.554277096170246]
We present an empirical study that evaluates and contrasts four commonly employed annotation methods in user studies focused on in-the-wild data collection.
For both the user-driven, in situ annotations, where participants annotate their activities during the actual recording process, and the recall methods, where participants retrospectively annotate their data at the end of each day, the participants had the flexibility to select their own set of activity classes and corresponding labels.
arXiv Detail & Related papers (2023-05-15T16:02:56Z) - Development and validation of a natural language processing algorithm to
pseudonymize documents in the context of a clinical data warehouse [53.797797404164946]
The study highlights the difficulties faced in sharing tools and resources in this domain.
We annotated a corpus of clinical documents according to 12 types of identifying entities.
We build a hybrid system, merging the results of a deep learning model as well as manual rules.
arXiv Detail & Related papers (2023-03-23T17:17:46Z) - Is More Data All You Need? A Causal Exploration [4.756600446882457]
Causal analysis is often used in medicine and economics to gain insights about the effects of actions and policies.
In this paper we explore the effect of dataset interventions on the output of image classification models.
arXiv Detail & Related papers (2022-06-06T08:02:54Z) - Transfer Learning in Conversational Analysis through Reusing
Preprocessing Data as Supervisors [52.37504333689262]
Using noisy labels in single-task learning increases the risk of over-fitting.
Auxiliary tasks could improve the performance of the primary task learning during the same training.
arXiv Detail & Related papers (2021-12-02T08:40:42Z) - CLIP: A Dataset for Extracting Action Items for Physicians from Hospital
Discharge Notes [17.107315598110183]
We create a dataset of clinical action items annotated over MIMIC-III, the largest publicly available dataset of real clinical notes.
This dataset, which we call CLIP, is annotated by physicians and covers documents representing 100K sentences.
We describe the task of extracting the action items from these documents as multi-aspect extractive summarization, with each aspect representing a type of action to be taken.
arXiv Detail & Related papers (2021-06-04T14:49:02Z) - Causal Inference for Time series Analysis: Problems, Methods and
Evaluation [11.925605453634638]
Time series data is a collection of chronological observations which is generated by several domains such as medical and financial fields.
We focus on two causal inference tasks, i.e., treatment effect estimation and causal discovery for time series data.
arXiv Detail & Related papers (2021-02-11T03:26:11Z) - Parrot: Data-Driven Behavioral Priors for Reinforcement Learning [79.32403825036792]
We propose a method for pre-training behavioral priors that can capture complex input-output relationships observed in successful trials.
We show how this learned prior can be used for rapidly learning new tasks without impeding the RL agent's ability to try out novel behaviors.
arXiv Detail & Related papers (2020-11-19T18:47:40Z) - ODVICE: An Ontology-Driven Visual Analytic Tool for Interactive Cohort
Extraction [2.0131681387862153]
For uncommon diseases, cohorts extracted from EHRs contain a very limited number of records.
We present ODVICE, a data augmentation framework that systematically augments records using a novel ontologically guided Monte-Carlo graph spanning algorithm.
Our results demonstrate the predictive performance of ODVICE augmented cohorts, showing 30% improvement in area under the curve (AUC) over the non-augmented dataset.
arXiv Detail & Related papers (2020-05-13T17:15:51Z) - A Review of Computational Approaches for Evaluation of Rehabilitation
Exercises [58.720142291102135]
This paper reviews computational approaches for evaluating patient performance in rehabilitation programs using motion capture systems.
The reviewed computational methods for exercise evaluation are grouped into three main categories: discrete movement score, rule-based, and template-based approaches.
arXiv Detail & Related papers (2020-02-29T22:18:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.