Measuring Data Collection Diligence for Community Healthcare
- URL: http://arxiv.org/abs/2011.02962v5
- Date: Wed, 7 Apr 2021 15:16:09 GMT
- Title: Measuring Data Collection Diligence for Community Healthcare
- Authors: Ramesha Karunasena, Mohammad Sarparajul Ambiya, Arunesh Sinha, Ruchit Nagar, Saachi Dalal, Divy Thakkar, Dhyanesh Narayanan, Milind Tambe
- Abstract summary: Non-diligent data collection by community health workers (CHWs) is a significant challenge in developing countries.
In this work, we define and test a data collection diligence score.
Our framework has been validated on the ground using observations by the field monitors of our partner NGO in India.
- Score: 23.612133021992868
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data analytics has tremendous potential to provide targeted benefit in
low-resource communities; however, the availability of high-quality public
health data is a significant challenge in developing countries, primarily due to
non-diligent data collection by community health workers (CHWs). In this work,
we define and test a data collection diligence score. This challenging
unlabeled data problem is handled by building upon domain experts' guidance to
design a useful data representation of the raw data, using which we design a
simple and natural score. An important aspect of the score is relative scoring
of the CHWs, which implicitly takes into account the context of the local area.
The data is also clustered, and interpreting these clusters provides a natural
explanation of the past behavior of each data collector. We further predict the
diligence score for future time steps. Our framework has been validated on the
ground using observations by the field monitors of our partner NGO in India.
Beyond the successful field test, our work is in the final stages of deployment
in the state of Rajasthan, India.
Related papers
- Examining The CoVCues Dataset: Supporting COVID Infodemic Research Through A Novel User Assessment Study [0.0]
We have created a novel dataset called CoVCues that represents a varied set of image artifacts.
We have conducted a preliminary user assessment study to determine how effectively these dataset images contribute to the user perceived information reliability.
The findings from this study offer valuable feedback for refining our CoVCues dataset and for supporting our claim that visual cues are underutilized but useful in combating the COVID infodemic.
arXiv Detail & Related papers (2026-01-19T20:16:37Z)
- Causal-Aware Generative Adversarial Networks with Reinforcement Learning [17.222261383589732]
We introduce CA-GAN, a novel generative framework specifically engineered to address these challenges for real-world datasets.
Our method offers a practical, high-performance solution for data engineers seeking to create high-quality, privacy-compliant synthetic datasets.
arXiv Detail & Related papers (2025-10-28T04:02:49Z)
- Evaluating Facial Expression Recognition Datasets for Deep Learning: A Benchmark Study with Novel Similarity Metrics [4.137346786534721]
This study investigates the key characteristics and suitability of widely used Facial Expression Recognition (FER) datasets for training deep learning models.
We compiled and analyzed 24 FER datasets, including those targeting specific age groups such as children, adults, and the elderly.
Benchmark experiments using state-of-the-art neural networks reveal that large-scale, automatically collected datasets tend to generalize better.
arXiv Detail & Related papers (2025-03-26T11:01:00Z)
- Network Intrusion Datasets: A Survey, Limitations, and Recommendations [0.0]
Data-driven cyberthreat detection has become a crucial defense technique in modern cybersecurity.
Despite the importance of data, its scarcity has long been recognized as a major obstacle in NIDS research.
arXiv Detail & Related papers (2025-02-10T17:14:37Z)
- Weak-Annotation of HAR Datasets using Vision Foundation Models [9.948823510429902]
We propose a novel, clustering-based annotation pipeline to significantly reduce the amount of data that needs to be annotated by a human annotator.
We show that using our approach, the annotation of centroid clips suffices to achieve average labelling accuracies close to 90% across three publicly available HAR benchmark datasets.
arXiv Detail & Related papers (2024-08-09T16:46:53Z)
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
- Copycats: the many lives of a publicly available medical imaging dataset [12.98380178359767]
Medical Imaging (MI) datasets are fundamental to artificial intelligence in healthcare.
MI datasets used to be proprietary, but have become increasingly available to the public, including on community-contributed platforms (CCPs) like Kaggle or HuggingFace.
While open data is important to enhance the redistribution of data's public value, we find that the current CCP governance model fails to uphold the quality needed and recommended practices for sharing, documenting, and evaluating datasets.
arXiv Detail & Related papers (2024-02-09T12:01:22Z)
- Capture the Flag: Uncovering Data Insights with Large Language Models [90.47038584812925]
This study explores the potential of using Large Language Models (LLMs) to automate the discovery of insights in data.
We propose a new evaluation methodology based on a "capture the flag" principle, measuring the ability of such models to recognize meaningful and pertinent information (flags) in a dataset.
arXiv Detail & Related papers (2023-12-21T14:20:06Z)
- SDOH-NLI: a Dataset for Inferring Social Determinants of Health from Clinical Notes [13.991819517682574]
Social and behavioral determinants of health (SDOH) play a significant role in shaping health outcomes.
Progress on using NLP methods for this task has been hindered by the lack of high-quality publicly available labeled data.
This paper introduces a new dataset, SDOH-NLI, that is based on publicly available notes and which we release publicly.
arXiv Detail & Related papers (2023-10-27T19:09:30Z)
- Harnessing Administrative Data Inventories to Create a Reliable Transnational Reference Database for Crop Type Monitoring [0.0]
We showcase EuroCrops, a reference dataset for crop type classification that aggregates and harmonizes administrative data surveyed in different countries with the goal of transnational interoperability.
arXiv Detail & Related papers (2023-10-10T07:57:00Z)
- Computationally Assisted Quality Control for Public Health Data Streams [21.056027241048152]
FlaSH is a practical outlier detection framework for public health data users.
It uses simple, scalable models to capture statistical properties of public health streams.
It has been deployed on data streams used by public health stakeholders.
arXiv Detail & Related papers (2023-06-29T13:08:12Z)
- Does Synthetic Data Generation of LLMs Help Clinical Text Mining? [51.205078179427645]
We investigate the potential of OpenAI's ChatGPT to aid in clinical text mining.
We propose a new training paradigm that involves generating a vast quantity of high-quality synthetic data.
Our method has resulted in significant improvements in the performance of downstream tasks.
arXiv Detail & Related papers (2023-03-08T03:56:31Z)
- DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
- Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
- SustainBench: Benchmarks for Monitoring the Sustainable Development Goals with Machine Learning [63.192289553021816]
Progress toward the United Nations Sustainable Development Goals has been hindered by a lack of data on key environmental and socioeconomic indicators.
Recent advances in machine learning have made it possible to utilize abundant, frequently-updated, and globally available data, such as from satellites or social media.
In this paper, we introduce SustainBench, a collection of 15 benchmark tasks across 7 SDGs.
arXiv Detail & Related papers (2021-11-08T18:59:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.