Related papers: CDS: Data Synthesis Method Guided by Cognitive Diagnosis Theory

Related papers

An Uncertainty-Aware Dynamic Decision Framework for Progressive Multi-Omics Integration in Classification Tasks [6.736267874971369]
We propose an uncertainty-aware, multi-view dynamic decision framework for omics data classification. We employ a fusion strategy based on Dempster-Shafer theory to integrate heterogeneous modalities. In three datasets, over 50% of cases achieved accurate classification using a single omics modality.
arXiv Detail & Related papers (2025-06-20T13:44:14Z)
Synthetic Code Surgery: Repairing Bugs and Vulnerabilities with LLMs and Synthetic Data [0.0]
This paper presents a novel methodology for enhancing Automated Program Repair (APR) through synthetic data generation utilizing Large Language Models (LLMs). The proposed approach addresses this limitation through a two-phase process: synthetic sample generation followed by a rigorous quality assessment. Experimental evaluation on the VulRepair test set showed statistically significant improvements in Perfect Prediction rates.
arXiv Detail & Related papers (2025-05-12T09:14:20Z)
Prototype-Guided Diffusion for Digital Pathology: Achieving Foundation Model Performance with Minimal Clinical Data [6.318463500874778]
We propose a prototype-guided diffusion model to generate high-fidelity synthetic pathology data at scale. Our approach ensures biologically and diagnostically meaningful variations in the generated data. We demonstrate that self-supervised features trained on our synthetic dataset achieve competitive performance despite using 60x-760x less data than models trained on large real-world datasets.
arXiv Detail & Related papers (2025-04-15T21:17:39Z)
Scaling Laws of Synthetic Data for Language Models [132.67350443447611]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
arXiv Detail & Related papers (2025-03-25T11:07:12Z)
Leveraging Large Language Models to Address Data Scarcity in Machine Learning: Applications in Graphene Synthesis [0.0]
Machine learning in materials science faces challenges due to limited experimental data. We propose strategies that utilize large language models (LLMs) to enhance machine learning performance.
arXiv Detail & Related papers (2025-03-06T16:04:01Z)
Step-by-Step Guidance to Differential Anemia Diagnosis with Real-World Data and Deep Reinforcement Learning [1.5272023683653024]
Clinical diagnostic guidelines outline the key questions to answer to reach a diagnosis. We aim to develop a model that learns from electronic health records to determine the optimal sequence of actions for accurate diagnosis.
arXiv Detail & Related papers (2024-12-03T08:45:50Z)
Enhancing Few-Shot Learning with Integrated Data and GAN Model Approaches [35.431340001608476]
This paper presents an innovative approach to enhancing few-shot learning by integrating data augmentation with model fine-tuning. It aims to tackle the challenges posed by small-sample data in fields such as drug discovery, target recognition, and malicious traffic detection. Results confirm that the MhERGAN algorithm developed in this research is highly effective for few-shot learning.
arXiv Detail & Related papers (2024-11-25T16:51:11Z)
Advanced Persistent Threats (APT) Attribution Using Deep Reinforcement Learning [0.0]
The development of the DRL model for malware attribution involved extensive research, iterative coding, and numerous adjustments. The model struggled with low accuracy levels, but through persistent adjustments to its architecture and learning algorithms, accuracy improved dramatically. By the end of the training, the model consistently reached accuracy levels near 98 percent, demonstrating its strong capability to accurately recognise and attribute malware activities.
arXiv Detail & Related papers (2024-10-15T10:10:33Z)
Adversarial Learning for Neural PDE Solvers with Sparse Data [4.226449585713182]
This study introduces a universal learning strategy for neural network PDEs, named Systematic Model Augmentation for Robust Training. By focusing on challenging and improving the model's weaknesses, SMART reduces generalization error during training under data-scarce conditions.
arXiv Detail & Related papers (2024-09-04T04:18:25Z)
What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices [91.71951459594074]
Large language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios. Existing methods typically utilize the Self-Instruct framework to generate instruction tuning data for better long context capability improvement. We propose the Multi-agent Interactive Multi-hop Generation framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent. Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human
arXiv Detail & Related papers (2024-09-03T13:30:00Z)
Towards Effective and Efficient Continual Pre-training of Large Language Models [163.34610964970258]
Continual pre-training (CPT) has been an important approach for adapting language models to specific domains or tasks. This paper presents a technical report for continually pre-training Llama-3 (8B). It significantly enhances the Chinese language ability and scientific reasoning ability of the backbone model.
arXiv Detail & Related papers (2024-07-26T13:55:21Z)
Artificial Intelligence in Extracting Diagnostic Data from Dental Records [6.132077347366551]
This research addresses the issue of missing structured data in dental records by extracting diagnostic information from unstructured text. We use advanced AI and NLP methods, leveraging GPT-4 to generate synthetic notes for fine-tuning a RoBERTa model. We evaluated the model using 120 randomly selected clinical notes from two datasets, demonstrating its improved diagnostic extraction accuracy.
arXiv Detail & Related papers (2024-07-23T04:05:48Z)
Artificial Inductive Bias for Synthetic Tabular Data Generation in Data-Scarce Scenarios [8.062368743143388]
We propose a methodology that integrates artificial inductive biases into the generative process to improve data quality in low-data regimes. We evaluate four approaches (pre-training, model averaging, Model-Agnostic Meta-Learning (MAML), and Domain Search (DRS)) and analyze their impact on the quality of the generated text. Experimental results show that incorporating inductive bias substantially improves performance, with transfer learning methods outperforming meta-learning.
arXiv Detail & Related papers (2024-07-03T12:53:42Z)
Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs). Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws. Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
Unified Uncertainty Estimation for Cognitive Diagnosis Models [70.46998436898205]
We propose a unified uncertainty estimation approach for a wide range of cognitive diagnosis models. We decompose the uncertainty of diagnostic parameters into data aspect and model aspect. Our method is effective and can provide useful insights into the uncertainty of cognitive diagnosis.
arXiv Detail & Related papers (2024-03-09T13:48:20Z)
DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation [83.30006900263744]
Data analysis is a crucial analytical process to generate in-depth studies and conclusive insights. We propose to automatically generate high-quality answer annotations leveraging the code-generation capabilities of LLMs. Our DACO-RL algorithm is evaluated by human annotators to produce more helpful answers than the SFT model in 57.72% of cases.
arXiv Detail & Related papers (2024-03-04T22:47:58Z)
Less is more: Ensemble Learning for Retinal Disease Recognition Under Limited Resources [12.119196313470887]
This paper introduces a novel ensemble learning mechanism designed for recognizing retinal diseases under limited resources. The mechanism leverages insights from multiple pre-trained models, facilitating the transfer and adaptation of their knowledge to Retinal OCT images.
arXiv Detail & Related papers (2024-02-15T06:58:25Z)
An Explainable Deep Learning-Based Method For Schizophrenia Diagnosis Using Generative Data-Augmentation [0.3222802562733786]
We leverage a deep learning-based method for the automatic diagnosis of schizophrenia using EEG brain recordings. This approach utilizes generative data augmentation, a powerful technique that enhances the accuracy of the diagnosis.
arXiv Detail & Related papers (2023-10-25T12:55:16Z)
An Evaluation of Machine Learning Approaches for Early Diagnosis of Autism Spectrum Disorder [0.0]
Autistic Spectrum Disorder (ASD) is a neurological disease characterized by difficulties with social interaction, communication, and repetitive activities. This study employs diverse machine learning methods to identify crucial ASD traits, aiming to enhance and automate the diagnostic process.
arXiv Detail & Related papers (2023-09-20T21:23:37Z)
Robust Learning with Progressive Data Expansion Against Spurious Correlation [65.83104529677234]
We study the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features. Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process. We propose a new training algorithm called PDE that efficiently enhances the model's robustness for a better worst-group performance.
arXiv Detail & Related papers (2023-06-08T05:44:06Z)
GLUECons: A Generic Benchmark for Learning Under Constraints [102.78051169725455]
In this work, we create a benchmark that is a collection of nine tasks in the domains of natural language processing and computer vision. We model external knowledge as constraints, specify the sources of the constraints for each task, and implement various models that use these constraints.
arXiv Detail & Related papers (2023-02-16T16:45:36Z)
Latent Variable Representation for Reinforcement Learning [131.03944557979725]
It remains unclear theoretically and empirically how latent variable models may facilitate learning, planning, and exploration to improve the sample efficiency of model-based reinforcement learning. We provide a representation view of the latent variable models for state-action value functions, which allows both tractable variational learning algorithm and effective implementation of the optimism/pessimism principle. In particular, we propose a computationally efficient planning algorithm with UCB exploration by incorporating kernel embeddings of latent variable models.
arXiv Detail & Related papers (2022-12-17T00:26:31Z)
Discover, Explanation, Improvement: An Automatic Slice Detection Framework for Natural Language Processing [72.14557106085284]
Slice detection models (SDMs) automatically identify underperforming groups of datapoints. This paper proposes a benchmark named "Discover, Explain, Improve (DEIM)" for classification NLP tasks. Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
arXiv Detail & Related papers (2022-11-08T19:00:00Z)

This list is automatically generated from the titles and abstracts of the papers on this site.