Conscious Data Contribution via Community-Driven Chain-of-Thought Distillation
- URL: http://arxiv.org/abs/2512.18174v1
- Date: Sat, 20 Dec 2025 02:17:18 GMT
- Title: Conscious Data Contribution via Community-Driven Chain-of-Thought Distillation
- Authors: Lena Libon, Meghana Bhange, Rushabh Solanki, Elliot Creager, Ulrich Aïvodji
- Abstract summary: We consider questions of data portability and user autonomy in the context of LLMs that "reason". We show how communities who receive low utility from an available model can aggregate and distill their shared knowledge into an alternate model better aligned with their goals.
- Score: 4.275696286826178
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The current era of AI development places a heavy emphasis on training large models on increasingly scaled-up datasets. This paradigm has catalyzed entirely new product categories, such as LLM chatbots, while also raising concerns about data privacy and consumer choice. In this paper, we consider questions of data portability and user autonomy in the context of LLMs that "reason" using chain-of-thought (CoT) traces, computing intermediate text artifacts from user input before producing a final output. We first interpret recent data privacy and portability law to argue that these intermediate computations qualify as users' personal data. Then, building on the existing framework of Conscious Data Contribution, we show how communities who receive low utility from an available model can aggregate and distill their shared knowledge into an alternate model better aligned with their goals. We verify this approach empirically and investigate the effects of community diversity, reasoning granularity, and community size on distillation performance.
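The aggregation step described in the abstract can be made concrete with a small sketch. The following is an illustrative assumption about the data format, not the paper's actual pipeline: community members pool (prompt, CoT trace, answer) triples into a single supervised corpus, with each target formatted as a reasoning trace followed by the final answer, so a smaller "student" model can be fine-tuned to reason before answering. All names (`build_distillation_corpus`, the dict keys, the `### Answer:` delimiter) are hypothetical.

```python
# Hypothetical sketch: pooling community (prompt, CoT trace, answer) triples
# into a distillation corpus for fine-tuning a smaller "student" model.
# Names and data format are illustrative; the paper's pipeline may differ.

def build_distillation_corpus(contributions, min_trace_len=1):
    """Merge per-member contributions into one supervised corpus.

    Each contribution is a list of dicts with keys 'prompt',
    'cot' (intermediate reasoning text), and 'answer'.
    """
    corpus = []
    for member_data in contributions:
        for ex in member_data:
            if len(ex["cot"].split()) < min_trace_len:
                continue  # drop empty or trivial reasoning traces
            # Target = reasoning trace followed by the final answer,
            # so the student learns to "reason" before answering.
            corpus.append({
                "input": ex["prompt"],
                "target": ex["cot"] + "\n### Answer: " + ex["answer"],
            })
    return corpus

# Two community members; the second example is filtered out (empty trace).
alice = [{"prompt": "2+2?", "cot": "Add 2 and 2.", "answer": "4"}]
bob = [{"prompt": "Capital of France?", "cot": "", "answer": "Paris"}]
corpus = build_distillation_corpus([alice, bob])
```

The filtering threshold stands in for the paper's notion of reasoning granularity: raising `min_trace_len` keeps only examples with more elaborate traces.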
Related papers
- The LLM Data Auditor: A Metric-oriented Survey on Quality and Trustworthiness in Evaluating Synthetic Data [25.926467401802046]
Large Language Models (LLMs) have emerged as powerful tools for generating data across various modalities. We propose a framework for evaluating synthetic data from two dimensions: quality and trustworthiness.
arXiv Detail & Related papers (2026-01-25T06:40:25Z)
- Learning More with Less: A Generalizable, Self-Supervised Framework for Privacy-Preserving Capacity Estimation with EV Charging Data [84.37348569981307]
We propose a first-of-its-kind capacity estimation model based on self-supervised pre-training. Our model consistently outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2025-10-05T08:58:35Z)
- Amputation-imputation based generation of synthetic tabular data for ratemaking [0.0]
Actuarial ratemaking depends on high-quality data, yet access to such data is often limited by the cost of obtaining new data, privacy concerns, etc. In this paper, we explore synthetic-data generation as a potential solution to these issues. We present a comparative study on an open-source dataset, evaluating MICE-based models against other generative models such as Variational Autoencoders and Conditional Tabular Generative Adversarial Networks.
arXiv Detail & Related papers (2025-09-02T10:23:04Z)
- Non-IID data in Federated Learning: A Survey with Taxonomy, Metrics, Methods, Frameworks and Future Directions [2.9434966603161072]
Federated Learning (FL) enables users to collectively train ML models without sharing private data. FL struggles when data across clients is not independent and identically distributed (non-IID). This technical survey aims to fill that gap by providing a detailed taxonomy for non-IID data, partition protocols, and metrics.
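One partition protocol commonly covered in such taxonomies is label skew via a Dirichlet distribution: each class's samples are split across clients according to proportions drawn from Dirichlet(alpha), with smaller alpha producing more heterogeneous clients. The sketch below is our own illustration (not taken from the survey), using normalized Gamma draws to sample the Dirichlet proportions.

```python
import random

def dirichlet_label_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices across clients with Dirichlet(alpha) label skew."""
    rng = random.Random(seed)
    clients = [[] for _ in range(n_clients)]
    for c in sorted(set(labels)):
        idxs = [i for i, y in enumerate(labels) if y == c]
        rng.shuffle(idxs)
        # Dirichlet(alpha) proportions via normalized Gamma(alpha, 1) draws.
        w = [rng.gammavariate(alpha, 1.0) for _ in range(n_clients)]
        props = [x / sum(w) for x in w]
        # Cut this class's indices at the cumulative proportions, so every
        # sample lands on exactly one client.
        cuts, acc = [0], 0.0
        for p in props[:-1]:
            acc += p
            cuts.append(int(acc * len(idxs)))
        cuts.append(len(idxs))
        for k in range(n_clients):
            clients[k].extend(idxs[cuts[k]:cuts[k + 1]])
    return clients

# 20 samples, two classes; alpha=0.1 yields strongly skewed clients.
parts = dirichlet_label_partition([0] * 10 + [1] * 10, n_clients=3, alpha=0.1)
```

Because every index is assigned exactly once, the client lists always form a true partition of the dataset regardless of the draws.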
arXiv Detail & Related papers (2024-11-19T09:53:28Z)
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
- On the steerability of large language models toward data-driven personas [98.9138902560793]
Large language models (LLMs) are known to generate biased responses where the opinions of certain groups and populations are underrepresented.
Here, we present a novel approach to achieve controllable generation of specific viewpoints using LLMs.
arXiv Detail & Related papers (2023-11-08T19:01:13Z)
- Scaling Laws Do Not Scale [54.72120385955072]
Recent work has argued that as the size of a dataset increases, the performance of a model trained on that dataset will increase.
We argue that this scaling law relationship depends on metrics used to measure performance that may not correspond with how different groups of people perceive the quality of models' output.
Different communities may also have values in tension with each other, leading to difficult, potentially irreconcilable choices about metrics used for model evaluations.
arXiv Detail & Related papers (2023-07-05T15:32:21Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
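The ensemble idea can be illustrated with a toy sketch (our own illustration, not the paper's implementation): rather than trusting a single synthetic dataset, draw synthetic datasets from several independently fitted generators and average the downstream estimates, approximating an expectation over generator parameters. All names here are hypothetical.

```python
import random

def downstream_estimate(synthetic_sample):
    """Toy downstream task: estimate the population mean from synthetic data."""
    return sum(synthetic_sample) / len(synthetic_sample)

def dge_estimate(generators, n_samples=200, seed=0):
    """Average the downstream estimate over an ensemble of generators,
    approximating an expectation over generator parameters."""
    rng = random.Random(seed)
    estimates = []
    for gen in generators:
        synth = [gen(rng) for _ in range(n_samples)]
        estimates.append(downstream_estimate(synth))
    return sum(estimates) / len(estimates)

# Three "generators" whose fitted means differ slightly (0.9, 1.0, 1.1),
# standing in for independently trained deep generative models.
gens = [lambda r, m=m: r.gauss(m, 1.0) for m in (0.9, 1.0, 1.1)]
estimate = dge_estimate(gens)
```

Averaging over the ensemble dampens the error any single misfitted generator would pass to the downstream task.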
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z)
- DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
- Interprétabilité des modèles : état des lieux des méthodes et application à l'assurance (Model interpretability: a review of methods and an application to insurance) [1.6058099298620423]
Data is the raw material that allows many of today's models to improve the quality and performance of digital services.
Model users must ensure that models do not discriminate and that their results can be explained.
The widening range of predictive algorithms requires scientists to remain vigilant about how models are used.
arXiv Detail & Related papers (2020-07-25T12:18:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.