From Reflection to Repair: A Scoping Review of Dataset Documentation Tools
- URL: http://arxiv.org/abs/2602.15968v1
- Date: Tue, 17 Feb 2026 19:37:16 GMT
- Title: From Reflection to Repair: A Scoping Review of Dataset Documentation Tools
- Authors: Pedro Reynolds-Cuéllar, Marisol Wong-Villacres, Adriana Alvarado Garcia, Heila Precel,
- Abstract summary: We present a systematic review supported by mixed-methods analysis of 59 dataset documentation publications. Our analysis shows four persistent patterns in dataset documentation conceptualization that potentially impede adoption and standardization. Building on these findings, we propose a shift in Responsible AI tool design toward institutional rather than individual solutions.
- Score: 10.124271544484634
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Dataset documentation is widely recognized as essential for the responsible development of automated systems. Despite growing efforts to support documentation through different kinds of artifacts, little is known about the motivations shaping documentation tool design or the factors hindering their adoption. We present a systematic review supported by mixed-methods analysis of 59 dataset documentation publications to examine the motivations behind building documentation tools, how authors conceptualize documentation practices, and how these tools connect to existing systems, regulations, and cultural norms. Our analysis shows four persistent patterns in dataset documentation conceptualization that potentially impede adoption and standardization: unclear operationalizations of documentation's value, decontextualized designs, unaddressed labor demands, and a tendency to treat integration as future work. Building on these findings, we propose a shift in Responsible AI tool design toward institutional rather than individual solutions, and outline actions the HCI community can take to enable sustainable documentation practices.
Related papers
- Perspectives - Interactive Document Clustering in the Discourse Analysis Tool Suite [20.935269641413694]
Perspectives is a tool suite designed to empower Digital Humanities (DH) scholars to explore and organize large, unstructured document collections. Perspectives implements a flexible, aspect-focused document clustering pipeline with human-in-the-loop refinement capabilities.
(2026-02-17T12:44:05Z)
- ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images [19.490609860018804]
We introduce ExStrucTiny, a new benchmark dataset for structured Information Extraction (IE) from document images. Built through a novel pipeline combining manual and synthetic human-validated samples, ExStrucTiny covers more varied document types and extraction scenarios. We analyze open and closed Vision Language Models on this benchmark, highlighting challenges such as adaptation, query under-specification, and schema adaptation.
(2026-02-12T17:38:57Z)
- LongDA: Benchmarking LLM Agents for Long-Document Data Analysis [55.32211515932351]
LongDA targets real-world settings in which navigating long documentation and complex data is the primary bottleneck. LongTA is a tool-augmented agent framework that enables document access, retrieval, and code execution. Our experiments reveal substantial performance gaps even among state-of-the-art models.
(2026-01-05T23:23:16Z)
- DREAM: Document Reconstruction via End-to-end Autoregressive Model [53.51754520966657]
We present an innovative autoregressive model specifically designed for document reconstruction, referred to as Document Reconstruction via End-to-end Autoregressive Model (DREAM). We establish a standardized definition of the document reconstruction task, and introduce a novel Document Similarity Metric (DSM) and DocRec1K dataset for assessing the performance of the task.
(2025-07-08T09:24:07Z)
- From Exploration to Mastery: Enabling LLMs to Master Tools via Self-Driven Interactions [60.733557487886635]
This paper focuses on bridging the comprehension gap between Large Language Models and external tools. We propose a novel framework, DRAFT, aimed at Dynamically Refining tool documentation. This methodology pivots on an innovative trial-and-error approach, consisting of three distinct learning phases.
(2024-10-10T17:58:44Z)
- Data Efficient Training of a U-Net Based Architecture for Structured Documents Localization [0.0]
We propose SDL-Net: a novel U-Net like encoder-decoder architecture for the localization of structured documents. Our approach allows pre-training the encoder of SDL-Net on a generic dataset containing samples of various document classes.
(2023-10-02T07:05:19Z)
- Layout-Aware Information Extraction for Document-Grounded Dialogue: Dataset, Method and Demonstration [75.47708732473586]
We propose a layout-aware document-level Information Extraction dataset, LIE, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents. LIE contains 62k annotations of three extraction tasks from 4,061 pages in product and official documents. Empirical results show that layout is critical for VRD-based extraction, and system demonstration also verifies that the extracted knowledge can help locate the answers that users care about.
(2022-07-14T07:59:45Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding. UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input. An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
(2022-04-22T21:47:04Z)
- Focused Attention Improves Document-Grounded Generation [111.42360617630669]
Document grounded generation is the task of using the information provided in a document to improve text generation. This work focuses on two different document grounded generation tasks: Wikipedia Update Generation task and Dialogue response generation.
(2021-04-26T16:56:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.