Ontologies in CLARIAH: Towards Interoperability in History, Language and
Media
- URL: http://arxiv.org/abs/2004.02845v2
- Date: Fri, 31 Jul 2020 15:34:42 GMT
- Title: Ontologies in CLARIAH: Towards Interoperability in History, Language and
Media
- Authors: Albert Meroño-Peñuela, Victor de Boer, Marieke van Erp, Richard
Zijdeman, Rick Mourits, Willem Melder, Auke Rijpma, Ruben Schalk
- Abstract summary: One of the most important goals of digital humanities is to provide researchers with data and tools for new research questions.
The FAIR principles provide a framework, as they state that data need to be: Findable, as they are often scattered among various sources; Accessible, since some might be offline or behind paywalls; Interoperable, thus using standard knowledge representation formats and shared vocabularies; and Reusable, through adequate licensing and permissions.
We describe the tools developed and integrated in the Dutch national project CLARIAH to address these issues.
- Score: 0.05277024349608833
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the most important goals of digital humanities is to provide
researchers with data and tools for new research questions, either by
increasing the scale of scholarly studies, linking existing databases, or
improving the accessibility of data. Here, the FAIR principles provide a useful
framework, as they state that data need to be: Findable, as they are often
scattered among various sources; Accessible, since some might be offline or
behind paywalls; Interoperable, thus using standard knowledge representation
formats and shared vocabularies; and Reusable, through adequate licensing and
permissions. Integrating data from diverse humanities domains is not trivial:
the research questions of scholars (such as "was economic wealth equally
distributed in the 18th century?" or "what are narratives constructed around
disruptive media events?") and their preparation phases (e.g. data collection,
knowledge organisation, cleaning) need to be taken into account. In this chapter, we
describe the ontologies and tools developed and integrated in the Dutch
national project CLARIAH to address these issues across datasets from three
fundamental domains or "pillars" of the humanities (linguistics, social and
economic history, and media studies) that have paradigmatic data
representations (textual corpora, structured data, and multimedia). We
summarise the lessons learnt from using such ontologies and tools in these
domains from a generalisation and reusability perspective.