Related papers: GiesKaNe: Bridging Past and Present in Grammatical Theory and Practical Application

URL: http://arxiv.org/abs/2502.05113v1
Date: Fri, 07 Feb 2025 17:35:33 GMT
Title: GiesKaNe: Bridging Past and Present in Grammatical Theory and Practical Application
Authors: Volker Emmrich
Abstract summary: The article explores the requirements for corpus compilation within the GiesKaNe project. As a historical corpus, GiesKaNe aims to establish connections with both historical and contemporary corpora. The methodological complexity of such a project is managed through a complementary interplay of human expertise and machine-assisted processes.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This article explores the requirements for corpus compilation within the GiesKaNe project (University of Giessen and Kassel, Syntactic Basic Structures of New High German). The project is defined by three central characteristics: it is a reference corpus, a historical corpus, and a syntactically deeply annotated treebank. As a historical corpus, GiesKaNe aims to establish connections with both historical and contemporary corpora, ensuring its relevance across temporal and linguistic contexts. The compilation process strikes the balance between innovation and adherence to standards, addressing both internal project goals and the broader interests of the research community. The methodological complexity of such a project is managed through a complementary interplay of human expertise and machine-assisted processes. The article discusses foundational topics such as tokenization, normalization, sentence definition, tagging, parsing, and inter-annotator agreement, alongside advanced considerations. These include comparisons between grammatical models, annotation schemas, and established de facto annotation standards as well as the integration of human and machine collaboration. Notably, a novel method for machine-assisted classification of texts along the continuum of conceptual orality and literacy is proposed, offering new perspectives on text selection. Furthermore, the article introduces an approach to deriving de facto standard annotations from existing ones, mediating between standardization and innovation. In the course of describing the workflow the article demonstrates that even ambitious projects like GiesKaNe can be effectively implemented using existing research infrastructure, requiring no specialized annotation tools. Instead, it is shown that the workflow can be based on the strategic use of a simple spreadsheet and integrates the capabilities of the existing infrastructure.

Related papers

Loci Similes: A Benchmark for Extracting Intertextualities in Latin Literature [4.132158161225706]
Loci Similes is a benchmark for Latin intertextuality detection comprising a curated dataset of 172k text segments containing 545 expert-verified parallels linking Late Antique authors to a corpus of classical authors. We establish baselines for retrieval and classification of intertextualities with state-of-the-art LLMs.
arXiv Detail & Related papers (2026-01-12T13:34:49Z)
Multilingual corpora for the study of new concepts in the social sciences and humanities: [0.0]
This article presents a hybrid methodology for building a multilingual corpus designed to support the study of emerging concepts in the humanities and social sciences. The corpus relies on two complementary sources: (1) textual content automatically extracted from company websites, cleaned for French and English, and (2) annual reports collected and automatically filtered according to documentary criteria (year, format, duplication). The processing pipeline includes automatic language detection, filtering of non-relevant content, extraction of relevant segments, and enrichment with structural metadata.
arXiv Detail & Related papers (2025-12-08T10:04:50Z)
DISRetrieval: Harnessing Discourse Structure for Long Document Retrieval [51.89673002051528]
DISRetrieval is a novel hierarchical retrieval framework that leverages linguistic discourse structure to enhance long document understanding. Our studies confirm that discourse structure significantly enhances retrieval effectiveness across different document lengths and query types.
arXiv Detail & Related papers (2025-05-26T14:45:12Z)
Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation [20.87296508045343]
We introduce Fuxi, a comprehensive benchmark that evaluates both understanding and generation capabilities across 21 diverse tasks. We reveal significant performance gaps between understanding and generation tasks, with models achieving promising results in comprehension but struggling considerably in generation tasks. Our findings highlight the current limitations in ancient Chinese text processing and provide insights for future model development.
arXiv Detail & Related papers (2025-03-20T04:26:40Z)
Re3: A Holistic Framework and Dataset for Modeling Collaborative Document Revision [62.12545440385489]
We introduce Re3, a framework for joint analysis of collaborative document revision. We present Re3-Sci, a large corpus of aligned scientific paper revisions manually labeled according to their action and intent. We use the new data to provide first empirical insights into collaborative document revision in the academic domain.
arXiv Detail & Related papers (2024-05-31T21:19:09Z)
Specifying Genericity through Inclusiveness and Abstractness Continuous Scales [1.024113475677323]
This paper introduces a novel annotation framework for the fine-grained modeling of Noun Phrases' (NPs) genericity in natural language. The framework is designed to be simple and intuitive, making it accessible to non-expert annotators and suitable for crowd-sourced tasks.
arXiv Detail & Related papers (2024-03-22T15:21:07Z)
BBScore: A Brownian Bridge Based Metric for Assessing Text Coherence [20.507596002357655]
Coherent texts inherently manifest a sequential and cohesive interplay among sentences. BBScore is a reference-free metric grounded in Brownian bridge theory for assessing text coherence.
arXiv Detail & Related papers (2023-12-28T08:34:17Z)
Towards Verifiable Generation: A Benchmark for Knowledge-aware Language Model Attribution [48.86322922826514]
This paper defines a new task of Knowledge-aware Language Model Attribution (KaLMA). First, we extend the attribution source from unstructured texts to Knowledge Graph (KG), whose rich structures benefit both the attribution performance and working scenarios. Second, we propose a new "Conscious Incompetence" setting considering the incomplete knowledge repository. Third, we propose a comprehensive automatic evaluation metric encompassing text quality, citation quality, and text citation alignment.
arXiv Detail & Related papers (2023-10-09T11:45:59Z)
Advancing Topic Segmentation and Outline Generation in Chinese Texts: The Paragraph-level Topic Representation, Corpus, and Benchmark [44.06803331843307]
Paragraph-level topic structure can grasp and understand the overall context of a document from a higher level. The lack of large-scale, high-quality Chinese paragraph-level topic structure corpora has restrained research and applications. We propose a hierarchical paragraph-level topic structure representation with three layers to guide the corpus construction. We employ a two-stage man-machine collaborative annotation method to construct the largest Chinese paragraph-level Topic Structure corpus.
arXiv Detail & Related papers (2023-05-24T06:43:23Z)
An Inclusive Notion of Text [69.36678873492373]
We argue that clarity on the notion of text is crucial for reproducible and generalizable NLP. We introduce a two-tier taxonomy of linguistic and non-linguistic elements that are available in textual sources and can be used in NLP modeling.
arXiv Detail & Related papers (2022-11-10T14:26:43Z)
O-Dang! The Ontology of Dangerous Speech Messages [53.15616413153125]
We present O-Dang!: The Ontology of Dangerous Speech Messages, a systematic and interoperable Knowledge Graph (KG). O-Dang! is designed to gather and organize Italian datasets into a structured KG, according to the principles shared within the Linguistic Linked Open Data community. It provides a model for encoding both gold standard and single-annotator labels in the KG.
arXiv Detail & Related papers (2022-07-13T11:50:05Z)
Target-aware Abstractive Related Work Generation with Contrastive Learning [48.02845973891943]
The related work section is an important component of a scientific paper, which highlights the contribution of the target paper in the context of the reference papers. Most of the existing related work section generation methods rely on extracting off-the-shelf sentences. We propose an abstractive target-aware related work generator (TAG), which can generate related work sections consisting of new sentences.
arXiv Detail & Related papers (2022-05-26T13:20:51Z)
Revise and Resubmit: An Intertextual Model of Text-based Collaboration in Peer Review [52.359007622096684]
Peer review is a key component of the publishing process in most fields of science. Existing NLP studies focus on the analysis of individual texts. Editorial assistance often requires modeling interactions between pairs of texts.
arXiv Detail & Related papers (2022-04-22T16:39:38Z)

This list is automatically generated from the titles and abstracts of the papers on this site.