A Novel Corpus of Discourse Structure in Humans and Computers
- URL: http://arxiv.org/abs/2111.05940v1
- Date: Wed, 10 Nov 2021 20:56:08 GMT
- Title: A Novel Corpus of Discourse Structure in Humans and Computers
- Authors: Babak Hemmatian, Sheridan Feucht, Rachel Avram, Alexander Wey, Muskaan
Garg, Kate Spitalnic, Carsten Eickhoff, Ellie Pavlick, Bjorn Sandstede,
Steven Sloman
- Abstract summary: We present a novel corpus of 445 human- and computer-generated documents, comprising about 27,000 clauses.
The corpus covers both formal and informal discourse, and contains documents generated using fine-tuned GPT-2.
- Score: 55.74664144248097
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a novel corpus of 445 human- and computer-generated documents,
comprising about 27,000 clauses, annotated for semantic clause types and
coherence relations that allow for nuanced comparison of artificial and natural
discourse modes. The corpus covers both formal and informal discourse, and
contains documents generated using fine-tuned GPT-2 (Zellers et al., 2019) and
GPT-3 (Brown et al., 2020). We showcase the usefulness of this corpus for
detailed discourse analysis of text generation by providing preliminary
evidence that less numerous, shorter and more often incoherent clause relations
are associated with lower perceived quality of computer-generated narratives
and arguments.
Related papers
- The Knesset Corpus: An Annotated Corpus of Hebrew Parliamentary Proceedings [3.2405928866433067]
We present the Knesset Corpus, a corpus of Hebrew parliamentary proceedings from 1998 to 2022.
We show that the corpus can be used to examine historical developments in the style of political discussions.
We also investigate some differences between the styles of men and women speakers.
arXiv Detail & Related papers (2024-05-28T12:23:39Z)
- Quantifying the redundancy between prosody and text [67.07817268372743]
We use large language models to estimate how much information is redundant between prosody and the words themselves.
We find a high degree of redundancy between the information carried by the words and prosodic information across several prosodic features.
Still, we observe that prosodic features cannot be fully predicted from text, suggesting that prosody carries information above and beyond the words.
arXiv Detail & Related papers (2023-11-28T21:15:24Z) - Synthetically generated text for supervised text analysis [5.71097144710995]
I provide a conceptual overview of text generation, guidance on when researchers should prefer different techniques for generating synthetic text, a discussion of ethics, and a simple technique for improving the quality of synthetic text.
I demonstrate the usefulness of synthetic text with three applications: generating synthetic tweets describing the fighting in Ukraine, synthetic news articles describing specified political events for training an event detection system, and a multilingual corpus of populist manifesto statements for training a sentence-level populism classifier.
arXiv Detail & Related papers (2023-03-28T14:55:13Z) - An Inclusive Notion of Text [69.36678873492373]
We argue that clarity on the notion of text is crucial for reproducible and generalizable NLP.
We introduce a two-tier taxonomy of linguistic and non-linguistic elements that are available in textual sources and can be used in NLP modeling.
arXiv Detail & Related papers (2022-11-10T14:26:43Z) - Discourse Analysis for Evaluating Coherence in Video Paragraph Captions [99.37090317971312]
We explore a novel discourse-based framework to evaluate the coherence of video paragraphs.
Central to our approach is the discourse representation of videos, which helps in modeling coherence of paragraphs conditioned on coherence of videos.
Our experimental results show that the proposed framework evaluates coherence of video paragraphs significantly better than all the baseline methods.
arXiv Detail & Related papers (2022-01-17T04:23:08Z) - How much do language models copy from their training data? Evaluating
linguistic novelty in text generation using RAVEN [63.79300884115027]
Current language models can generate high-quality text.
Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions?
We introduce RAVEN, a suite of analyses for assessing the novelty of generated text.
arXiv Detail & Related papers (2021-11-18T04:07:09Z) - Persian Rhetorical Structure Theory [2.610470075814367]
We present a discourse-annotated corpus for the Persian language built in the framework of Rhetorical Structure Theory.
Our corpus consists of 150 journalistic texts, each text having an average of around 400 words.
Our text-level discourse parser is trained using gold segmentation and is built upon the DPLP discourse parser.
arXiv Detail & Related papers (2021-06-25T18:15:47Z) - Neural Deepfake Detection with Factual Structure of Text [78.30080218908849]
We propose a graph-based model for deepfake detection of text.
Our approach represents the factual structure of a given document as an entity graph.
Our model can distinguish the difference in the factual structure between machine-generated text and human-written text.
arXiv Detail & Related papers (2020-10-15T02:35:31Z) - A frame semantics based approach to comparative study of digitized
corpus [0.0]
The paper focuses on the morphological, syntactic, and semantic annotation process of an English-Arabic aligned corpus created from digitized novels.
The present study argues that differences in the conceptualization of motion events across languages can be described with frame structure and frame-to-frame relations.
arXiv Detail & Related papers (2020-05-29T22:56:25Z) - The Discussion Tracker Corpus of Collaborative Argumentation [2.800857580710507]
The Discussion Tracker corpus was collected in American high school English classes.
The corpus consists of 29 multi-party discussions of English literature transcribed from 985 minutes of audio.
arXiv Detail & Related papers (2020-05-22T18:27:28Z)