PoTeC: A German Naturalistic Eye-tracking-while-reading Corpus
- URL: http://arxiv.org/abs/2403.00506v1
- Date: Fri, 1 Mar 2024 13:07:39 GMT
- Title: PoTeC: A German Naturalistic Eye-tracking-while-reading Corpus
- Authors: Deborah N. Jakobi and Thomas Kern and David R. Reich and Patrick
Haller and Lena A. Jäger
- Abstract summary: The Potsdam Textbook Corpus (PoTeC) is a naturalistic eye-tracking-while-reading corpus containing data from 75 participants reading 12 scientific texts.
PoTeC is the first naturalistic eye-tracking-while-reading corpus that contains eye-movements from domain-experts as well as novices in a within-participant manipulation.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Potsdam Textbook Corpus (PoTeC) is a naturalistic
eye-tracking-while-reading corpus containing data from 75 participants reading
12 scientific texts. PoTeC is the first naturalistic eye-tracking-while-reading
corpus that contains eye-movements from domain-experts as well as novices in a
within-participant manipulation: It is based on a 2x2x2 fully-crossed factorial
design which includes the participants' level of study and the participants'
discipline of study as between-subject factors and the text domain as a
within-subject factor. The participants' reading comprehension was assessed by
a series of text comprehension questions and their domain knowledge was tested
by text-independent background questions for each of the texts. The materials
are annotated for a variety of linguistic features at different levels. We
envision PoTeC to be used for a wide range of studies including but not limited
to analyses of expert and non-expert reading strategies. The corpus and all the
accompanying data at all stages of the preprocessing pipeline and all code used
to preprocess the data are made available via GitHub:
https://github.com/DiLi-Lab/PoTeC.
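The 2x2x2 fully-crossed design described in the abstract can be sketched as follows. This is a minimal illustration only: the factor names are paraphrased from the abstract, and the level labels (beginner/advanced, biology/physics) are assumptions for demonstration, not taken from the paper.

```python
from itertools import product

# Sketch of PoTeC's 2x2x2 fully-crossed factorial design.
# Level labels below are illustrative assumptions, not from the abstract.
levels_of_study = ["beginner", "advanced"]  # between-subject factor
disciplines = ["biology", "physics"]        # between-subject factor
text_domains = ["biology", "physics"]       # within-subject factor

# Fully crossing the three two-level factors yields 8 design cells.
cells = list(product(levels_of_study, disciplines, text_domains))
assert len(cells) == 8

def is_expert(discipline: str, text_domain: str) -> bool:
    # A participant reads a text as a domain expert when their
    # discipline of study matches the text's domain.
    return discipline == text_domain

for level, disc, domain in cells:
    role = "expert" if is_expert(disc, domain) else "novice"
    print(f"{level:8s} {disc:8s} reading {domain:8s} -> {role}")
```

Because the text domain is a within-subject factor, every participant reads texts both inside and outside their own discipline, which is what enables the expert-vs-novice comparison within the same participant.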
Related papers
- EMTeC: A Corpus of Eye Movements on Machine-Generated Texts [2.17025619726098]
The Eye Movements on Machine-Generated Texts Corpus (EMTeC) is a naturalistic eye-movements-while-reading corpus of 107 native English speakers reading machine-generated texts.
EMTeC provides the eye movement data at all stages of preprocessing, i.e., the raw coordinate data sampled at 2000 Hz, the fixation sequences, and the reading measures.
arXiv Detail & Related papers (2024-08-08T08:00:45Z)
- Interpreting Themes from Educational Stories [9.608135094187912]
We introduce the first dataset specifically designed for interpretive comprehension of educational narratives.
The dataset spans a variety of genres and cultural origins and includes human-annotated theme keywords.
We formulate NLP tasks under different abstractions of interpretive comprehension toward the main idea of a story.
arXiv Detail & Related papers (2024-04-08T07:26:27Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Cloning Ideology and Style using Deep Learning [0.0]
This research focuses on generating text in the ideology and style of a specific author, including text on topics the author never wrote about.
A Bi-LSTM model makes predictions at the character level; during training, the corpus of the specific author is used along with a ground-truth corpus.
A pre-trained model identifies ground-truth sentences that contradict the author's corpus, so that the language model can be biased toward the author's ideology.
arXiv Detail & Related papers (2022-10-25T11:37:19Z)
- Contextual Text Block Detection towards Scene Text Understanding [85.40898487745272]
This paper presents contextual text detection, a new setup that detects contextual text blocks (CTBs) for better understanding of texts in scenes.
We formulate the new setup as a dual detection task which first detects integral text units and then groups them into a CTB.
To this end, we design a novel scene text clustering technique that treats integral text units as tokens and groups them (belonging to the same CTB) into an ordered token sequence.
arXiv Detail & Related papers (2022-07-26T14:59:25Z)
- RuArg-2022: Argument Mining Evaluation [69.87149207721035]
This paper is the organizers' report on the first competition of argumentation analysis systems dealing with Russian-language texts.
A corpus containing 9,550 sentences (comments on social media posts) on three topics related to the COVID-19 pandemic was prepared.
The system that won the first place in both tasks used the NLI (Natural Language Inference) variant of the BERT architecture.
arXiv Detail & Related papers (2022-06-18T17:13:37Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- SCROLLS: Standardized CompaRison Over Long Language Sequences [62.574959194373264]
We introduce SCROLLS, a suite of tasks that require reasoning over long texts.
SCROLLS contains summarization, question answering, and natural language inference tasks.
We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
arXiv Detail & Related papers (2022-01-10T18:47:15Z)
- Quasi Error-free Text Classification and Authorship Recognition in a large Corpus of English Literature based on a Novel Feature Set [0.0]
We show that quasi error-free text classification and authorship recognition are possible across the entire GLEC with a method using the same set of five style and five content features.
Our data pave the way for many future computational and empirical studies of literature or experiments in reading psychology.
arXiv Detail & Related papers (2020-10-21T07:39:55Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.